Information-Theory Interpretation of the Skip-Gram Negative-Sampling Objective Function

In this paper we define a measure of dependency between two random variables, based on the Jensen-Shannon (JS) divergence between their joint distribution and the product of their marginal distributions. Then, we show that word2vec’s skip-gram with negative sampling embedding algorithm finds the optimal low-dimensional approximation of this JS dependency measure between the words and their contexts. The gap between the optimal score and the low-dimensional approximation is demonstrated on a standard text corpus.


Introduction
Continuous word representations, derived from unlabeled text, have proven useful in many NLP tasks. Such word representations (or embeddings) associate a low-dimensional, real-valued vector with each word, typically induced via neural language models or matrix factorization.
Substantial benefit arises when embeddings can be efficiently trained on large volumes of data. Hence the recent considerable interest in the continuous bag-of-words (CBOW) and skip-gram with negative sampling (SGNS) models, described in (Mikolov et al., 2013), as implemented in the open-source toolkit word2vec. These models are based on a relatively simple log-linear method and avoid the hidden layers typical of neural networks. Consequently, they can be trained to produce high-quality word embeddings on large corpora, such as the entirety of English Wikipedia, in several hours, compared to days or even weeks for other continuous models. Recent studies obtained state-of-the-art results by using skip-gram embeddings on a variety of natural language processing tasks, such as named entity extraction (Passos et al., 2014) and dependency parsing (Bansal et al., 2014). In recent years, there have been several attempts to mathematically interpret word embedding models (Arora et al., 2016; Pennington et al., 2014; Stratos et al., 2015). Our study pursues this established line of work, attempting to explain the objective function of the SGNS word embedding algorithm.
In the SGNS model, the energy function takes the form of a dot product between the vectors of an observed word and an observed context. The objective function is a binary logistic regression classifier that treats a word and its observed context as a positive example, and a word and a randomly sampled context as a negative example. Levy and Goldberg (2014) offered a motivation for this function by showing that it obtains its global maximum value at the word-context pointwise mutual information (PMI) matrix. In this study, we take their analysis one step further and provide an information-theoretic interpretation of the SGNS objective function. In Section 2, we define a new measure of mutual information between random variables based on the Jensen-Shannon divergence (Lin, 1991) instead of the KL divergence. In Section 3, we show that the value of the SGNS objective computed at the PMI matrix is this information measure. We then derive an explicit expression for the information loss caused by the low-dimensional embedding learned by the SGNS algorithm. Finally, in Section 4, we illustrate this by computing the information loss caused by actual SGNS embeddings learned on a standard text corpus.
There are several standard methods of measuring the distance between two discrete probability distributions, defined on a given finite set $A$. The Kullback-Leibler (KL) divergence of a distribution $p$ from a distribution $q$ is defined as follows:
$$\mathrm{KL}(p \,\|\, q) = \sum_{i \in A} p_i \log \frac{p_i}{q_i}.$$
The mutual information between two jointly distributed random variables $X$ and $Y$ is defined as the KL divergence of the joint distribution $p(x, y)$ from the product $p(x)p(y)$ of the marginal distributions of $X$ and $Y$, i.e. $I(X; Y) = \mathrm{KL}(p(x, y) \,\|\, p(x)p(y))$.
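As a concrete illustration of these two definitions, the following minimal Python sketch computes the KL divergence and the mutual information of a small joint distribution. The function names and the two toy 2×2 tables are assumptions made for illustration only, and natural logarithms are used.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(joint):
    """I(X;Y) = KL(p(x,y) || p(x)p(y)); `joint` is a list of rows."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    flat_joint = [v for row in joint for v in row]
    flat_product = [a * b for a in px for b in py]  # same row-major order
    return kl(flat_joint, flat_product)

# Hypothetical toy distributions over two binary variables.
dependent = [[0.4, 0.1],
             [0.1, 0.4]]
independent = [[0.25, 0.25],
               [0.25, 0.25]]
mi_dependent = mutual_information(dependent)      # roughly 0.193 nats
mi_independent = mutual_information(independent)  # 0: the joint factorizes
```

The independent table gives zero because its joint distribution equals the product of its marginals, which is exactly the case in which the KL divergence vanishes.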
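We next propose a new measure for mutual information using the JS divergence between $p(x, y)$ and $p(x)p(y)$ instead of the KL divergence. The JS divergence (Lin, 1991) between distributions $p$ and $q$, with weight $0 < \alpha < 1$, is:
$$\mathrm{JS}_\alpha(p, q) = \alpha\,\mathrm{KL}(p \,\|\, r) + (1-\alpha)\,\mathrm{KL}(q \,\|\, r) = H(r) - \alpha H(p) - (1-\alpha) H(q) \tag{1}$$
where $r = \alpha p + (1-\alpha) q$ and $H$ is the entropy function. We define the Jensen-Shannon Mutual Information (JSMI) as follows:
$$\mathrm{JSMI}_\alpha(X, Y) = \mathrm{JS}_\alpha\big(p(x, y),\, p(x)p(y)\big). \tag{2}$$
It can be easily verified that $X$ and $Y$ are independent if and only if $\mathrm{JSMI}_\alpha(X, Y) = 0$.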
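The JSMI measure can be sketched numerically as follows, assuming the weighted JS divergence of Lin (1991), $\mathrm{JS}_\alpha(p, q) = \alpha\,\mathrm{KL}(p\|r) + (1-\alpha)\,\mathrm{KL}(q\|r)$ with mixture $r = \alpha p + (1-\alpha) q$, and its equivalent entropy form. All function names and the toy tables are hypothetical.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q, alpha):
    """Weighted JS divergence via the mixture r = alpha*p + (1-alpha)*q."""
    r = [alpha * pi + (1 - alpha) * qi for pi, qi in zip(p, q)]
    return alpha * kl(p, r) + (1 - alpha) * kl(q, r)

def js_entropy_form(p, q, alpha):
    """The same divergence written as H(r) - alpha*H(p) - (1-alpha)*H(q)."""
    r = [alpha * pi + (1 - alpha) * qi for pi, qi in zip(p, q)]
    return entropy(r) - alpha * entropy(p) - (1 - alpha) * entropy(q)

def jsmi(joint, alpha):
    """JS divergence between the joint and the product of its marginals."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    p = [v for row in joint for v in row]
    q = [a * b for a in px for b in py]
    return js(p, q, alpha)

dependent = [[0.4, 0.1],
             [0.1, 0.4]]
independent = [[0.25, 0.25],
               [0.25, 0.25]]
jsmi_dependent = jsmi(dependent, alpha=0.5)      # roughly 0.0507 nats
jsmi_independent = jsmi(independent, alpha=0.5)  # 0 for independent variables
```

Unlike the KL-based mutual information, this quantity stays finite even when the joint distribution contains zeros, because every term is measured against the mixture $r$.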
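We next derive an alternative definition of the JSMI dependency measure. Assume we choose between the two distributions, $p(x, y)$ and the product of marginal distributions $p(x)p(y)$, according to a binary random variable $Z$, such that $p(Z = 1) = \alpha$. We first sample a binary value for $Z$ and next, we sample a random variable $W$ as follows:
$$p(W = (x, y) \mid Z = z) = \begin{cases} p(x, y) & \text{if } z = 1 \\ p(x)p(y) & \text{if } z = 0 \end{cases} \tag{3}$$
The divergence measure $\mathrm{JSMI}_\alpha(X, Y)$ can be alternatively defined in terms of the mutual information between $W$ and $Z$. The mutual information between $W$ and $Z$ is:
$$I(W; Z) = H(W) - H(W \mid Z) = H(W) - \alpha H(W \mid Z = 1) - (1-\alpha) H(W \mid Z = 0).$$
Eq. (1) thus implies that:
$$\mathrm{JSMI}_\alpha(X, Y) = I(W; Z) = H(Z) - H(Z \mid W). \tag{4}$$
Applying Bayes' rule we obtain:
$$p(Z = 1 \mid W = (x, y)) = \frac{\alpha\, p(x, y)}{\alpha\, p(x, y) + (1-\alpha)\, p(x)p(y)} = \sigma(\mathrm{pmi}_{x,y}) \tag{5}$$
where $\sigma(u) = \frac{1}{1 + e^{-u}}$ is the sigmoid function and
$$\mathrm{pmi}_{x,y} = \log \frac{p(x, y)}{p(x)p(y)} - \log \frac{1-\alpha}{\alpha} \tag{6}$$
is a shifted version of the PMI function. Equations (4) and (5) imply that:
$$\mathrm{JSMI}_\alpha(X, Y) = h(\alpha) + \alpha \sum_{x,y} p(x, y) \log \sigma(\mathrm{pmi}_{x,y}) + (1-\alpha) \sum_{x,y} p(x)p(y) \log \sigma(-\mathrm{pmi}_{x,y}) \tag{7}$$
where $h(\alpha) = -\alpha \log \alpha - (1-\alpha) \log(1-\alpha)$ is the binary entropy function.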
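The two steps of this derivation can be checked numerically. The sketch below assumes the shifted PMI is $\log\frac{p(x,y)}{p(x)p(y)} - \log\frac{1-\alpha}{\alpha}$ and compares (i) the Bayes posterior of $Z$ with the sigmoid of the shifted PMI, and (ii) the sigmoid-based expression for JSMI with the mixture-based JS divergence. The toy table, the choice $\alpha = 0.25$, and the variable names are hypothetical.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def binary_entropy(a):
    return -a * math.log(a) - (1 - a) * math.log(1 - a)

joint = [[0.4, 0.1],
         [0.1, 0.4]]  # hypothetical p(x, y), all entries positive
alpha = 0.25
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]

jsmi_sigmoid_form = binary_entropy(alpha)
jsmi_mixture_form = 0.0
max_posterior_gap = 0.0
for x, row in enumerate(joint):
    for y, pxy in enumerate(row):
        q = px[x] * py[y]
        r = alpha * pxy + (1 - alpha) * q      # marginal p(W = (x, y))
        shifted_pmi = math.log(pxy / q) - math.log((1 - alpha) / alpha)
        posterior = alpha * pxy / r            # Bayes' rule for p(Z = 1 | W)
        max_posterior_gap = max(max_posterior_gap,
                                abs(posterior - sigmoid(shifted_pmi)))
        # Sigmoid-based expression: h(alpha) plus two weighted log-sigmoid sums.
        jsmi_sigmoid_form += alpha * pxy * math.log(sigmoid(shifted_pmi))
        jsmi_sigmoid_form += (1 - alpha) * q * math.log(sigmoid(-shifted_pmi))
        # Direct definition: alpha*KL(p || r) + (1 - alpha)*KL(q || r).
        jsmi_mixture_form += alpha * pxy * math.log(pxy / r)
        jsmi_mixture_form += (1 - alpha) * q * math.log(q / r)
```

Both `max_posterior_gap` and the difference between the two JSMI values should vanish up to floating-point error.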

The Skip-Gram Embedding Algorithm
The SGNS embedding algorithm (Mikolov et al., 2013) represents each word $x$ and each context $y$ as $d$-dimensional vectors $\vec{x}$ and $\vec{y}$, with the purpose that words that are "similar" to each other will have similar vector representations. We can represent a given $d$-dimensional embedding by a matrix $m$, such that $m(x, y) = \vec{x} \cdot \vec{y}$. The rank of the embedding matrix $m$ is (at most) $d$.
Let $p(x, y)$ be the normalized number of co-occurrences of word $x$ and context-word $y$ in a given corpus and let $p(x)$ and $p(y)$ be the corresponding unigram distributions. Consider a binary classifier that treats a word and its observed context as a positive example, and a word and a randomly sampled context as a negative example. The classification is made based on the embedding in such a way that the probability that $(x, y)$ is a positive example is $\sigma(\vec{x} \cdot \vec{y})$. The objective function ideally maximized by the SGNS word embedding algorithm is the expectation of the log-likelihood function of the embedding:
$$S(m) = h\!\left(\tfrac{1}{k+1}\right) + \frac{1}{k+1} \sum_{x,y} p(x, y) \log \sigma(\vec{x} \cdot \vec{y}) + \frac{k}{k+1} \sum_{x,y} p(x)p(y) \log \sigma(-\vec{x} \cdot \vec{y}) \tag{8}$$
Note that the term $h(\frac{1}{k+1})$, which does not appear in the original SGNS objective function (Mikolov et al., 2013), is a constant that was added here to simplify the following presentation.
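A minimal sketch of this expected objective, assuming it consists of the constant $h(\frac{1}{k+1})$, a positive term weighted by $\frac{1}{k+1}$ over $p(x, y)$, and a negative term weighted by $\frac{k}{k+1}$ over $p(x)p(y)$. The function names, the toy joint distribution, and the one-dimensional embeddings are hypothetical.

```python
import math

def log_sigmoid(u):
    """Numerically stable log(sigmoid(u))."""
    if u >= 0:
        return -math.log1p(math.exp(-u))
    return u - math.log1p(math.exp(u))

def binary_entropy(a):
    return -a * math.log(a) - (1 - a) * math.log(1 - a)

def sgns_score(joint, word_vecs, ctx_vecs, k):
    """Expected SGNS objective with m[x][y] = <word_vecs[x], ctx_vecs[y]>."""
    alpha = 1.0 / (k + 1)
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    score = binary_entropy(alpha)
    for x, wx in enumerate(word_vecs):
        for y, cy in enumerate(ctx_vecs):
            m_xy = sum(a * b for a, b in zip(wx, cy))
            score += alpha * joint[x][y] * log_sigmoid(m_xy)             # positive term
            score += (1 - alpha) * px[x] * py[y] * log_sigmoid(-m_xy)    # negative term
    return score

joint = [[0.4, 0.1],
         [0.1, 0.4]]
# Zero vectors: every pair gets probability 1/2 of being a positive example.
zero = [[0.0], [0.0]]
score_zero = sgns_score(joint, zero, zero, k=1)
# One-dimensional vectors giving m = [[0.5, -0.5], [-0.5, 0.5]].
words = [[0.5], [-0.5]]
contexts = [[1.0], [-1.0]]
score_signed = sgns_score(joint, words, contexts, k=1)
```

With $k = 1$ the all-zero embedding scores $h(\frac{1}{2}) - \log 2 = 0$, while the second embedding, whose sign pattern matches the dependency structure of the toy table, scores higher.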
The sparsity of p(x, y) (which is obtained as normalized counts from a given learning corpus) makes it feasible to compute the second term of (8).
The number of summed-over elements in the third term of (8), however, is quadratic in the size of the vocabulary, making it hard to compute. Therefore, in practice, the expectation is approximated by sampling 'negative' examples. The actual SGNS score, then, is:
$$S(m) \approx h\!\left(\tfrac{1}{k+1}\right) + \frac{1}{(k+1)\,T} \sum_{t=1}^{T} \Big( \log \sigma(\vec{x}_t \cdot \vec{y}_t) + \sum_{i=1}^{k} \log \sigma(-\vec{x}_t \cdot \vec{y}_{t_i}) \Big) \tag{9}$$
such that $t$ goes over all $T$ word-context pairs in a given corpus. The negative examples $\vec{y}_{t_i}$ are created for each pair $(x_t, y_t)$ by drawing $k$ random contexts from the context-word distribution $p(y)$.
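The relation between the sampled score and the exact expectation can be sketched as follows. The normalization used here, dividing the summed log-likelihood by $(k+1)$ times the number of corpus pairs and adding the constant $h(\frac{1}{k+1})$, is an assumption chosen so that the estimate is on the same scale as the exact score. The toy corpus, the matrix `m`, and all function names are hypothetical.

```python
import math
import random

def log_sigmoid(u):
    """Numerically stable log(sigmoid(u))."""
    if u >= 0:
        return -math.log1p(math.exp(-u))
    return u - math.log1p(math.exp(u))

def binary_entropy(a):
    return -a * math.log(a) - (1 - a) * math.log(1 - a)

def exact_score(joint, m, k):
    """Exact expectation: sums over every (word, context) combination."""
    alpha = 1.0 / (k + 1)
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    score = binary_entropy(alpha)
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            score += alpha * pxy * log_sigmoid(m[x][y])
            score += (1 - alpha) * px[x] * py[y] * log_sigmoid(-m[x][y])
    return score

def sampled_score(pairs, m, context_probs, k, rng):
    """Negative-sampling estimate computed from corpus pairs."""
    contexts = list(range(len(context_probs)))
    total = 0.0
    for x, y in pairs:
        total += log_sigmoid(m[x][y])              # observed (positive) pair
        for y_neg in rng.choices(contexts, weights=context_probs, k=k):
            total += log_sigmoid(-m[x][y_neg])     # sampled (negative) context
    return binary_entropy(1.0 / (k + 1)) + total / ((k + 1) * len(pairs))

# Hypothetical toy corpus of 2000 (word, context) index pairs.
counts = [[800, 200],
          [200, 800]]
pairs = [(x, y) for x, row in enumerate(counts)
         for y, n in enumerate(row) for _ in range(n)]
joint = [[n / len(pairs) for n in row] for row in counts]
context_probs = [sum(col) for col in zip(*joint)]
m = [[0.5, -0.5],
     [-0.5, 0.5]]
k = 5
exact = exact_score(joint, m, k)
estimate = sampled_score(pairs, m, context_probs, k, random.Random(0))
```

The positive part of the estimate matches the exact value because the joint distribution is built from the same pairs; only the negative part carries sampling noise, which shrinks as the corpus or $k$ grows.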
As pointed out in (Levy et al., 2015), k has two distinct functions in the SGNS objective function. First, it is used to better estimate the distribution of negative examples. Second, it is used as a weight on the probability of observing a positive example versus a negative example; a higher k means that negative examples are more probable.
We can compute the SGNS score function $S(m)$ for every real-valued matrix $m = (m_{x,y})$. Levy and Goldberg (2014) showed that the function achieves its global maximal value when for each word-context pair $(x, y)$ the inner product of the embedding vectors $\vec{x} \cdot \vec{y}$ is equal to $\mathrm{pmi}_{x,y}$. In other words, they showed that $S(m) \le S(\mathrm{pmi})$ for every matrix $m$. We next show that the value of the function $S(m)$ at its maximum point, the PMI matrix, has a concrete interpretation, namely it is exactly the Jensen-Shannon Mutual Information (JSMI) between words and their contexts.
Theorem 1: The value of the SGNS score with $k$ negative samples (8) at the PMI matrix satisfies:
$$S(\mathrm{pmi}) = \mathrm{JSMI}_\alpha(X, Y)$$
such that $\alpha = \frac{1}{k+1}$.
Proof: It can be easily verified that by substituting $\alpha = \frac{1}{k+1}$ in the definition of JSMI (Eq. (7)), we exactly obtain the SGNS score (8) at the PMI matrix. For this value of $\alpha$, the shift applied to the PMI function is $\log \frac{1-\alpha}{\alpha} = \log k$. □
Levy and Goldberg (2014) showed that SGNS's objective achieves its maximal value at the PMI matrix. However, this result reveals nothing about the more interesting lower-dimensional case, where the PMI matrix factorization is forced to compress the joint distribution and thereby learn a meaningful embedding. We next derive an explicit description of the approximation criterion that quantifies the gap between $S(m)$ and $S(\mathrm{pmi})$.
Given the word co-occurrence joint distribution $p(x, y)$, we obtained in Eq. (5) a conditional distribution on the alphabet of $(Z, W)$ as follows:
$$p(Z = 1 \mid W = (x, y)) = \sigma(\mathrm{pmi}_{x,y}).$$
In a similar way, given any matrix $m$, we can define a conditional distribution $p_m$ on the alphabet of $(Z, W)$ as follows:
$$p_m(Z = 1 \mid W = (x, y)) = \sigma(m_{x,y}).$$
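Note that in the special case where $m$ is the PMI matrix, $p_{\mathrm{pmi}}(z \mid w)$ coincides with the original $p(z \mid w)$ that was defined in Eq. (5).
Theorem 2: The difference between the SGNS score at the PMI matrix and the SGNS score at a given matrix $m$ can be written as:
$$S(\mathrm{pmi}) - S(m) = \mathrm{KL}\big(p(Z \mid W) \,\|\, p_m(Z \mid W)\big) \tag{10}$$
where the right-hand side is the conditional KL divergence $\sum_{x,y} p(W = (x, y))\, \mathrm{KL}\big(p(Z \mid x, y) \,\|\, p_m(Z \mid x, y)\big)$.
Proof: Let $\alpha = \frac{1}{k+1}$. Since $p(Z = 1, W = (x, y)) = \alpha\, p(x, y)$ and $p(Z = 0, W = (x, y)) = (1-\alpha)\, p(x)p(y)$, the definition (8) gives
$$S(\mathrm{pmi}) - S(m) = \sum_{x,y} \left( \alpha\, p(x, y) \log \frac{p(Z = 1 \mid x, y)}{p_m(Z = 1 \mid x, y)} + (1-\alpha)\, p(x)p(y) \log \frac{p(Z = 0 \mid x, y)}{p_m(Z = 0 \mid x, y)} \right),$$
which is the conditional KL divergence stated above. □
The KL divergence between two distributions is always non-negative and is zero only if the two distributions are the same. Therefore, we rederive the result of Levy and Goldberg (2014) that $S(\mathrm{pmi}) = \max_m S(m)$. Theorem 2 can be viewed as an instance of the well-known connection between maximizing log-likelihood and minimizing the KL divergence between the estimated and the true data-generating distribution. In this case, the true distribution is the PMI-based classifier $p_{\mathrm{pmi}}(Z \mid W)$.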
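The identity stated in Theorem 2 can be checked numerically on a small example. The sketch below assumes the score consists of $h(\alpha)$ plus the $\alpha$-weighted positive term and the $(1-\alpha)$-weighted negative term with $\alpha = \frac{1}{k+1}$, and that the shifted PMI is $\log\frac{p(x,y)}{p(x)p(y)} - \log k$. The 3×3 joint distribution, the random matrix `m`, and all names are hypothetical.

```python
import math
import random

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def marginals(joint):
    return [sum(row) for row in joint], [sum(col) for col in zip(*joint)]

def score(joint, m, k):
    alpha = 1.0 / (k + 1)
    px, py = marginals(joint)
    total = -alpha * math.log(alpha) - (1 - alpha) * math.log(1 - alpha)
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            total += alpha * pxy * math.log(sigmoid(m[x][y]))
            total += (1 - alpha) * px[x] * py[y] * math.log(sigmoid(-m[x][y]))
    return total

def conditional_kl(joint, m, k):
    """Sum over w = (x, y) of p(w) * KL(p(Z | w) || p_m(Z | w))."""
    alpha = 1.0 / (k + 1)
    px, py = marginals(joint)
    total = 0.0
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            p_w = alpha * pxy + (1 - alpha) * px[x] * py[y]  # p(W = (x, y))
            true_pos = alpha * pxy / p_w                     # p(Z = 1 | W)
            model_pos = sigmoid(m[x][y])                     # p_m(Z = 1 | W)
            total += p_w * true_pos * math.log(true_pos / model_pos)
            total += p_w * (1 - true_pos) * math.log((1 - true_pos) / (1 - model_pos))
    return total

joint = [[0.30, 0.05, 0.05],
         [0.05, 0.20, 0.05],
         [0.05, 0.05, 0.20]]
k = 5
px, py = marginals(joint)
pmi = [[math.log(joint[x][y] / (px[x] * py[y])) - math.log(k) for y in range(3)]
       for x in range(3)]
rng = random.Random(0)
m = [[rng.uniform(-2.0, 2.0) for _ in range(3)] for _ in range(3)]

gap = score(joint, pmi, k) - score(joint, m, k)
kl_term = conditional_kl(joint, m, k)
```

Here `gap` and `kl_term` should agree up to floating-point error, and both are non-negative for any choice of `m`.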
Combining Theorems 1 and 2 we obtain that $S(m) \le \mathrm{JSMI}_\alpha(X, Y)$ for every low-dimensional embedding matrix $m$. The difference $\mathrm{JSMI}_\alpha(X, Y) - S(m)$ is the information loss caused by the low-dimensional embedding. We can view it as a Jensen-Shannon variant of the information bottleneck principle (Tishby et al., 1999; Globerson et al., 2007), which is itself defined in terms of the KL divergence. The optimal $d$-dimensional embedding is the best $d$-dimensional approximation of the JSMI dependency measure in the sense that it minimizes the information loss. The JSMI is an upper bound on the score that any embedding can obtain. To illustrate that, in the next section we compute the JSMI between words and their contexts based on a standard text corpus and show the information gap between the JSMI and the actual SGNS score as a function of the embedding dimension $d$.
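The information loss of a rank-constrained embedding can be illustrated on a toy distribution. The following sketch fits $m = UV^\top$ by plain full-batch gradient ascent on the exact expected objective; this optimizer is a stand-in chosen for brevity and is not the stochastic procedure used by word2vec. The joint distribution, step size, number of steps, and function names are all hypothetical.

```python
import math
import random

def log_sigmoid(u):
    """Numerically stable log(sigmoid(u))."""
    if u >= 0:
        return -math.log1p(math.exp(-u))
    return u - math.log1p(math.exp(u))

def sigmoid(u):
    return math.exp(log_sigmoid(u))

def marginals(joint):
    return [sum(row) for row in joint], [sum(col) for col in zip(*joint)]

def score(joint, m, k):
    """Entropy constant plus weighted positive and negative terms."""
    alpha = 1.0 / (k + 1)
    px, py = marginals(joint)
    total = -alpha * math.log(alpha) - (1 - alpha) * math.log(1 - alpha)
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            total += alpha * pxy * log_sigmoid(m[x][y])
            total += (1 - alpha) * px[x] * py[y] * log_sigmoid(-m[x][y])
    return total

def fit_low_rank(joint, d, k, steps=2000, lr=0.5, seed=0):
    """Gradient ascent over factors U, V with m = U V^T (rank at most d)."""
    rng = random.Random(seed)
    alpha = 1.0 / (k + 1)
    px, py = marginals(joint)
    rows, cols = range(len(joint)), range(len(joint[0]))
    U = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in rows]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in cols]
    dot = lambda a, b: sum(s * t for s, t in zip(a, b))
    for _ in range(steps):
        # dS/dm_xy = alpha p(x,y) sigma(-m_xy) - (1 - alpha) p(x)p(y) sigma(m_xy)
        G = [[alpha * joint[x][y] * sigmoid(-dot(U[x], V[y]))
              - (1 - alpha) * px[x] * py[y] * sigmoid(dot(U[x], V[y]))
              for y in cols] for x in rows]
        new_U = [[U[x][j] + lr * sum(G[x][y] * V[y][j] for y in cols)
                  for j in range(d)] for x in rows]
        V = [[V[y][j] + lr * sum(G[x][y] * U[x][j] for x in rows)
              for j in range(d)] for y in cols]
        U = new_U
    return [[dot(U[x], V[y]) for y in cols] for x in rows]

joint = [[0.30, 0.05, 0.05],
         [0.05, 0.20, 0.05],
         [0.05, 0.05, 0.20]]
k = 5
px, py = marginals(joint)
pmi = [[math.log(joint[x][y] / (px[x] * py[y])) - math.log(k) for y in range(3)]
       for x in range(3)]
optimum = score(joint, pmi, k)  # score at the shifted PMI matrix
gaps = {d: optimum - score(joint, fit_low_rank(joint, d, k), k) for d in (1, 2, 3)}
```

Each entry of `gaps` is the information loss of the fitted rank-$d$ matrix. Whatever the optimizer does, the loss cannot be negative, since no matrix can score above the shifted PMI matrix.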
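From Theorem 2 we can also derive an explicit information-theoretic interpretation of the score function $S(m)$ (8) as the difference between two KL-divergence terms:
$$S(m) = \mathrm{KL}\big(p(W, Z) \,\|\, p(W)p(Z)\big) - \mathrm{KL}\big(p(Z \mid W) \,\|\, p_m(Z \mid W)\big).$$
The first term is the mutual information $I(W; Z) = \mathrm{JSMI}_\alpha(X, Y)$, and the second term is the KL divergence between the true and the model-based conditional distributions of $Z$, averaged over $W$.
The word embedding problem can also be viewed as a factorization of the PMI matrix. Previous works suggested other criteria for matrix factorization such as least-squares (Eckart and Young, 1936) and KL divergence between the original matrix and the low-rank matrix approximation (Lee and Seung, 2000). We have shown that the SGNS algorithm factorizes the PMI matrix based on the JSMI-based criterion stated in Eq. (10).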

Experiments
In this section we use word2vec to train real skip-gram with negative sampling (SGNS) embedding models. By measuring the value of their objective function and comparing it against the optimal one using exact PMI values, we demonstrate how a well-trained model minimizes the difference in Eq. (10). We note that this is an intrinsic measure that does not necessarily reflect the usefulness of the learned embeddings for other tasks.
We used the Penn Treebank (PTB), a popular small-scale corpus, for our experiments. A version of this dataset is available from Tomas Mikolov. It consists of 929K training words with a 10K word vocabulary, which we used to train our models. To learn the SGNS word embeddings, we used word2vec's default parameter values: window size = 5, min-count = 5, and number of negative samples k = 5. We varied the dimensionality of the embeddings and the number of training iterations performed. Once the models were trained, we measured their score (9) on the training corpus.
Based on the same learning corpus, we computed $S(\mathrm{pmi}) = \mathrm{JSMI}_\alpha(X, Y)$ for $\alpha = \frac{1}{k+1} = 1/6$. Note that $p(x, y) = 0$ implies that $\mathrm{pmi}_{x,y} = -\infty$ and therefore $\log \sigma(-\mathrm{pmi}_{x,y}) = 0$. Hence, as in the second term, to compute the third term of $S(m)$ (8) for the case of $m = \mathrm{pmi}$, we can sum only over the positive pairs $(x, y)$ that actually appear in the corpus. In other words, for the special case $m = \mathrm{pmi}$, it is feasible to compute the exact score (8) and not just its approximation (9) that is based on negative sampling. Figure 1 illustrates the optimal PMI-based score, compared with the scores obtained by different models with varied embedding dimensionality and number of training iterations. As can be seen, the embedding score approaches the optimal value with higher dimensionality and more training iterations, but does not surpass it.
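The sparse computation described above can be sketched as follows: the score at the shifted PMI matrix is accumulated over observed pairs only and then compared with a dense JS-divergence computation that visits every word-context combination. The co-occurrence counts and function names are hypothetical, and the shifted PMI is taken to be $\log\frac{p(x,y)}{p(x)p(y)} - \log k$.

```python
import math
from collections import Counter

def log_sigmoid(u):
    """Numerically stable log(sigmoid(u))."""
    if u >= 0:
        return -math.log1p(math.exp(-u))
    return u - math.log1p(math.exp(u))

def unigram_counts(pair_counts):
    words, contexts = Counter(), Counter()
    for (w, c), n in pair_counts.items():
        words[w] += n
        contexts[c] += n
    return words, contexts

def pmi_score_sparse(pair_counts, k):
    """Exact score at the shifted PMI matrix, visiting observed pairs only."""
    total = sum(pair_counts.values())
    words, contexts = unigram_counts(pair_counts)
    alpha = 1.0 / (k + 1)
    score = -alpha * math.log(alpha) - (1 - alpha) * math.log(1 - alpha)
    for (w, c), n in pair_counts.items():
        p_wc = n / total
        q_wc = (words[w] / total) * (contexts[c] / total)
        shifted_pmi = math.log(p_wc / q_wc) - math.log(k)
        score += alpha * p_wc * log_sigmoid(shifted_pmi)
        # Unobserved pairs would contribute log(sigmoid(+inf)) = 0 here.
        score += (1 - alpha) * q_wc * log_sigmoid(-shifted_pmi)
    return score

def jsmi_dense(pair_counts, k):
    """Reference value: JS divergence summed over all word-context pairs."""
    total = sum(pair_counts.values())
    words, contexts = unigram_counts(pair_counts)
    alpha = 1.0 / (k + 1)
    js = 0.0
    for w in words:
        for c in contexts:
            p = pair_counts.get((w, c), 0) / total
            q = (words[w] / total) * (contexts[c] / total)
            r = alpha * p + (1 - alpha) * q
            if p > 0:
                js += alpha * p * math.log(p / r)
            js += (1 - alpha) * q * math.log(q / r)
    return js

# Hypothetical co-occurrence counts; ("cat", "barks") and ("dog", "purrs") never occur.
pair_counts = {("cat", "purrs"): 3, ("cat", "runs"): 1,
               ("dog", "barks"): 4, ("dog", "runs"): 2}
sparse_value = pmi_score_sparse(pair_counts, k=5)
dense_value = jsmi_dense(pair_counts, k=5)
```

The two values should coincide up to floating-point error, while the sparse version touches only as many entries as there are distinct observed pairs rather than the full vocabulary-by-vocabulary table.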

Conclusion
In this study, we developed a new dependency measure between random variables, denoted JSMI. This measure is based on the JS divergence and differs from the standard mutual information measure, which is based on the KL divergence. We showed that the optimization of skip-gram embeddings with negative sampling finds the best low-dimensional approximation of the JSMI measure. Thus, we provided an information-theoretic framework that we hope contributes to a better understanding of this embedding algorithm. Furthermore, although we focused here on the case of word-context joint distributions, the connection we have shown between the PMI matrix and the JSMI function is valid for every joint distribution of two random variables.