Noise Contrastive Estimation and Negative Sampling for Conditional Models: Consistency and Statistical Efficiency

Noise Contrastive Estimation (NCE) is a powerful parameter estimation method for log-linear models, which avoids calculation of the partition function or its derivatives at each training step, a computationally demanding step in many cases. It is closely related to negative sampling methods, now widely used in NLP. This paper considers NCE-based estimation of conditional models. Conditional models are frequently encountered in practice; however there has not been a rigorous theoretical analysis of NCE in this setting, and we will argue there are subtle but important questions when generalizing NCE to the conditional case. In particular, we analyze two variants of NCE for conditional models: one based on a classification objective, the other based on a ranking objective. We show that the ranking-based variant of NCE gives consistent parameter estimates under weaker assumptions than the classification-based method; we analyze the statistical efficiency of the ranking-based and classification-based variants of NCE; finally we describe experiments on synthetic data and language modeling showing the effectiveness and tradeoffs of both methods.


Introduction
This paper considers parameter estimation in conditional models of the form

$$p(y \mid x; \theta) = \frac{\exp(s(x, y; \theta))}{Z(x; \theta)} \qquad (1)$$

where $s(x, y; \theta)$ is the unnormalized score of label $y$ in conjunction with input $x$ under parameters $\theta$, $\mathcal{Y}$ is a finite set of possible labels, and $Z(x; \theta) = \sum_{y \in \mathcal{Y}} \exp(s(x, y; \theta))$ is the partition function for input $x$ under parameters $\theta$.
It is hard to overstate the importance of models of this form in NLP. In log-linear models, including both the original work on maximum-entropy models (Berger et al., 1996) and later work on conditional random fields (Lafferty et al., 2001), the scoring function is $s(x, y; \theta) = \theta \cdot f(x, y)$, where $f(x, y) \in \mathbb{R}^d$ is a feature vector and $\theta \in \mathbb{R}^d$ are the parameters of the model. In more recent work on neural networks, $s(x, y; \theta)$ is a nonlinear function. In Word2Vec the scoring function is $s(x, y; \theta) = \theta_x \cdot \theta'_y$, where $y$ is a word in the context of word $x$, and $\theta_x \in \mathbb{R}^d$ and $\theta'_y \in \mathbb{R}^d$ are the "inside" and "outside" word embeddings of $x$ and $y$.

* Part of this work done at Google. † Work done at Google.
In many NLP applications the set $\mathcal{Y}$ is large. Maximum likelihood estimation (MLE) of the parameters $\theta$ requires calculation of $Z(x; \theta)$ or its derivatives at each training step, and therefore a summation over all members of $\mathcal{Y}$, which can be computationally expensive. This has led many authors to consider alternative methods, often referred to as "negative sampling methods", where a modified training objective is used that does not require summation over $\mathcal{Y}$ on each example. Instead, negative examples are drawn from some distribution, and an objective function is derived based on binary classification or ranking. Prominent examples are the binary objective used in word2vec (Mikolov et al., 2013; see also Levy and Goldberg, 2014), and the Noise Contrastive Estimation methods of Mnih and Teh (2012) and Jozefowicz et al. (2016) for estimation of language models.
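To make the cost concrete, the following is a minimal sketch of how the conditional probabilities in Eq. 1 are computed from a list of scores. The function names and the toy scores are illustrative assumptions, not taken from the paper. The point is that the normalizer touches every label in the label set, which is the step negative sampling methods avoid.

```python
import math

def log_partition(scores):
    """log Z(x; theta) for one input x, given s(x, y; theta) for every label y."""
    m = max(scores)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(s - m) for s in scores))

def conditional_probs(scores):
    """p(y | x; theta) for every label y; cost is linear in the number of labels."""
    log_z = log_partition(scores)
    return [math.exp(s - log_z) for s in scores]

# Toy example with three labels (hypothetical scores).
scores = [1.0, 2.0, 0.5]
probs = conditional_probs(scores)
```

With a vocabulary-sized label set, `log_partition` is evaluated once per training example and per gradient step under MLE.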
In spite of the centrality of negative sampling methods, they are arguably not well understood from a theoretical standpoint. There are clear connections to noise contrastive estimation (NCE) (Gutmann and Hyvärinen, 2012), a negative sampling method for parameter estimation in joint models of the form

$$p(y; \theta) = \frac{\exp(s(y; \theta))}{Z(\theta)}, \qquad Z(\theta) = \sum_{y \in \mathcal{Y}} \exp(s(y; \theta)). \qquad (2)$$

However, there has not been a rigorous theoretical analysis of NCE in the estimation of conditional models of the form in Eq. 1, and we will argue there are subtle but important questions when generalizing NCE to the conditional case. In particular, the joint model in Eq. 2 has a single partition function $Z(\theta)$, which is estimated as a parameter of the model (Gutmann and Hyvärinen, 2012), whereas the conditional model in Eq. 1 has a separate partition function $Z(x; \theta)$ for each value of $x$. This difference is critical.
We show the following (throughout we define $K \ge 1$ to be the number of negative examples sampled per training example):
• For any $K \ge 1$, a binary classification variant of NCE, as used by Mnih and Teh (2012) and Mikolov et al. (2013), gives consistent parameter estimates under the assumption that $Z(x; \theta)$ is constant with respect to $x$ (i.e., $Z(x; \theta) = H(\theta)$ for some function $H$). Equivalently, the method is consistent under the assumption that the function $s(x, y; \theta)$ is powerful enough to incorporate $\log Z(x; \theta)$.
• For any $K \ge 1$, a ranking-based variant of NCE, as used by Jozefowicz et al. (2016), gives consistent parameter estimates under the much weaker assumption that $Z(x; \theta)$ can vary with $x$. Equivalently, there is no need for $s(x, y; \theta)$ to be powerful enough to incorporate $\log Z(x; \theta)$.
• We analyze the statistical efficiency of the ranking-based and classification-based NCE variants. Under their respective assumptions, both variants achieve Fisher efficiency (the same asymptotic mean square error as the MLE) as $K \to \infty$.
• We discuss application of our results to the approaches of Mnih and Teh (2012), Mikolov et al. (2013), Levy and Goldberg (2014) and Jozefowicz et al. (2016), giving a unified account of these methods.
• We describe experiments on synthetic data and language modeling evaluating the effectiveness of the two NCE variants.

Basic Assumptions
We assume the following setup throughout:
• We have sets $\mathcal{X}$ and $\mathcal{Y}$, where $\mathcal{X}$, $\mathcal{Y}$ are finite.
• There is some unknown joint distribution $p_{X,Y}(x, y)$ where $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. We assume that the marginal distributions satisfy $p_X(x) > 0$ for all $x \in \mathcal{X}$ and $p_Y(y) > 0$ for all $y \in \mathcal{Y}$.
• We have a scoring function $s(x, y; \theta)$ where $\theta$ are the parameters of the model. For example, $s(x, y; \theta)$ may be defined by a neural network.
• We use $\Theta$ to refer to the parameter space. We assume that $\Theta \subseteq \mathbb{R}^d$ for some integer $d$.
• We use $p_N(y)$ to refer to a distribution from which negative examples are drawn in the NCE approach. We assume that $p_N$ satisfies $p_N(y) > 0$ for all $y \in \mathcal{Y}$.
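We will consider estimation under the following two assumptions:

Assumption 2.1 There exists some parameter value $\theta^* \in \Theta$ such that for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$,

$$p_{Y|X}(y \mid x) = \frac{\exp(s(x, y; \theta^*))}{Z(x; \theta^*)}, \qquad Z(x; \theta^*) = \sum_{y' \in \mathcal{Y}} \exp(s(x, y'; \theta^*)).$$

Assumption 2.2 There exists some parameter value $\theta^* \in \Theta$, and a constant $\gamma^* \in \mathbb{R}$, such that for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$,

$$p_{Y|X}(y \mid x) = \exp(s(x, y; \theta^*) - \gamma^*).$$

Assumption 2.2 is stronger than Assumption 2.1. It requires $\log Z(x; \theta^*) \equiv \gamma^*$ for all $x \in \mathcal{X}$, that is, the conditional distribution is perfectly self-normalized. Under Assumption 2.2, it must be the case that for all $x \in \mathcal{X}$,

$$\sum_{y \in \mathcal{Y}} \exp(s(x, y; \theta^*) - \gamma^*) = 1.$$

There are $|\mathcal{X}|$ constraints but only $d + 1$ free parameters. Therefore self-normalization is a nontrivial assumption when $|\mathcal{X}| \gg d$. In the case of language modeling, $|\mathcal{X}| = |V|^k \gg d + 1$, where $|V|$ is the vocabulary size and $k$ is the length of the context, so the number of constraints grows exponentially with $k$.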
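As an illustration of why self-normalization is a real restriction, the sketch below evaluates $\log Z(x; \theta)$ for several inputs under a small log-linear score $\theta \cdot f(x, y)$. The random features, the dimensions and the helper names are assumptions made for this example only. For a generic parameter vector the values differ across inputs, so Assumption 2.2 fails even though Assumption 2.1 can still hold.

```python
import math
import random

def log_partition(scores):
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def log_z_per_input(theta, features):
    """features[x][y] is the vector f(x, y); the score is theta . f(x, y)."""
    values = []
    for feats_x in features:
        scores = [sum(t * f for t, f in zip(theta, f_xy)) for f_xy in feats_x]
        values.append(log_partition(scores))
    return values

rng = random.Random(0)
d, num_x, num_y = 3, 5, 4  # hypothetical toy sizes
theta = [rng.gauss(0, 1) for _ in range(d)]
features = [[[rng.gauss(0, 1) for _ in range(d)] for _ in range(num_y)]
            for _ in range(num_x)]

log_z = log_z_per_input(theta, features)
spread = max(log_z) - min(log_z)  # zero only if the model is self-normalized
```

Setting `theta` to the zero vector gives `log_z` equal to `log(num_y)` for every input, which is one of the few cases where the constraint holds trivially.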
Given a scoring function $s(x, y; \theta)$ that satisfies Assumption 2.1, we can derive a scoring function $s'$ that satisfies Assumption 2.2 by defining

$$s'(x, y; \theta, \{c_x : x \in \mathcal{X}\}) = s(x, y; \theta) - c_x,$$

where $c_x \in \mathbb{R}$ is a parameter for history $x$. Thus we introduce a new parameter $c_x$ for each possible history $x$. This is the most straightforward extension of NCE to the conditional case; it is used by Mnih and Teh (2012). It has the clear drawback, however, of introducing a large number of additional parameters to the model.

Two Estimation Algorithms
Figure 1 shows the two NCE-based estimation algorithms. We write $L^n_B$ for the binary objective and $L^n_R$ for the ranking objective. The binary objective essentially corresponds to a problem where the scoring function $s(x, y; \theta)$ is used to construct a binary classifier that discriminates between positive and negative examples. The ranking objective corresponds to a problem where the scoring function $s(x, y; \theta)$ is used to rank the true label $y^{(i)}$ above the negative examples $y^{(i,1)}, \ldots, y^{(i,K)}$ for the input $x^{(i)}$.

Our main result is as follows:

Theorem 3.1 (Informal: see Section 4 for a formal statement.) For any $K \ge 1$, the binary classification-based algorithm in Figure 1 is consistent under Assumption 2.2, but is not always consistent under the weaker Assumption 2.1. For any $K \ge 1$, the ranking-based algorithm in Figure 1 is consistent under either Assumption 2.1 or Assumption 2.2. Both algorithms achieve the same statistical efficiency as the maximum-likelihood estimate as $K \to \infty$.
The remainder of this section gives a sketch of the argument underlying consistency, and discusses use of the two algorithms in previous work.

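A Sketch of the Consistency Argument for the Ranking-Based Algorithm
In this section, in order to develop intuition underlying the ranking algorithm, we give a proof sketch of the following theorem:

Theorem 3.2 (See Theorem 4.1 for the formal statement.) Define $L^\infty_R(\theta) = \mathbb{E}[L^n_R(\theta)]$. Under Assumption 2.1, $\theta' \in \arg\max_\theta L^\infty_R(\theta)$ if and only if $p(y \mid x; \theta') = p_{Y|X}(y \mid x)$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$.

This theorem is key to the consistency argument. Intuitively, as $n$ increases $L^n_R(\theta)$ converges to $L^\infty_R(\theta)$, and the output of the algorithm converges to some $\theta'$ such that $p(y \mid x; \theta') = p_{Y|X}(y \mid x)$ for all $x, y$. Section 4 gives a formal argument.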
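Inputs: Training examples $\{(x^{(i)}, y^{(i)})\}_{i=1}^n$, sampling distribution $p_N(\cdot)$ for generating negative examples, an integer $K$ specifying the number of negative examples per training example, a scoring function $s(x, y; \theta)$. Flags {BINARY = true, RANKING = false} if the binary classification objective is used, {BINARY = false, RANKING = true} if the ranking objective is used.

Definitions: Define $\bar{s}(x, y; \theta) = s(x, y; \theta) - \log p_N(y)$.

Algorithm:
• For $i = 1 \ldots n$, $k = 1 \ldots K$, draw $y^{(i,k)}$ i.i.d. from the distribution $p_N(y)$. For convenience define $y^{(i,0)} = y^{(i)}$.
• If RANKING, define the ranking objective function
$$L^n_R(\theta) = \frac{1}{n} \sum_{i=1}^n \log \frac{\exp(\bar{s}(x^{(i)}, y^{(i,0)}; \theta))}{\sum_{k=0}^K \exp(\bar{s}(x^{(i)}, y^{(i,k)}; \theta))},$$
and the estimator $\hat{\theta}_R = \arg\max_{\theta \in \Theta} L^n_R(\theta)$.
• If BINARY, define the binary objective function
$$L^n_B(\theta, \gamma) = \frac{1}{n} \sum_{i=1}^n \Big\{ \log g(x^{(i)}, y^{(i,0)}; \theta, \gamma) + \sum_{k=1}^K \log\big(1 - g(x^{(i)}, y^{(i,k)}; \theta, \gamma)\big) \Big\},$$
and the estimator $(\hat{\theta}_B, \hat{\gamma}_B) = \arg\max_{\theta, \gamma} L^n_B(\theta, \gamma)$, where
$$g(x, y; \theta, \gamma) = \frac{\exp(\bar{s}(x, y; \theta) - \gamma)}{\exp(\bar{s}(x, y; \theta) - \gamma) + K}.$$

Figure 1: Two NCE-based estimation algorithms, using the ranking objective and the binary objective respectively.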
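The two objectives can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function names, the callable arguments `score` and `log_pn`, and the toy data are assumptions. The binary objective here uses the standard NCE posterior $g = \exp(\bar{s} - \gamma) / (\exp(\bar{s} - \gamma) + K)$, which equals a sigmoid applied to $\bar{s} - \gamma - \log K$.

```python
import math

def _log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def _softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def ranking_objective(examples, negatives, score, log_pn):
    """Average log-probability of picking the true label among K + 1 candidates."""
    total = 0.0
    for (x, y), negs in zip(examples, negatives):
        candidates = [y] + list(negs)  # position 0 holds the true label
        s_bar = [score(x, c) - log_pn(c) for c in candidates]
        total += s_bar[0] - _log_sum_exp(s_bar)
    return total / len(examples)

def binary_objective(examples, negatives, score, log_pn, gamma):
    """Average log-likelihood of classifying true labels vs. noise samples."""
    total = 0.0
    for (x, y), negs in zip(examples, negatives):
        log_k = math.log(len(negs))

        def logit(c):
            return score(x, c) - log_pn(c) - gamma - log_k

        total += -_softplus(-logit(y))                    # log g(true label)
        total += sum(-_softplus(logit(c)) for c in negs)  # log(1 - g(negative))
    return total / len(examples)

# Toy check: 4 labels, uniform noise, constant score, K = 3 negatives.
log_pn = lambda y: math.log(0.25)
score = lambda x, y: 0.0
examples = [("a", 0), ("b", 2)]
negatives = [[1, 2, 3], [0, 0, 1]]
rank_value = ranking_objective(examples, negatives, score, log_pn)
bin_value = binary_objective(examples, negatives, score, log_pn, gamma=math.log(4))
```

Note that `ranking_objective` has no `gamma` argument: any term that depends only on `x` cancels between the numerator and the denominator, which is the structural reason the ranking variant does not need self-normalization.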
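The proof of Theorem 3.2 rests on two identities. The first identity (Eq. 5) states that the limiting objective $L^\infty_R(\theta)$ is the expectation, with respect to the density function $\frac{1}{K+1}\alpha(x, \bar{y})$, of the negative cross-entropy between an underlying distribution $\beta(\cdot \mid x, \bar{y})$ and the model distribution $q(\cdot \mid x, \bar{y}; \theta)$ (see Section B.1.1 of the supplementary material for the derivation). The second identity concerns the relationship between $q(\cdot \mid x, \bar{y}; \theta)$ and $\beta(\cdot \mid x, \bar{y})$: under Assumption 2.1, the two distributions coincide at $\theta = \theta^*$. It follows immediately from the properties of negative cross-entropy that $\theta^*$ maximizes $L^\infty_R(\theta)$ (Eq. 7). The remainder of the argument is as follows: Eqs. 7 and 5 imply that any other maximizer $\theta'$ must also match $\beta(\cdot \mid x, \bar{y})$, and hence must give the same conditional distribution as $\theta^*$. See the proof of Lemma B.3 in the supplementary material.

In summary, the identity in Eq. 5 is key: the objective function in the limit, $L^\infty_R(\theta)$, is related to a negative cross-entropy between the underlying distribution $\beta(\cdot \mid x, \bar{y})$ and a distribution under the parameters, $q(\cdot \mid x, \bar{y}; \theta)$. The parameters $\theta^*$ maximize this negative cross-entropy over the space of all distributions $\{q(\cdot \mid x, \bar{y}; \theta), \theta \in \Theta\}$.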
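The cancellation behind the second identity can be written out as a short worked derivation. This is a sketch in our own notation, not a reproduction of the supplementary material: $\pi(k \mid x, \bar{y})$ is a symbol introduced here for the posterior probability that position $k$ of the candidate list $\bar{y} = (y_0, \ldots, y_K)$ holds the true label, assuming the true label is placed at a uniformly random position and the remaining $K$ entries are drawn i.i.d. from $p_N$.

```latex
\begin{align*}
\pi(k \mid x, \bar{y})
  &= \frac{p_{Y|X}(y_k \mid x) \prod_{j \ne k} p_N(y_j)}
          {\sum_{k'=0}^{K} p_{Y|X}(y_{k'} \mid x) \prod_{j \ne k'} p_N(y_j)}
   = \frac{p_{Y|X}(y_k \mid x) / p_N(y_k)}
          {\sum_{k'=0}^{K} p_{Y|X}(y_{k'} \mid x) / p_N(y_{k'})} \\
  &= \frac{\exp(s(x, y_k; \theta^*) - \log p_N(y_k)) / Z(x; \theta^*)}
          {\sum_{k'=0}^{K} \exp(s(x, y_{k'}; \theta^*) - \log p_N(y_{k'})) / Z(x; \theta^*)}
     && \text{(Assumption 2.1)} \\
  &= \frac{\exp(\bar{s}(x, y_k; \theta^*))}
          {\sum_{k'=0}^{K} \exp(\bar{s}(x, y_{k'}; \theta^*))}.
\end{align*}
```

The partition function $Z(x; \theta^*)$ appears in every term and cancels, so the model's softmax over the $K + 1$ candidates matches the true posterior at $\theta^*$ without any self-normalization requirement.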

The Algorithms in Previous Work
To motivate the importance of the two algorithms, we now discuss their application in previous work. Mnih and Teh (2012) consider language modeling, where $x = w_1 w_2 \ldots w_{n-1}$ is a history consisting of the previous $n - 1$ words, and $y$ is a word. The scoring function is defined as

$$s(x, y; \theta) = \Big(\sum_{i=1}^{n-1} C_i r_{w_i}\Big) \cdot q_y + b_y - c_x,$$

where $r_{w_i}$ is an embedding (vector of parameters) for history word $w_i$, $q_y$ is an embedding (vector of parameters) for word $y$, each $C_i$ for $i = 1 \ldots n - 1$ is a matrix of parameters specifying the contribution of $r_{w_i}$ to the history representation, $b_y$ is a bias term for word $y$, and $c_x$ is a parameter corresponding to the log normalization term for history $x$. Thus each history $x$ has its own parameter $c_x$. The binary objective function is used in the NCE algorithm. The noise distribution $p_N(y)$ is set to be the unigram distribution over words in the vocabulary.
This method is a direct application of the original NCE method to conditional estimation, through introduction of the parameters $c_x$ corresponding to normalization terms for each history. Interestingly, Mnih and Teh (2012) acknowledge the difficulties in maintaining a separate parameter $c_x$ for each history, and set $c_x = 0$ for all $x$, noting that empirically this works well, but without giving justification.

Mikolov et al. (2013) consider an NCE-based method using the binary objective function for estimation of word embeddings. The skip-gram method described in the paper corresponds to a model where $x$ is a word, and $y$ is a word in the context. The vector $v_x$ is the embedding for word $x$, and the vector $v'_y$ is an embedding for word $y$ (separate embeddings are used for $x$ and $y$). The method they describe uses $\bar{s}(x, y; \theta) = v'_y \cdot v_x$, or equivalently $s(x, y; \theta) = v'_y \cdot v_x + \log p_N(y)$. The negative-sampling distribution $p_N(y)$ was chosen as the unigram distribution $p_Y(y)$ raised to the power 3/4. The end goal of the method was to learn useful embeddings $v_w$ and $v'_w$ for each word in the vocabulary; however, the method gives a consistent estimate for a model of the form

$$p(y \mid x) = \frac{p_N(y) \exp(v'_y \cdot v_x)}{Z(x; \theta)}, \qquad Z(x; \theta) = \sum_{y \in \mathcal{Y}} p_N(y) \exp(v'_y \cdot v_x),$$

assuming that Assumption 2.2 holds, i.e. $Z(x; \theta) = H(\theta)$ does not vary with $x$.
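A noise distribution of the kind described above (unigram counts raised to the power 3/4, then renormalized) can be sketched as follows. The function names and the toy counts are assumptions for illustration; this is not code from word2vec itself.

```python
import random

def smoothed_noise_distribution(counts, power=0.75):
    """Normalize count ** power over the vocabulary to obtain p_N."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}

def sample_negatives(noise, k, rng):
    """Draw k negative examples i.i.d. from the noise distribution."""
    words = list(noise)
    probs = [noise[w] for w in words]
    return rng.choices(words, weights=probs, k=k)

# Hypothetical two-word vocabulary.
counts = {"the": 16, "cat": 1}
noise = smoothed_noise_distribution(counts)
negatives = sample_negatives(noise, k=5, rng=random.Random(0))
```

Relative to the raw unigram distribution, the exponent below 1 shifts probability mass toward rare words, so they appear as negatives more often than their frequency alone would suggest.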
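Levy and Goldberg (2014) make a connection between the NCE-based method of Mikolov et al. (2013) and factorization of a matrix of pointwise mutual information (PMI) values of $(x, y)$ pairs. Consistency of the NCE-based method under Assumption 2.2 implies a similar result, specifically: if we define $p_N(y) = p_Y(y)$, and define $s(x, y; \theta) = v'_y \cdot v_x + \log p_N(y)$, then the parameters $v'_y$, $v_x$ converge to values such that

$$p(y \mid x) = \frac{p_Y(y) \exp(v'_y \cdot v_x)}{H(\theta)}, \qquad \text{or equivalently} \qquad \mathrm{PMI}(x, y) = \log \frac{p(y \mid x)}{p_Y(y)} = v'_y \cdot v_x - \log H(\theta).$$

That is, following Levy and Goldberg (2014), the inner product $v'_y \cdot v_x$ is an estimate of the PMI up to a constant offset $\log H(\theta)$.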
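The PMI relationship can be checked numerically on a small joint distribution. In the sketch below the joint table and the helper name `pmi_matrix` are illustrative assumptions. It verifies that if the inner product equals the PMI exactly, the model $p_Y(y) \exp(\mathrm{PMI}(x, y))$ is already normalized for every $x$, i.e. the constant offset is zero.

```python
import math

def pmi_matrix(joint):
    """Return PMI(x, y) = log p(x, y) / (p(x) p(y)) together with the marginals."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    pmi = [[math.log(joint[i][j] / (p_x[i] * p_y[j]))
            for j in range(len(p_y))]
           for i in range(len(p_x))]
    return pmi, p_x, p_y

# Hypothetical joint distribution over 2 inputs and 2 labels.
joint = [[0.2, 0.1],
         [0.1, 0.6]]
pmi, p_x, p_y = pmi_matrix(joint)

# Partition function of the model p_Y(y) * exp(PMI(x, y)) for each input x.
normalizers = [sum(p_y[j] * math.exp(pmi[i][j]) for j in range(len(p_y)))
               for i in range(len(p_x))]
```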
Finally, Jozefowicz et al. (2016) introduce the ranking-based variant of NCE for the language modeling problem. This is the same as the ranking-based algorithm in Figure 1. They do not, however, make the connection to Assumptions 2.2 and 2.1, or derive the consistency or efficiency results in the current paper. Jozefowicz et al. (2016) partially motivate the ranking-based variant through the importance sampling viewpoint of Bengio and Senécal (2008). However, there are two critical differences: 1) the algorithm of Bengio and Senécal (2008) does not lead to the same objective $L^n_R$ as the ranking-based variant of NCE; instead it uses importance sampling to derive an objective that is similar but not identical; 2) the importance sampling method leads to a biased estimate of the gradients of the log-likelihood function, with the bias going to zero only as $K \to \infty$. In contrast, the theorems in the current paper show that the NCE-based methods are consistent for any value of $K$. In summary, while it is tempting to view the ranking variant of NCE as an importance sampling method, the NCE-based view gives stronger guarantees for finite values of $K$.

Theory
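This section states the main theorems. The supplementary material contains proofs. Throughout the paper, we use $\mathbb{E}$, with a subscript naming the random variables where needed, to denote expectation with respect to the corresponding distribution. We use $\|\cdot\|$ to denote either the $\ell_2$ norm when the operand is a vector or the spectral norm when the operand is a matrix. Finally, we use $\Rightarrow$ to represent convergence in distribution. Recall that we have defined $\bar{s}(x, y; \theta) = s(x, y; \theta) - \log p_N(y)$.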

Ranking
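In this section, we study noise contrastive estimation with the ranking objective under Assumption 2.1. First consider the following function:

$$L^\infty_R(\theta) = \sum_{x, y_0, \ldots, y_K} p_{X,Y}(x, y_0) \prod_{k=1}^K p_N(y_k) \, \log \frac{\exp(\bar{s}(x, y_0; \theta))}{\sum_{k=0}^K \exp(\bar{s}(x, y_k; \theta))}.$$

By straightforward calculation, one can find that $L^\infty_R(\theta) = \mathbb{E}[L^n_R(\theta)]$. Under mild conditions, $L^n_R(\theta)$ converges to $L^\infty_R(\theta)$ as $n \to \infty$. Denote the set of maximizers of $L^\infty_R(\theta)$ by $\Theta^*_R$, that is, $\Theta^*_R = \arg\max_{\theta \in \Theta} L^\infty_R(\theta)$. The following theorem (Theorem 4.1) shows that a parameter vector $\bar{\theta}$ belongs to $\Theta^*_R$ if and only if it gives the correct conditional distribution $p_{Y|X}(y \mid x)$. In addition, $\Theta^*_R$ is a singleton if and only if the identifiability condition of Assumption 4.1 holds.

Next we consider consistency of the estimation algorithm based on the ranking objective under the following regularity assumptions:

Assumption 4.2 (Continuity). $s(x, y; \theta)$ is continuous w.r.t. $\theta$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$.

Theorem 4.2 (Consistency) Under Assumptions 2.1, 4.2, 4.3, the estimates based on the ranking objective are strongly consistent in the sense that for any fixed $K \ge 1$,

$$P\Big\{ \lim_{n \to \infty} \min_{\theta^* \in \Theta^*_R} \|\hat{\theta}^n_R - \theta^*\| = 0 \Big\} = 1.$$

Further, if Assumption 4.1 holds,

$$P\Big\{ \lim_{n \to \infty} \hat{\theta}^n_R = \theta^* \Big\} = 1.$$

Remark 4.1 Throughout the paper, all NCE estimators are defined for some fixed $K$. We suppress the dependence on $K$ to simplify notation (e.g. $\hat{\theta}^n_R$ should be interpreted as $\hat{\theta}^{n,K}_R$).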

Classification
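Now we turn to the analysis of NCE with the binary objective under Assumption 2.2. First consider the following function:

$$L^\infty_B(\theta, \gamma) = \sum_{x, y} \Big\{ p_{X,Y}(x, y) \log g(x, y; \theta, \gamma) + K \, p_X(x) \, p_N(y) \log\big(1 - g(x, y; \theta, \gamma)\big) \Big\},$$

where $g(x, y; \theta, \gamma) = \exp(\bar{s}(x, y; \theta) - \gamma) / (\exp(\bar{s}(x, y; \theta) - \gamma) + K)$ as in Figure 1. One can find that $L^\infty_B(\theta, \gamma) = \mathbb{E}[L^n_B(\theta, \gamma)]$. Denote the set of maximizers of $L^\infty_B(\theta, \gamma)$ by $\Omega^*_B$, that is, $\Omega^*_B = \arg\max_{\theta, \gamma} L^\infty_B(\theta, \gamma)$. Results parallel to Theorems 4.1 and 4.2 are established as follows.

Assumption 4.4 (Identifiability). For any $\theta \in \Theta$, if there exists some constant $c$ such that $s(x, y; \theta) - s(x, y; \theta^*) \equiv c$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$, then $\theta = \theta^*$ and thus $c = 0$.

Similarly we can define the sequence of estimates $(\hat{\theta}^n_B, \hat{\gamma}^n_B)$, $n = 1, 2, \ldots$ based on the binary objective.

Theorem 4.4 (Consistency) Under Assumptions 2.2, 4.2, 4.5, the estimates defined by the binary objective are strongly consistent in the sense that for any $K \ge 1$,

$$P\Big\{ \lim_{n \to \infty} \min_{(\theta^*, \gamma^*) \in \Omega^*_B} \|(\hat{\theta}^n_B, \hat{\gamma}^n_B) - (\theta^*, \gamma^*)\| = 0 \Big\} = 1.$$

If further Assumption 4.4 holds,

$$P\Big\{ \lim_{n \to \infty} (\hat{\theta}^n_B, \hat{\gamma}^n_B) = (\theta^*, \gamma^*) \Big\} = 1.$$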

Asymptotic Normality and Statistical Efficiency
Noise Contrastive Estimation significantly reduces the computational complexity, especially when the size $|\mathcal{Y}|$ of the label space is large. It is natural to ask: does such scalability come at a cost? Classical likelihood theory tells us that, under mild conditions, the maximum likelihood estimator (MLE) has nice properties like asymptotic normality and Fisher efficiency. More specifically, as the sample size goes to infinity, the distribution of the MLE converges to a multivariate normal distribution, and the mean square error of the MLE achieves the Cramér-Rao lower bound (Ferguson, 1996).
We have shown the consistency of the NCE estimators in Theorem 4.2 and Theorem 4.4. In this part of the paper, we derive their asymptotic distribution and quantify their statistical efficiency. To this end, we restrict ourselves to the case where $\theta^*$ is identifiable (i.e. Assumption 4.1 or 4.4 holds) and the scoring function $s(x, y; \theta)$ satisfies the following smoothness condition:

Assumption 4.6 (Smoothness). The scoring function $s(x, y; \theta)$ is twice continuously differentiable w.r.t. $\theta$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$.
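We first introduce the following maximum-likelihood estimator:

$$\hat{\theta}^{\mathrm{MLE}} = \arg\max_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n \log p(y^{(i)} \mid x^{(i)}; \theta).$$

Define the matrix

$$I_{\theta^*} = \mathbb{E}_X\big[ \mathrm{Var}_{Y|X=x}\big( \nabla_\theta s(x, y; \theta^*) \big) \big].$$

As shown below, $I_{\theta^*}$ is essentially the Fisher information matrix under the conditional model.

Theorem 4.5 Under the assumptions above, if $I_{\theta^*}$ is non-singular, then as $n \to \infty$,

$$\sqrt{n}\,(\hat{\theta}^{\mathrm{MLE}} - \theta^*) \Rightarrow \mathcal{N}(0, I_{\theta^*}^{-1}).$$

For any given estimator $\hat{\theta}$, define the scaled asymptotic mean square error by

$$\mathrm{MSE}_\infty(\hat{\theta}) = \lim_{n \to \infty} \mathbb{E}\Big[ \big\| \sqrt{n/d}\,(\hat{\theta} - \theta^*) \big\|^2 \Big],$$

where $d$ is the dimension of the parameter $\theta^*$. Theorem 4.5 implies that

$$\mathrm{MSE}_\infty(\hat{\theta}^{\mathrm{MLE}}) = \mathrm{Tr}(I_{\theta^*}^{-1}) / d,$$

where $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix. According to classical MLE theory (Ferguson, 1996), under certain regularity conditions, this is the best achievable mean square error. So the next question to answer is: can these NCE estimators approach this limit? The results below additionally rely on a boundedness condition (Assumption 4.7) involving $|s(x, y; \theta^*)|$, $\|\nabla_\theta s(x, y; \theta^*)\|$ and $\lambda_{\min}(\cdot)$, where $\lambda_{\min}(\cdot)$ denotes the smallest singular value.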
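Theorem 4.6 (Ranking) Under Assumptions 2.1, 4.1, 4.3, 4.6, 4.7, there exists an integer $K_0$ such that for all $K \ge K_0$, as $n \to \infty$, $\sqrt{n}\,(\hat{\theta}_R - \theta^*)$ converges in distribution to a zero-mean normal distribution whose covariance is determined by some matrix $I_{R,K}$. There exists a constant $C$ such that for all $K \ge K_0$, the scaled asymptotic mean square error of $\hat{\theta}_R$ differs from that of the MLE by at most a term that vanishes as $K \to \infty$.

Theorem 4.7 Under Assumptions 2.2, 4.4, 4.5, 4.6, 4.7, there exists an integer $K_0$ such that, for any $K \ge K_0$, as $n \to \infty$, $\sqrt{n}\,(\hat{\theta}_B - \theta^*)$ converges in distribution to a zero-mean normal distribution whose covariance is determined by some matrix $I_{B,K}$. There exists a constant $C$ such that for all $K \ge K_0$, the scaled asymptotic mean square error of $\hat{\theta}_B$ differs from that of the MLE by at most a term that vanishes as $K \to \infty$.

Remark 4.2 Theorems 4.6 and 4.7 reveal that under their respective model assumptions, for any given $K \ge K_0$ both NCE estimators are asymptotically normal and $\sqrt{n}$-consistent. Moreover, both NCE estimators approach Fisher efficiency (statistical optimality) as $K$ grows.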
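The efficiency benchmark that the NCE estimators are compared against can be computed directly for a small model. The sketch below assumes a log-linear score $s(x, y; \theta) = \theta \cdot f(x, y)$ with two-dimensional features, so that $\nabla_\theta s = f(x, y)$ and the Fisher information is the expected conditional covariance of the features; the function names and the toy model are illustrative assumptions. It returns $\mathrm{Tr}(I^{-1}) / d$, the scaled asymptotic mean square error of the MLE.

```python
def fisher_information(p_x, cond_probs, features):
    """I = sum_x p(x) * Cov_{y ~ p(.|x)}[f(x, y)] for 2-dimensional features."""
    info = [[0.0, 0.0], [0.0, 0.0]]
    for px, probs, feats in zip(p_x, cond_probs, features):
        mean = [sum(p * f[a] for p, f in zip(probs, feats)) for a in range(2)]
        for a in range(2):
            for b in range(2):
                second = sum(p * f[a] * f[b] for p, f in zip(probs, feats))
                info[a][b] += px * (second - mean[a] * mean[b])
    return info

def scaled_mse_limit(info):
    """Tr(I^{-1}) / d for a 2x2 matrix, using the closed-form inverse."""
    det = info[0][0] * info[1][1] - info[0][1] * info[1][0]
    trace_of_inverse = (info[0][0] + info[1][1]) / det
    return trace_of_inverse / 2

# Toy model: one input, three equally likely labels (theta = 0).
p_x = [1.0]
cond_probs = [[1 / 3, 1 / 3, 1 / 3]]
features = [[(1.0, 0.0), (0.0, 1.0), (0.0, 0.0)]]
info = fisher_information(p_x, cond_probs, features)
mle_limit = scaled_mse_limit(info)
```

For this toy model the information matrix is [[2/9, -1/9], [-1/9, 2/9]], its inverse is [[6, 3], [3, 6]], and the limit is 12 / 2 = 6.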

Simulations
Suppose we have a feature space $\mathcal{X} \subset \mathbb{R}^d$ with $|\mathcal{X}| = m_x$, label space $\mathcal{Y} = \{1, \cdots, m_y\}$, and parameter $\theta = (\theta_1, \cdots, \theta_{m_y}) \in \mathbb{R}^{m_y \times d}$. Then for any given sample size $n$, we can generate observations $(x^{(i)}, y^{(i)})$ by first sampling $x^{(i)}$ uniformly from $\mathcal{X}$ and then sampling $y^{(i)} \in \mathcal{Y}$ from the conditional model

$$p(y \mid x; \theta) = \frac{\exp(x^\top \theta_y)}{\sum_{y'=1}^{m_y} \exp(x^\top \theta_{y'})}.$$

We first consider the estimation of $\theta$ by MLE and NCE-ranking. We fix $d = 4$, $m_x = 200$, $m_y = 100$ and generate $\mathcal{X}$ and the parameter $\theta$ from separate mixtures of Gaussians. We try different configurations of $(n, K)$ and report the KL divergence between the estimated distribution and the true distribution, as summarized in the left panel of Figure 2. The observations are:
• The NCE estimators are consistent for any fixed $K$. For a fixed sample size, the NCE estimators become comparable to MLE as $K$ increases.
• The larger the sample size, the less sensitive the NCE estimators are to $K$. A very small value of $K$ seems to suffice for large sample sizes.
Clearly, under the parametrization above, the model is not self-normalized. To use NCE-binary, we add an extra $x$-dependent bias parameter $b_x$ to the score function (i.e. $s(x, y; \theta) = x^\top \theta_y + b_x$) to make the model self-normalized; otherwise the algorithm will not be consistent. Patterns similar to Figure 2 are observed when varying the sample size and $K$ (see Section A.1 of the supplementary material). However, this makes NCE-binary not directly comparable to NCE-ranking and MLE, since its performance is compromised by estimating extra parameters, and the number of extra parameters depends on the richness of the feature space $\mathcal{X}$. To make this clear, we fix $n = 16000$, $d = 4$, $m_y = 100$, $K = 32$ and experiment with $m_x = 100, 200, 300, 400$. The results are summarized in the right panel of Figure 2. As $|\mathcal{X}|$ increases, the KL divergence of NCE-binary grows, while the performance of NCE-ranking and MLE is independent of $|\mathcal{X}|$. Without the $x$-dependent bias term for NCE-binary, the KL divergence is much higher due to lack of consistency (0.19, 0.21, 0.24, 0.26 respectively).
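The data-generating process and the evaluation metric of this section can be sketched as follows. Several details here are assumptions made for the example: inputs and parameters are drawn from a single standard Gaussian rather than the mixtures of Gaussians used in the experiments, the sizes are much smaller, and the KL divergence is taken from the true to the estimated conditional distribution and averaged uniformly over inputs.

```python
import math
import random

def cond_dist(x, theta):
    """p(. | x; theta) for the score x . theta_y, one parameter row per label."""
    scores = [sum(a * b for a, b in zip(x, row)) for row in theta]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def make_model(rng, d, m_x, m_y):
    xs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m_x)]
    theta = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m_y)]
    return xs, theta

def sample_data(rng, xs, theta, n):
    """Draw x uniformly from the feature set, then y from the conditional model."""
    data = []
    for _ in range(n):
        i = rng.randrange(len(xs))
        probs = cond_dist(xs[i], theta)
        y = rng.choices(range(len(theta)), weights=probs)[0]
        data.append((i, y))
    return data

def mean_kl(xs, theta_true, theta_est):
    """Average over inputs of KL(p(.|x; theta_true) || p(.|x; theta_est))."""
    total = 0.0
    for x in xs:
        p = cond_dist(x, theta_true)
        q = cond_dist(x, theta_est)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(xs)

rng = random.Random(0)
xs, theta = make_model(rng, d=4, m_x=20, m_y=10)
data = sample_data(rng, xs, theta, n=100)
zero_theta = [[0.0] * 4 for _ in range(10)]
kl_of_uniform_model = mean_kl(xs, theta, zero_theta)
```

An estimator would be fit on `data` and scored with `mean_kl`; the all-zero parameter used here is only a stand-in to show that a wrong model yields a strictly positive value.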

Language Modeling
We evaluate the performance of the two NCE algorithms on a language modeling problem, using the Penn Treebank (PTB) dataset (Marcus et al., 1993). We choose Zaremba et al. (2014) as the benchmark, where the conditional distribution is modeled by two-layer LSTMs and the parameters are estimated by MLE (note that the current state of the art is Yang et al. (2018)). Zaremba et al. (2014) implemented 3 model configurations: "Small", "Medium" and "Large", which have 200, 650 and 1500 units per layer respectively. We follow their setup (model size, unrolled steps, dropout ratio, etc.) but train the model by maximizing the two NCE objectives. We use the unigram distribution as the negative sampling distribution and consider $K = 200, 400, 800, 1600$.
The results on the test set are summarized in Table 1. Similar patterns are observed on the validation set (see Section A.2 of the supplementary material). As shown in the table, the performance of NCE-ranking and NCE-binary improves as the number of negative examples increases, and eventually both outperform the MLE.
An interesting observation is that, without regularization, the binary classification approach outperforms both ranking and MLE. This suggests the model space (two-layer LSTMs) is rich enough to approximately incorporate the $x$-dependent partition function $Z(x; \theta)$, thus making the model approximately self-normalized. This motivates us to modify the ranking and MLE objectives by adding a regularization term computed from samples $\tilde{y}^{(i,j)}$, $1 \le j \le m$, drawn from the noise distribution $p_N(\cdot)$. This regularization term promotes a constant partition function, that is, $Z(x; \theta) \approx 1$ for all $x \in \mathcal{X}$. In our experiments, we fix $m$ to be 1/10 of the vocabulary size, $K = 1600$, and tune the regularization parameter $\alpha$. As shown in the last three rows of the table, regularization significantly improves the performance of both the ranking approach and the MLE.
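One plausible form of such a regularizer is sketched below; it is an assumption for illustration and not necessarily the exact term used in the experiments. It relies on the identity $Z(x; \theta) = \mathbb{E}_{y \sim p_N}[\exp(s(x, y; \theta) - \log p_N(y))]$, so the average of $\exp(\bar{s})$ over $m$ noise samples is a Monte Carlo estimate of the partition function, and penalizes the squared log of that estimate. The function name and arguments are hypothetical.

```python
import math

def self_normalization_penalty(s_bar_values, alpha):
    """alpha * (log Z_hat)^2, where Z_hat averages exp(s_bar) over noise samples.

    s_bar_values holds s(x, y_j; theta) - log p_N(y_j) for m labels y_j
    drawn from the noise distribution p_N (sampling is done by the caller).
    """
    z_hat = sum(math.exp(v) for v in s_bar_values) / len(s_bar_values)
    return alpha * math.log(z_hat) ** 2

# If every sampled s_bar is 0 the estimate of Z is 1 and the penalty vanishes.
no_penalty = self_normalization_penalty([0.0, 0.0, 0.0], alpha=0.1)
# If every sampled s_bar is 1 the estimate of Z is e and the penalty is alpha.
some_penalty = self_normalization_penalty([1.0, 1.0], alpha=0.5)
```

Since the NCE and MLE objectives are maximized, a penalty of this kind would be subtracted from the objective for each training input.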

Conclusions
In this paper we have analyzed binary and ranking variants of NCE for estimation of conditional models $p(y \mid x; \theta)$. The ranking-based variant is consistent for a broader class of models than the binary-based algorithm. Both algorithms achieve Fisher efficiency as the number of negative examples increases. Experiments show that both algorithms outperform MLE on a language modeling task. The ranking-based variant of NCE outperforms the binary-based variant once a regularizer is introduced that encourages self-normalization.