An experimental analysis of Noise-Contrastive Estimation: the noise distribution matters

Noise Contrastive Estimation (NCE) is a learning procedure regularly used to train neural language models, since it avoids the computational bottleneck caused by the output softmax. In this paper, we attempt to explain some of the weaknesses of this objective function and to draw directions for further developments. Experiments on a small task show the issues raised by a unigram noise distribution, and that a context-dependent noise distribution, such as a bigram distribution, can solve these issues and provide stable and data-efficient learning.


Introduction
Statistical language models (LMs) play an important role in many tasks, such as machine translation and speech recognition. Neural models, with various neural architectures (Bengio et al., 2001; Mikolov et al., 2010; Chelba et al., 2014; Józefowicz et al., 2016), have recently achieved great success. However, most of these neural architectures share a common issue: large output vocabularies cause a computational bottleneck due to the output normalization.
Different solutions have been proposed, such as shortlists (Schwenk, 2007), hierarchical softmax (Morin and Bengio, 2005; Mnih and Hinton, 2009; Le et al., 2011), or self-normalisation techniques (Devlin et al., 2014; Andreas et al., 2015; Chen et al., 2016). Sampling-based techniques explore a different direction, where a limited number of negative examples are sampled to reduce the normalization cost. The resulting model is theoretically unnormalized. Apart from importance sampling (Bengio and Sénécal, 2008; Jean et al., 2015), noise contrastive estimation (NCE) provides a simple and efficient sampling strategy, which our work focuses on.
Introduced by (Gutmann and Hyvärinen, 2010), NCE proposes an objective function that replaces the conventional log-likelihood with a binary classification task, discriminating between real examples provided by the data and negative examples sampled from a chosen noise distribution. This allows the model to learn indirectly from the data distribution. NCE was first applied to language modeling by (Mnih and Teh, 2012), and then to various models, often in the context of machine translation (Vaswani et al., 2013; Baltescu and Blunsom, 2015; Zoph et al., 2016). However, a recent comparative study of methods for training large-vocabulary LMs (Chen et al., 2016) highlighted the inconsistency of NCE training when dealing with very large vocabularies, showing very different perplexity results for close loss values. In another work (Józefowicz et al., 2016), NCE was shown to be far less data-efficient than the theoretically similar importance sampling.
In this paper, we focus on a small task to provide an in-depth analysis of the results. NCE relies on the definition of an artificial classification task that must be monitored. Indeed, using a unigram noise distribution, as usually advised, leads to an ineffective solution, where the model almost systematically classifies words in the noise class. This can be explained by the inability to sample rare words from the noise distribution, yielding inconsistent updates for the most frequent words. We explore other noise distributions and show that designing a more suitable classification task, for instance with a simple bigram distribution, can efficiently correct the weaknesses of NCE.

Theoretical background
A neural probabilistic language model with parameters $\theta$ outputs, for an input context $H$, a conditional distribution $P^H_\theta$ for the next word over the vocabulary $V$. This conditional distribution is defined using the softmax activation function:

$$P^H_\theta(w) = \frac{e^{s_\theta(w, H)}}{Z(H)}, \quad \text{with} \quad Z(H) = \sum_{w' \in V} e^{s_\theta(w', H)}. \quad (1)$$

Here, $s_\theta(w, H)$ is a scoring function which depends on the network architecture. The denominator is the partition function $Z(H)$, which ensures that the output scores are normalized into a probability distribution.
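As an illustration, here is a minimal NumPy sketch of this normalization step; the function and variable names (`softmax_distribution`, `scores`) are ours, not from the paper:

```python
import numpy as np

def softmax_distribution(scores):
    """Turn unnormalized scores s_theta(., H) over the vocabulary V
    into a probability distribution P_theta^H via the softmax."""
    # Subtract the max for numerical stability; this does not change the result.
    shifted = scores - np.max(scores)
    exp_scores = np.exp(shifted)
    partition = exp_scores.sum()        # Z(H): a sum over the whole vocabulary
    return exp_scores / partition

# Example: a toy vocabulary of 5 words with arbitrary scores for one context H.
scores = np.array([2.0, -1.0, 0.5, 0.0, 1.5])
print(softmax_distribution(scores))     # sums to 1
```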

Maximum likelihood training
Maximum likelihood training is performed by minimizing the negative log-likelihood. Parameter updates follow the gradient of this objective, increasing the score of the positive output while decreasing the scores of the rest of the vocabulary. Unfortunately, both the output normalization and the gradient computation require the score of every word in $V$, which is the bottleneck during training, since it implies products of very large matrices ($|V|$ usually ranging from tens to hundreds of thousands of words).
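To make the bottleneck explicit, the per-example negative log-likelihood and its gradient can be written, with the notation of equation 1, as the standard derivation:

$$\mathcal{L}(\theta) = -\log P^H_\theta(w) = -s_\theta(w, H) + \log Z(H), \qquad \nabla_\theta \mathcal{L}(\theta) = -\nabla_\theta s_\theta(w, H) + \sum_{w' \in V} P^H_\theta(w')\, \nabla_\theta s_\theta(w', H).$$

Both $\log Z(H)$ and the expectation appearing in the gradient require a sum over the full vocabulary $V$.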

Noise contrastive estimation
The idea behind noise contrastive estimation is to learn the relative description of the data distribution $P_d$ with respect to a reference noise distribution $P_n$, by learning their ratio $P_d / P_n$. This is done by drawing samples from the noise distribution and learning to discriminate between the two sets via a classification task. Considering a mixture of the data and noise distributions, for each example $w$ with a context $H$ from the data $D$, we draw $k$ noise samples from $P^H_n$. Using logistic regression, we want to estimate the posterior probability of the class $C$ ($C = 1$ for the data, $C = 0$ for the noise) each sample comes from. Since we want to approach the data distribution with our model of parameters $\theta$, the conditional class probabilities are:

$$P(w \mid C=1, H) = P^H_\theta(w) \quad \text{and} \quad P(w \mid C=0, H) = P^H_n(w). \quad (2)$$

Since noise samples are $k$ times more frequent than data samples in the mixture, the class priors are $P(C=1) = \frac{1}{k+1}$ and $P(C=0) = \frac{k}{k+1}$, which gives the posterior class probabilities:

$$P(C=1 \mid w, H) = \frac{P^H_\theta(w)}{P^H_\theta(w) + k P^H_n(w)} \quad \text{and} \quad P(C=0 \mid w, H) = \frac{k P^H_n(w)}{P^H_\theta(w) + k P^H_n(w)}, \quad (3)$$

which can be rewritten, using the unnormalized model output $P^H_\theta(w) = e^{s_\theta(w, H)}$, as:

$$P(C=1 \mid w, H) = \sigma\big(\Delta s_\theta(w, H)\big), \quad (4)$$

with:

$$\Delta s_\theta(w, H) = s_\theta(w, H) - \log\big(k P^H_n(w)\big). \quad (5)$$

The reformulation obtained in equation 4 shows that training a classifier based on a logistic regression will estimate the log-ratio of the two distributions. This allows the learned distribution to be unnormalized, as the partition function is parametrized separately. A normalizing parameter $c_H$ is added, as follows:

$$s_\theta(w, H) = s_{\theta_0}(w, H) + c_H. \quad (6)$$

However, this parametrization is context-dependent. In (Mnih and Teh, 2012), the authors argue that these context-dependent parameters $c_H$ can be set to zero, and that given the number of free parameters, the output scores $s_{\theta_0}(\cdot, H)$ for each context will self-normalize. The objective function is obtained by maximizing the log-likelihood of the true example $w$ belonging to class $C = 1$ and of the noise samples $(w^n_j)_{1 \le j \le k}$ belonging to class $C = 0$, which is, for one true example:

$$J^{H,w}(\theta) = \log \sigma\big(\Delta s_\theta(w, H)\big) + \sum_{j=1}^{k} \log \Big(1 - \sigma\big(\Delta s_\theta(w^n_j, H)\big)\Big). \quad (7)$$

In order to obtain the global objective to maximize, we sum over all examples $(H, w) \in D$:

$$J(\theta) = \sum_{(H, w) \in D} J^{H,w}(\theta). \quad (8)$$
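For concreteness, here is a minimal NumPy sketch of this objective for a single training example; the function and variable names are ours, and the scoring function is left abstract:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_loss(score_fn, w, noise_words, noise_prob, k):
    """Negative of the per-example NCE objective (equation 7).

    score_fn(word)  -> unnormalized score s_theta(word, H) for the current context H
    w               -> the true next word from the data
    noise_words     -> k words sampled from the noise distribution P_n^H
    noise_prob(x)   -> P_n^H(x), the noise probability of word x
    """
    # Delta s_theta(w, H) = s_theta(w, H) - log(k * P_n^H(w))
    delta_true = score_fn(w) - np.log(k * noise_prob(w))
    loss = -np.log(sigmoid(delta_true))                # true example -> class C = 1
    for w_n in noise_words:
        delta_noise = score_fn(w_n) - np.log(k * noise_prob(w_n))
        loss -= np.log(1.0 - sigmoid(delta_noise))     # noise samples -> class C = 0
    return loss
```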

Experimental set-up
Noise contrastive estimation offers theoretical guarantees (Gutmann and Hyvärinen, 2010). First, the maximum of the global objective defined on an unlimited amount of data is reached for $s_{\theta^*} = \log P_d$, and it is the only extremum under mild conditions on the noise distribution. Secondly, the parameters that maximize our empirical objective converge in probability to $\theta^*$ as the amount of data grows. Finally, as the number $k$ of noise samples per example increases, the choice of the noise distribution $P_n$ has less impact on the estimation accuracy. Still, the noise distribution should be chosen close to the data distribution, to avoid a classification task so simplistic that it stops the learning process too early. To a certain extent, we can describe this as a trade-off between the number of samples and the effort we need to put into a 'good' noise distribution.
Considering these properties, we investigate the impact of the noise distribution on the training of language models. (Mnih and Teh, 2012) experimented with uniform and unigram distributions, while most of the subsequent literature used the unigram distribution, except for (Zoph et al., 2016), who used the uniform distribution with a very large vocabulary.
To monitor the training process with noise contrastive estimation, we report the average negative log-likelihood of the model, and its average log-partition function $\frac{1}{|D|} \sum_{(H,w) \in D} \log Z(H)$. In addition to the NCE score, we consider its true data term, defined by $\log \frac{e^{s_\theta(w,H)}}{e^{s_\theta(w,H)} + k P^H_n(w)}$, which quantifies how well the model recognizes examples from the true data as such, and can be used to estimate the posterior probabilities of each class during training (as described in equation 3).
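These monitoring quantities can be computed directly from the model scores. Below is a hedged sketch, assuming the scores of all words in $V$ are available for a given context; the function name and argument names are ours:

```python
import numpy as np
from scipy.special import logsumexp

def monitoring_metrics(scores, true_word_id, noise_prob_true, k):
    """Quantities tracked during NCE training, for one (H, w) pair.

    scores          -> array of s_theta(w', H) for every word w' in V
    true_word_id    -> index of the true next word w
    noise_prob_true -> P_n^H(w), noise probability of the true word
    k               -> number of noise samples per true example
    """
    log_partition = logsumexp(scores)                       # log Z(H)
    log_likelihood = scores[true_word_id] - log_partition   # normalized log P_theta^H(w)
    # "True data term": log posterior probability that w belongs to the data class,
    # computed in log space as s - log(e^s + k * P_n^H(w)).
    s = scores[true_word_id]
    true_data_term = s - np.logaddexp(s, np.log(k * noise_prob_true))
    return log_partition, log_likelihood, true_data_term
```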
Training was done on a relatively small English corpus (news-commentary 2012) of 4.2M words with a full vocabulary of ∼70K words. We trained simple feed-forward n-gram neural language models with Tensorflow (Abadi et al., 2015). Results are reported on the training data.
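For concreteness, here is a toy NumPy sketch of the kind of feed-forward n-gram scoring function involved; the sizes and names below are illustrative assumptions, not the hyper-parameters or exact architecture of the models trained in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (the paper's hyper-parameters are not reproduced here).
vocab_size, embed_dim, hidden_dim, context_size = 10_000, 64, 128, 3

E  = rng.normal(scale=0.01, size=(vocab_size, embed_dim))               # word embeddings
W1 = rng.normal(scale=0.01, size=(context_size * embed_dim, hidden_dim))
W2 = rng.normal(scale=0.01, size=(hidden_dim, vocab_size))              # output projection

def scores(context_word_ids):
    """s_theta(., H) for a feed-forward n-gram LM: concatenate the context
    embeddings, apply one hidden layer, then project to the vocabulary."""
    h = np.tanh(np.concatenate([E[i] for i in context_word_ids]) @ W1)
    return h @ W2        # one unnormalized score per word of V
```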

Experiments and Results
The first series of experiments compares different choices of noise distribution (uniform, unigram and bigram) for various vocabulary sizes (from ∼25K to the full vocabulary of ∼70K words). Figure 1 shows the evolution of several quantities during the first training epoch, when selecting all words appearing more than once (∼40K words). The same trend is observed for all vocabulary sizes.
For the three noise distributions, the NCE score seems to converge. However, with the unigram distribution, the log-partition function does not decrease, and neither does the negative log-likelihood. Interestingly, the posterior classification probabilities shown in the third column reveal a very ineffective behaviour: almost all the positive examples are classified in the noise class.
In contrast, the use of the uniform distribution yields more consistent results, although learning is slow.
Finally, learning with the bigram noise distribution shows a very consistent behaviour, with a log-partition function converging steadily to zero, as well as a negative log-likelihood on par with MLE training. It is moreover very data-efficient compared to the uniform distribution.

Table 1: Negative log-likelihood after one epoch of training with the full vocabulary, for various noise distributions and a varying number of noise samples k.

Table 1 shows the negative log-likelihood reached after one epoch of training, for a varying number of noise samples. For the sake of efficiency with context-independent noise distributions, we used for these experiments the NCE implementation native to Tensorflow, in which the noise samples are re-used for all the positive examples in the training batch. While this certainly lowers the performance of the algorithm, we believe it still demonstrates how strongly the number of noise samples affects the convergence speed for context-independent noise distributions, compared to the bigram distribution.
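For reference, here is a hedged sketch of how this batch-shared sampling looks with TensorFlow's built-in `tf.nn.nce_loss` (TensorFlow 1.x style API; the variable names and shapes are ours, not the paper's, and by default the sampler is a log-uniform rather than a true unigram distribution):

```python
import tensorflow as tf  # TensorFlow 1.x style API

# Hypothetical shapes: `hidden` is the batch of context representations,
# `next_words` the batch of true next-word ids.
vocab_size, hidden_dim, k = 70_000, 256, 100
hidden = tf.placeholder(tf.float32, [None, hidden_dim])
next_words = tf.placeholder(tf.int64, [None, 1])

out_weights = tf.get_variable("out_weights", [vocab_size, hidden_dim])
out_biases = tf.get_variable("out_biases", [vocab_size])

# The k noise samples are drawn once per batch (from a context-independent
# candidate sampler) and shared by all positive examples in that batch.
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=out_weights,
                   biases=out_biases,
                   labels=next_words,
                   inputs=hidden,
                   num_sampled=k,
                   num_classes=vocab_size))
```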
However, using the bigram distribution requires maintaining bigram counts. This can be costly with a large vocabulary, but not prohibitive. We therefore run further experiments with context-independent noise distributions.
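As an illustration, here is a minimal sketch of building and sampling such a context-dependent bigram noise distribution from raw counts (plain Python/NumPy, without smoothing; the helper names are ours):

```python
from collections import Counter, defaultdict
import numpy as np

def build_bigram_noise(corpus_word_ids):
    """Collect bigram counts and return, for each previous word, the words
    that followed it together with their empirical probabilities."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus_word_ids, corpus_word_ids[1:]):
        counts[prev][nxt] += 1
    noise = {}
    for prev, ctr in counts.items():
        words = np.array(list(ctr.keys()))
        probs = np.array(list(ctr.values()), dtype=float)
        noise[prev] = (words, probs / probs.sum())
    return noise

def sample_noise(noise, prev_word, k, rng=np.random.default_rng()):
    """Draw k noise words conditioned on the previous word (the context H)."""
    words, probs = noise[prev_word]
    return rng.choice(words, size=k, p=probs)
```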
A common trick when using any kind of negative sampling is to employ a distortion coefficient 0 < α < 1 to smooth the unigram distribution, raising every count c(w) to c(w)^α, as done in (Mikolov et al., 2013). We can then try to get the best of both distributions, i.e., a balance between sampling frequent and rare words as noise, while staying close to the data. Results are shown in figure 2. Distortion heavily influences how the model converges: being closer to the uniform distribution makes training easier, while retaining the overall shape of the unigram distribution is still needed. This is also shown in table 1.
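A short sketch of the distorted unigram distribution (raise each count to the power α, then renormalize); the function name is ours:

```python
import numpy as np

def distorted_unigram(counts, alpha=0.75):
    """Smooth a unigram distribution by raising counts to the power alpha.
    alpha = 1 recovers the unigram, alpha = 0 the uniform distribution."""
    counts = np.asarray(counts, dtype=float)
    distorted = counts ** alpha
    return distorted / distorted.sum()

# Example: rare words gain relative mass as alpha decreases.
counts = [1000, 100, 10, 1]
print(distorted_unigram(counts, alpha=1.0))
print(distorted_unigram(counts, alpha=0.5))
```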
To get a better idea of the differences between these distributions, we first examine the ability of the models to recognize positive examples as such for the portion of the vocabulary containing the most frequent words. The two top graphs of figure 3 show that both the uniform and a distorted unigram distribution help the model learn to classify the 1000 most frequent words, while almost no information seems to be retained on the rest of the vocabulary (which represents ∼1/4 of the training data). However, the model using a distorted unigram seems a little more balanced in what it learns, for about the same average performance. The third graph shows that its log-partition function behaves much better, which explains the negative log-likelihood gap observed in figure 2 between these two distributions.

Figure 3: Ability of the models to recognize true examples coming from the training data as such, for the 1K most frequent words, the rest of the vocabulary, and the average, for a uniform distribution and a unigram distribution with distortion. The bottom graph shows the two log-partition functions. Training is done on full-vocabulary models, with k = 100 noise samples, over 5 epochs.
These results show how changing the shape of the noise distribution can positively affect training: distortion smooths the unigram distribution, avoiding sampling only frequent words, while reaching a better negative log-likelihood than with a uniform distribution. However, as indicated by table 1, models trained with a bigram noise distribution need far fewer noise samples, or far less data.

Conclusion
Given the difficulty of training neural language models with NCE for large vocabularies, this paper aimed at a better understanding of its mechanisms and weaknesses. Our results indicate that the theoretical trade-off between the number of noise samples and the effort put into a 'good' noise distribution is verified in practice. This choice also impacts the quantity of training data required and the stability of training. Notably, a context-dependent noise distribution yields a satisfactory classification task, along with faster and steadier training. In future work, we plan to design an intermediate context-dependent noise distribution that scales well to large vocabularies.