Adversarial Contrastive Estimation

Learning by contrasting positive and negative samples is a general strategy adopted by many methods. Noise contrastive estimation (NCE) for word embeddings and translating embeddings for knowledge graphs are examples in NLP employing this approach. In this work, we view contrastive learning as an abstraction of all such methods and augment the negative sampler into a mixture distribution containing an adversarially learned sampler. The resulting adaptive sampler finds harder negative examples, which forces the main model to learn a better representation of the data. We evaluate our proposal on learning word embeddings, order embeddings and knowledge graph embeddings and observe both faster convergence and improved results on multiple metrics.


Introduction
Many models learn by contrasting losses on observed positive examples with those on some fictitious negative examples, trying to decrease some score on positive ones while increasing it on negative ones. There are multiple reasons why such a contrastive learning approach is needed. Computational tractability is one. For instance, instead of using softmax to predict a word for learning word embeddings, noise contrastive estimation (NCE) (Dyer, 2014; Mnih and Teh, 2012) can be used in skip-gram or CBOW word embedding models (Gutmann and Hyvärinen, 2012; Mikolov et al., 2013; Mnih and Kavukcuoglu, 2013; Vaswani et al., 2013). Another reason is modeling need, as certain assumptions are best expressed as some score or energy in margin-based or un-normalized probability models (Smith and Eisner, 2005). For example, modeling entity relations as translations or variants thereof in a vector space naturally leads to a distance-based score to be minimized for observed entity-relation-entity triplets (Bordes et al., 2013). Given a scoring function, the gradient of the model's parameters on observed positive examples can be readily computed, but the negative phase requires a design decision on how to sample data. In noise contrastive estimation for word embeddings, a negative example is formed by replacing a component of a positive pair with a word sampled at random from the vocabulary, resulting in a fictitious word-context pair that is unlikely to exist in the dataset. This negative sampling by corruption approach is also used in learning knowledge graph embeddings (Bordes et al., 2013; Lin et al., 2015; Ji et al., 2015; Wang et al., 2014; Trouillon et al., 2016; Yang et al., 2014; Dettmers et al., 2017), order embeddings (Vendrov et al., 2016), caption generation (Dai and Lin, 2017), etc.
* Authors contributed equally. † Work done while the author was an intern at Borealis AI.
Typically the corruption distribution is the same for all inputs, as in skip-gram or CBOW NCE, rather than being a conditional distribution that takes into account information about the input sample under consideration. Furthermore, the corruption process usually only encodes a human prior as to what constitutes a hard negative sample, rather than being learned from data. For these two reasons, the simple fixed corruption process often yields only easy negative examples. Easy negatives are sub-optimal for learning discriminative representations because they do not force the model to find critical characteristics of observed positive data, an observation made independently in applications outside NLP (Shrivastava et al., 2016). Even if hard negatives are occasionally reached, their infrequency means slow convergence. Designing a more sophisticated corruption process could be fruitful, but requires costly trial-and-error by a human expert.
In this work, we propose to augment the simple corruption noise process in various embedding models with an adversarially learned conditional distribution, forming a mixture negative sampler that adapts to the underlying data and the embedding model training progress. The resulting method is referred to as adversarial contrastive estimation (ACE). The adaptive conditional model engages in a minimax game with the primary embedding model, much like in Generative Adversarial Networks (GANs) (Goodfellow et al., 2014a), where a discriminator net (D) tries to distinguish samples produced by a generator (G) from real data (Goodfellow et al., 2014b). In ACE, the main model learns to distinguish between a real positive example and a negative sample selected by the mixture of a fixed NCE sampler and an adversarial generator. The main model and the generator take alternating turns to update their parameters. In fact, our method can be viewed as a conditional GAN (Mirza and Osindero, 2014) on discrete inputs, with a mixture generator consisting of a learned and a fixed distribution, and with additional techniques introduced to achieve stable and convergent training of embedding models.
In our proposed ACE approach, the conditional sampler finds harder negatives than NCE, while being able to gracefully fall back to NCE whenever the generator cannot find hard negatives. We demonstrate the efficacy and generality of the proposed method on three different learning tasks: word embeddings (Mikolov et al., 2013), order embeddings (Vendrov et al., 2016) and knowledge graph embeddings (Ji et al., 2015).

Background: contrastive learning
In the most general form, our method applies to supervised learning problems with a contrastive objective of the following form:
$$L(\omega) = \min_{\omega} \mathbb{E}_{p(x^+, y^+, y^-)}\, l_\omega(x^+, y^+, y^-), \qquad (1)$$
where $l_\omega(x^+, y^+, y^-)$ captures both the model with parameters $\omega$ and the loss that scores a positive tuple $(x^+, y^+)$ against a negative one $(x^+, y^-)$. $\mathbb{E}_{p(x^+, y^+, y^-)}(\cdot)$ denotes expectation with respect to some joint distribution over positive and negative samples. Furthermore, by the law of total expectation, and the fact that given $x^+$ the negative sampling does not depend on the positive label, i.e. $p(y^+, y^- | x^+) = p(y^+ | x^+)\, p(y^- | x^+)$, Eq. 1 can be re-written as
$$L(\omega) = \min_{\omega} \mathbb{E}_{p(x^+)} \left[ \mathbb{E}_{p(y^+ | x^+)\, p(y^- | x^+)}\, l_\omega(x^+, y^+, y^-) \right]. \qquad (2)$$

Separable loss
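In the case where the loss decomposes into a sum of scores on positive and negative tuples, such as $l_\omega(x^+, y^+, y^-) = s_\omega(x^+, y^+) - \tilde{s}_\omega(x^+, y^-)$, Eq. 2 becomes
$$L(\omega) = \min_{\omega} \mathbb{E}_{p^+(x)} \left[ \mathbb{E}_{p^+(y|x)}\, s_\omega(x, y) - \mathbb{E}_{p^-(y|x)}\, \tilde{s}_\omega(x, y) \right], \qquad (3)$$
where we moved the + and − to $p$ for notational brevity. Learning by stochastic gradient descent aims to adjust $\omega$ so as to push down $s_\omega(x, y)$ on samples from $p^+$ while pushing up $\tilde{s}_\omega(x, y)$ on samples from $p^-$. Note that for generality, the scoring function for negative samples, denoted by $\tilde{s}_\omega$, could be slightly different from $s_\omega$. For instance, $\tilde{s}$ could contain a margin as in the case of Order Embeddings in Sec. 4.2.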

Non separable loss
Eq. 1 is the general form that we would like to consider because for certain problems, the loss function cannot be separated into sums of terms containing only positive $(x^+, y^+)$ and terms with negatives $(x^+, y^-)$. An example of such a non-separable loss is the triplet ranking loss (Schroff et al., 2015): $l_\omega = \max(0, \eta + s_\omega(x^+, y^+) - s_\omega(x^+, y^-))$, which does not decompose due to the rectification.
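To make the distinction concrete, the following is a minimal sketch contrasting a separable loss with the triplet ranking loss. The function names and the example scores are illustrative assumptions, not part of the original work; lower scores are taken to be better for positives, as in the text.

```python
def separable_loss(s_pos: float, s_neg_tilde: float) -> float:
    """Separable form: positive score minus the (possibly margin-adjusted) negative score."""
    return s_pos - s_neg_tilde


def triplet_ranking_loss(s_pos: float, s_neg: float, margin: float = 1.0) -> float:
    """Non-separable form: max(0, margin + s_pos - s_neg)."""
    return max(0.0, margin + s_pos - s_neg)


# When the negative is not yet beaten by the margin, the hinge is active
# and both scores contribute to the loss (roughly 0.7 here).
active = triplet_ranking_loss(s_pos=0.5, s_neg=0.8, margin=1.0)

# Once the negative is beaten by the margin, the rectifier clips the loss to
# zero, so the positive term contributes nothing either. This coupling is
# why the loss cannot be split into independent positive and negative terms.
inactive = triplet_ranking_loss(s_pos=0.5, s_neg=2.0, margin=1.0)
```

In the separable case the positive term always contributes regardless of the negative sample drawn, which is what later allows the generator reward to depend on the negative score alone.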

Noise contrastive estimation
The typical NCE approach in tasks such as word embeddings (Mikolov et al., 2013), order embeddings (Vendrov et al., 2016), and knowledge graph embeddings can be viewed as a special case of Eq. 2 by taking $p(y^- | x^+)$ to be some unconditional $p_{nce}(y)$.
This leads to efficient computation during training. However, $p_{nce}(y)$ sacrifices the sampling efficiency of learning, because the negatives produced by a fixed distribution are not tailored toward $x^+$ and as a result are not necessarily hard negative examples. Thus, the model is not forced to discover a discriminative representation of observed positive data. As training progresses and more and more negative examples are correctly learned, the probability of drawing a hard negative example diminishes further, causing slow convergence.

Adversarial mixture noise
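To remedy the above-mentioned problem of a fixed unconditional negative sampler, we propose to augment it into a mixture, $\lambda p_{nce}(y) + (1 - \lambda) g_\theta(y|x)$, where $g_\theta$ is a conditional distribution with learnable parameters $\theta$ and $\lambda$ is a hyperparameter. The objective in Eq. 2 can then be written as (conditioned on $x$ for notational brevity):
$$L(\omega, \theta; x) = \lambda\, \mathbb{E}_{p(y^+|x)\, p_{nce}(y^-)}\, l_\omega(x, y^+, y^-) + (1 - \lambda)\, \mathbb{E}_{p(y^+|x)\, g_\theta(y^-|x)}\, l_\omega(x, y^+, y^-). \qquad (4)$$
We learn $(\omega, \theta)$ in a GAN-style minimax game:
$$\min_{\omega} \max_{\theta} V(\omega, \theta) = \min_{\omega} \max_{\theta} \mathbb{E}_{p^+(x)}\, L(\omega, \theta; x). \qquad (5)$$
The embedding model behind $l_\omega(x, y^+, y^-)$ is similar to the discriminator in a (conditional) GAN (or the critic in Wasserstein or Energy-based GAN (Zhao et al., 2016)), while $g_\theta(y|x)$ acts as the generator. Henceforth, we will use the terms discriminator (D) and embedding model interchangeably, and refer to $g_\theta$ as the generator.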
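As an illustration of the mixture sampler, here is a minimal sketch that draws one negative from λ p_nce(y) + (1 − λ) g_θ(y|x). The function name `sample_negative`, the callable interface for the generator, and the returned source tag are assumptions made for this example only.

```python
import random
from typing import Callable, Sequence, Tuple


def sample_negative(
    x: int,
    p_nce: Sequence[float],
    g_theta: Callable[[int], Sequence[float]],
    lam: float,
    rng: random.Random,
) -> Tuple[int, str]:
    """Draw y- from lam * p_nce(y) + (1 - lam) * g_theta(y | x).

    Returns the sampled candidate index and which component produced it.
    """
    candidates = range(len(p_nce))
    if rng.random() < lam:
        # Fixed, unconditional NCE component: ignores the input x.
        return rng.choices(candidates, weights=p_nce, k=1)[0], "nce"
    # Adversarial component: a categorical distribution conditioned on x.
    return rng.choices(candidates, weights=g_theta(x), k=1)[0], "adv"


# Hypothetical usage with four candidates and a generator that always
# proposes candidate 2 for any input.
rng = random.Random(0)
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = lambda x: [0.0, 0.0, 1.0, 0.0]
y_neg, source = sample_negative(0, uniform, peaked, lam=0.5, rng=rng)
```

Keeping track of the source component is convenient because only the adversarial samples enter the on-policy generator update.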

Learning the generator
There is one important distinction from a typical GAN: $g_\theta(y|x)$ defines a categorical distribution over possible $y$ values, and samples are drawn accordingly. This is in contrast to a typical GAN over a continuous data space such as images, where samples are generated by an implicit generative model that warps noise vectors into data points. Due to the discrete sampling step, $g_\theta$ cannot learn by receiving gradients through the discriminator. One possible solution is to use the Gumbel-softmax reparametrization trick (Jang et al., 2016; Maddison et al., 2016), which gives a differentiable approximation. However, this differentiability comes at the cost of drawing $N$ Gumbel samples per categorical sample, where $N$ is the number of categories. For word embeddings, $N$ is the vocabulary size, and for knowledge graph embeddings, $N$ is the number of entities, both leading to infeasible computational requirements.
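Instead, we use the REINFORCE (Williams, 1992) gradient estimator for $\nabla_\theta L(\theta, x)$:
$$\nabla_\theta L(\theta, x) = (1 - \lambda)\, \mathbb{E}\left[ l_\omega(x, y^+, y^-)\, \nabla_\theta \log g_\theta(y^- | x) \right], \qquad (6)$$
where the expectation $\mathbb{E}$ is with respect to $p(y^+, y^- | x) = p(y^+ | x)\, g_\theta(y^- | x)$, and the discriminator loss $l_\omega(x, y^+, y^-)$ acts as the reward.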
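With a separable loss, the (conditional) value function of the minimax game is:
$$L(\omega, \theta; x) = \mathbb{E}_{p(y^+|x)}\, s_\omega(x, y^+) - \lambda\, \mathbb{E}_{p_{nce}(y^-)}\, \tilde{s}_\omega(x, y^-) - (1 - \lambda)\, \mathbb{E}_{g_\theta(y^-|x)}\, \tilde{s}_\omega(x, y^-), \qquad (7)$$
and only the last term depends on the generator parameter $\theta$. Hence, with a separable loss, the reward is $-\tilde{s}_\omega(x, y^-)$. This reduction does not happen with a non-separable loss, and we have to use $l_\omega(x, y^+, y^-)$.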
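The score-function (REINFORCE) estimator for a categorical generator can be sketched as follows. This is an illustrative stdlib-only example, assuming the generator is a softmax over a vector of logits and that `reward_fn` returns the reward for a sampled candidate (for instance the negated negative score in the separable case). The constant factor (1 − λ) is omitted, and all names are hypothetical.

```python
import math
import random
from typing import Callable, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_grad(
    logits: List[float],
    reward_fn: Callable[[int], float],
    n_samples: int,
    rng: random.Random,
) -> List[float]:
    """Monte Carlo estimate of d/d(logits) of E_{y ~ g}[reward(y)]."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        # Discrete sampling step: no gradient flows through this draw.
        y = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        r = reward_fn(y)
        # For a softmax, d log g(y) / d logit_j = 1[j == y] - g(j).
        for j, p in enumerate(probs):
            grad[j] += r * ((1.0 if j == y else 0.0) - p) / n_samples
    return grad


# Hypothetical usage: three candidates, only candidate 2 yields a reward.
grad = reinforce_grad([0.0, 0.0, 0.0], lambda y: [0.0, 0.0, 1.0][y], 5000, random.Random(0))
```

A gradient ascent step on the logits along `grad` increases the probability of the rewarded candidate, which is how the generator drifts toward negatives that the discriminator currently scores poorly.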

Entropy and training stability
GAN training can suffer from instability and degeneracy where the generator probability mass collapses to a few modes or points. Much work has been done to stabilize GAN training in the continuous case (Gulrajani et al., 2017; Cao et al., 2018). In ACE, if the generator $g_\theta$ probability mass collapses to a few candidates, then after the discriminator successfully learns about these negatives, $g_\theta$ cannot adapt to select new hard negatives, because the REINFORCE gradient estimator in Eq. 6 relies on $g_\theta$ being able to explore other candidates during sampling. Therefore, if the $g_\theta$ probability mass collapses, instead of leading to oscillation as in a typical GAN, the min-max game in ACE reaches an equilibrium where the discriminator wins and $g_\theta$ can no longer adapt. ACE then falls back to NCE, since the negative sampler has another mixture component from NCE.
This behavior of gracefully falling back to NCE is more desirable than the stalled training that would result if $p^-(y|x)$ did not have a simple $p_{nce}$ mixture component. However, we would still like to avoid such collapse, as the adversarial samples provide greater learning signals than NCE samples. To this end, we propose to use a regularizer to encourage the categorical distribution $g_\theta(y|x)$ to have high entropy. In order to make the regularizer interpretable and its hyperparameters easy to tune, we design the following form:
$$R_{ent}(x) = \max(0,\, c - H(g_\theta(y|x))), \qquad (8)$$
where $H(g_\theta(y|x))$ is the entropy of the categorical distribution $g_\theta(y|x)$, $c = \log(k)$ is the entropy of a uniform distribution over $k$ choices, and $k$ is a hyper-parameter. Intuitively, $R_{ent}$ expresses the prior that the generator should spread its mass over more than $k$ choices for each $x$.
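The entropy regularizer can be sketched in a few lines. This example assumes the hinge form max(0, c − H(g)) with c = log(k), so the penalty is zero whenever the generator entropy is at least that of a uniform distribution over k choices; the function names are illustrative.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_regularizer(probs: Sequence[float], k: int) -> float:
    """Hinge penalty that is positive only when H(probs) < log(k)."""
    return max(0.0, math.log(k) - entropy(probs))


# A generator spread uniformly over 10 candidates is not penalized for k = 5,
# while a fully collapsed generator pays the maximum penalty log(5).
spread_penalty = entropy_regularizer([0.1] * 10, k=5)
collapsed_penalty = entropy_regularizer([1.0] + [0.0] * 9, k=5)
```

Because the penalty saturates at zero, k reads directly as "how many candidates the generator should at least cover", which is what makes the hyper-parameter easy to interpret.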

Handling false negatives
During negative sampling, $p^-(y|x)$ could actually produce a $y$ that forms a positive pair that exists in the training set, i.e., a false negative. This possibility exists in NCE already, but since $p_{nce}$ is not adaptive, the probability of sampling a false negative is low. Hence in NCE, the score on this false negative (true observation) pair is pushed up less by the negative term than it is pushed down by the positive term. However, with the adaptive sampler $g_\theta(y|x)$, false negatives become a much more severe issue: $g_\theta(y|x)$ can learn to concentrate its mass on a few false negatives, significantly canceling the learning of those observations in the positive phase. The entropy regularization reduces this problem as it forces the generator to spread its mass, hence reducing the chance of a false negative.
To further alleviate this problem, whenever computationally feasible, we apply an additional two-step technique. First, we maintain a hash map of the training data in memory, and use it to efficiently detect whether a negative sample $(x^+, y^-)$ is an actual observation. If so, its contribution to the loss is given a zero weight in the $\omega$ learning step. Second, to update $\theta$ in the generator learning step, the reward for false negative samples is replaced by a large penalty, so that the REINFORCE gradient update steers $g_\theta$ away from those samples. The second step is needed to prevent null computation, where $g_\theta$ learns to sample false negatives that are subsequently ignored by the discriminator update for $\omega$.
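The two-step treatment of false negatives can be sketched with a set-based lookup. This is an illustrative example: the helper names and the penalty value of −10.0 are assumptions, since the text only specifies "a large penalty".

```python
from typing import Hashable, Set, Tuple

# Hypothetical penalty value; the text does not state a number.
FALSE_NEGATIVE_PENALTY = -10.0

Pair = Tuple[Hashable, Hashable]


def discriminator_weight(x: Hashable, y_neg: Hashable, observed: Set[Pair]) -> float:
    """Step 1: zero out the loss contribution of a sampled pair that is a true observation."""
    return 0.0 if (x, y_neg) in observed else 1.0


def generator_reward(
    x: Hashable,
    y_neg: Hashable,
    raw_reward: float,
    observed: Set[Pair],
    penalty: float = FALSE_NEGATIVE_PENALTY,
) -> float:
    """Step 2: replace the reward of a false negative by a large penalty."""
    return penalty if (x, y_neg) in observed else raw_reward


# Hypothetical training pairs stored in a hash-based set for O(1) lookup.
observed_pairs = {("cat", "animal"), ("dog", "animal")}
w = discriminator_weight("cat", "animal", observed_pairs)    # ignored by D
r = generator_reward("cat", "animal", 0.3, observed_pairs)   # penalized for G
```

Applying only the first step would let the generator keep proposing pairs that the discriminator silently discards, which is the wasted computation the second step is meant to prevent.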

Variance Reduction
The basic REINFORCE gradient estimator is plagued by high variance, so in practice one often needs to apply variance reduction techniques. The most basic form of variance reduction is to subtract a baseline from the reward. As long as the baseline is not a function of actions (i.e., the samples $y^-$ being drawn), the REINFORCE gradient estimator remains unbiased. More advanced gradient estimators exist that also reduce variance (Grathwohl et al., 2017; Tucker et al., 2017; Liu et al., 2018), but for simplicity we use the self-critical baseline method (Rennie et al., 2016), where the baseline is $b(x) = l_\omega(x, y^+, y^\star)$, or $b(x) = -\tilde{s}_\omega(x, y^\star)$ in the separable loss case, and $y^\star = \arg\max_i g_\theta(y_i | x)$. In other words, the baseline is the reward of the most likely sample according to the generator.
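A minimal sketch of the self-critical baseline follows, assuming the generator distribution is available as a list of probabilities and `reward_fn` maps a candidate index to its reward. The names are illustrative.

```python
from typing import Callable, Sequence


def self_critical_advantage(
    probs: Sequence[float],
    reward_fn: Callable[[int], float],
    y_sampled: int,
) -> float:
    """Reward of the sampled candidate minus the reward of the generator's mode."""
    # The baseline uses the most likely candidate under the generator. It does
    # not depend on which y was actually sampled, so subtracting it keeps the
    # REINFORCE estimator unbiased.
    y_star = max(range(len(probs)), key=lambda i: probs[i])
    return reward_fn(y_sampled) - reward_fn(y_star)


# Hypothetical rewards for three candidates; the mode is candidate 1.
rewards = [1.0, 0.5, 2.0]
adv = self_critical_advantage([0.1, 0.7, 0.2], lambda y: rewards[y], y_sampled=2)
```

Samples that beat the generator's current favorite get a positive advantage and are reinforced, while sampling the favorite itself yields zero signal.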

Improving exploration in $g_\theta$ by leveraging NCE samples
In Sec. 2.4 we touched on the need for sufficient exploration in $g_\theta$. It is possible to also leverage negative samples from NCE to help the generator learn. This is essentially off-policy exploration in reinforcement learning, since NCE samples are not drawn according to $g_\theta(y|x)$. The generator learning can use importance re-weighting to leverage those samples. The resulting REINFORCE gradient estimator is basically the same as Eq. 6 except that the rewards are reweighted by $g_\theta(y^-|x) / p_{nce}(y^-)$, and the expectation is with respect to $p(y^+|x)\, p_{nce}(y^-)$. This additional off-policy learning term provides gradient information for generator learning only if $g_\theta(y^-|x)$ is not zero, meaning that for it to be effective in helping exploration, the generator cannot be collapsed in the first place. Hence, in practice, this term is only used to further help on top of the entropy regularization, but it does not replace it.
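The importance re-weighting can be checked numerically on a toy problem. The sketch below assumes a softmax generator with gradients taken with respect to its logits; `off_policy_term` and `expected_off_policy_grad` are hypothetical helper names, and the expectation over NCE samples is computed exactly rather than by sampling so the identity is easy to verify.

```python
from typing import List, Sequence


def off_policy_term(
    y: int, g_probs: Sequence[float], p_nce: Sequence[float], reward: float
) -> List[float]:
    """Importance-weighted REINFORCE term for a single NCE sample y."""
    weight = g_probs[y] / p_nce[y]  # requires p_nce[y] > 0
    # Softmax score function: d log g(y) / d logit_j = 1[j == y] - g(j).
    return [weight * reward * ((1.0 if j == y else 0.0) - g) for j, g in enumerate(g_probs)]


def expected_off_policy_grad(
    g_probs: Sequence[float], p_nce: Sequence[float], rewards: Sequence[float]
) -> List[float]:
    """Exact expectation of the term above under y ~ p_nce."""
    grad = [0.0] * len(g_probs)
    for y, q in enumerate(p_nce):
        term = off_policy_term(y, g_probs, p_nce, rewards[y])
        grad = [a + q * t for a, t in zip(grad, term)]
    return grad


# Toy check: the off-policy expectation matches the on-policy gradient
# g_j * (r_j - E_g[r]) even though samples come from a uniform p_nce.
g = [0.7, 0.2, 0.1]
r = [1.0, 2.0, 3.0]
off_policy = expected_off_policy_grad(g, [1.0 / 3.0] * 3, r)
mean_r = sum(gi * ri for gi, ri in zip(g, r))
on_policy = [gi * (ri - mean_r) for gi, ri in zip(g, r)]
```

The weight is exactly zero for any candidate on which the generator places no mass, which mirrors the caveat that this term cannot rescue a generator that has already collapsed.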

Related Work
Smith and Eisner (2005) proposed contrastive estimation as a way for unsupervised learning of log-linear models by taking implicit evidence from user-defined neighborhoods around observed datapoints. Gutmann and Hyvärinen (2010) introduced NCE as an alternative to the hierarchical softmax. In the works of Mnih and Teh (2012) and Mnih and Kavukcuoglu (2013), NCE is applied to log-bilinear models and Vaswani et al. (2013) applied NCE to neural probabilistic language models (Yoshua et al., 2003). Compared to these previous NCE methods that rely on simple fixed sampling heuristics, ACE uses an adaptive sampler that produces harder negatives.
In the domain of max-margin estimation for structured prediction (Taskar et al., 2005), loss-augmented MAP inference plays the role of finding hard negatives (in fact, the hardest ones). However, this inference is only tractable in a limited class of models such as structured SVM (Tsochantaridis et al., 2005). Compared to those models that use exact maximization to find the hardest negative configuration each time, the generator in ACE can be viewed as learning an approximate amortized inference network. Concurrently to this work, Tu and Gimpel (2018) propose a very similar framework, using a learned inference network for structured prediction energy networks (SPEN) (Belanger and McCallum, 2016).
Concurrent with our work, there has been other interest in applying GANs to NLP problems (Fedus et al., 2018; Wang et al., 2018; Cai and Wang, 2017). Knowledge graph models naturally lend themselves to a GAN setup, and have been the subject of study in Wang et al. (2018) and Cai and Wang (2017). These two concurrent works are most closely related to one of the three tasks on which we study ACE in this work. Besides offering a more general formulation that applies to problems beyond those considered in Wang et al. (2018) and Cai and Wang (2017), the techniques introduced in our work on handling false negatives and entropy regularization lead to improved experimental results, as shown in Sec. 5.4.

Application of ACE on three tasks

Word Embeddings
Word embeddings learn a vector representation of words from co-occurrences in a text corpus. NCE casts this learning problem as a binary classification where the model tries to distinguish positive word and context pairs from negative noise samples composed of word and false context pairs. The NCE objective in Skip-gram (Mikolov et al., 2013) for word embeddings is a separable loss of the form:
$$L_{NCE}(w_t) = \log p(y = 1 | w_t, w_c^+) + \sum_{c=1}^{k} \log p(y = 0 | w_t, w_c^-). \qquad (9)$$
Here, $w_c^+$ is sampled from the set of true contexts and $w_c^- \sim Q$ is sampled $k$ times from a fixed noise distribution. Mikolov et al. (2013) introduced a further simplification of NCE, called "Negative Sampling" (Dyer, 2014). With respect to our ACE framework, the difference between NCE and Negative Sampling is inconsequential, so we continue the discussion using NCE. A drawback of this sampling scheme is that it favors more common words as context. Another issue is that the negative context words are all sampled in the same way, rather than tailored toward the actual target word. To apply ACE to this problem we first define the value function for the minimax game, $V(D, G)$, as follows:
$$V(D, G) = \mathbb{E}_{p^+(w_c)}\left[\log D(w_c, w_t)\right] + \mathbb{E}_{p_{nce}(w_c)}\left[\log(1 - D(w_c, w_t))\right] + \mathbb{E}_{g_\theta(w_c | w_t)}\left[\log(1 - D(w_c, w_t))\right], \qquad (10)$$
with $D = p(y = 1 | w_t, w_c)$ and $G = g_\theta(w_c | w_t)$.
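For concreteness, the skip-gram NCE objective for a single target word can be sketched as below. The example assumes the common parameterization in which the discriminator probability is a sigmoid of the dot product between target and context vectors; this parameterization, the function names, and the toy vectors are assumptions for illustration.

```python
import math
from typing import List, Sequence


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def nce_objective(
    w_t: Sequence[float],
    ctx_pos: Sequence[float],
    ctx_negs: List[Sequence[float]],
) -> float:
    """log p(y=1 | w_t, w_c+) + sum over negatives of log p(y=0 | w_t, w_c-)."""
    objective = log_sigmoid(dot(w_t, ctx_pos))
    for ctx_neg in ctx_negs:
        # log(1 - sigmoid(z)) equals log(sigmoid(-z)).
        objective += log_sigmoid(-dot(w_t, ctx_neg))
    return objective


# Hypothetical 2-d embeddings: one true context and two sampled negatives.
value = nce_objective([1.0, 0.0], [2.0, 0.0], [[-1.0, 0.0], [0.0, 3.0]])
```

In ACE some of the negative context vectors would come from the generator instead of the fixed noise distribution, while the form of each term stays the same.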

Implementation details
For our experiments, we train all our models on a single pass of the May 2017 dump of the English Wikipedia with lowercased unigrams. The vocabulary size is restricted to the top 150k most frequent words when training from scratch, while for finetuning we use the same vocabulary as Pennington et al. (2014), which is 400k of the most frequent words. We use 5 NCE samples for each positive sample and 1 adversarial sample in a window size of 10, and the same positive subsampling scheme proposed by Mikolov et al. (2013). Learning for both G and D uses the Adam optimizer (Kingma and Ba, 2014) with its default parameters. Our conditional discriminator is modeled using the Skip-Gram architecture, which is a two-layer neural network with a linear mapping between the layers. The generator network consists of an embedding layer followed by two small hidden layers, followed by an output softmax layer. The first layer of the generator shares its weights with the second embedding layer in the discriminator network, which we find noticeably speeds up convergence as the generator does not have to relearn its own set of embeddings. The difference between the discriminator and generator is that a sigmoid nonlinearity is used after the second layer in the discriminator, while in the generator, a softmax layer is used to define a categorical distribution over negative word candidates. We find that controlling the generator entropy is critical for finetuning experiments, as otherwise the generator collapses to its favorite negative sample. The word embeddings are taken to be the first dense matrix in the discriminator.

Order Embeddings Hypernym Prediction
As introduced in Vendrov et al. (2016), ordered representations over hierarchy can be learned by order embeddings. An example task for such ordered representation is hypernym prediction. A hypernym pair is a pair of concepts where the first concept is a specialization or an instance of the second.
For completeness, we briefly describe order embeddings, then analyze ACE on the hypernym prediction task. In order embeddings, each entity is represented by a vector in $\mathbb{R}^N$. The score for a positive ordered pair of entities $(x, y)$ is defined by $s_\omega(x, y) = \lVert \max(0, y - x) \rVert^2$, and the score for a negative ordered pair $(x^+, y^-)$ is defined by $\tilde{s}_\omega(x^+, y^-) = \max\{0, \eta - s(x^+, y^-)\}$, where $\eta$ is the margin. Let $f(u)$ be the embedding function which takes an entity as input and outputs an embedding vector. We define $P$ as the set of positive pairs and $N$ as the set of negative pairs. The separable loss function for the order embedding task is defined by:
$$L = \sum_{(u, v) \in P} s_\omega(f(u), f(v)) + \sum_{(u', v') \in N} \tilde{s}_\omega(f(u'), f(v')). \qquad (11)$$
Minimizing the hinge term $\tilde{s}_\omega$ pushes $s_\omega$ up on negative pairs until it exceeds the margin.
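The order embedding scores can be sketched directly from the definitions above. This example assumes the squared-norm convention of Vendrov et al. (2016) for the order violation and a loss that sums the positive scores and the hinged negative scores; function names and the toy vectors are illustrative.

```python
from typing import List, Sequence, Tuple

Vec = Sequence[float]


def order_score(x: Vec, y: Vec) -> float:
    """Order violation || max(0, y - x) ||^2; zero when x dominates y coordinate-wise."""
    return sum(max(0.0, yi - xi) ** 2 for xi, yi in zip(x, y))


def negative_score(x: Vec, y_neg: Vec, margin: float) -> float:
    """Hinged score for a negative pair: max(0, margin - s(x, y_neg))."""
    return max(0.0, margin - order_score(x, y_neg))


def order_embedding_loss(
    positives: List[Tuple[Vec, Vec]], negatives: List[Tuple[Vec, Vec]], margin: float
) -> float:
    """Sum of violations on positive pairs plus hinged scores on negative pairs."""
    pos_term = sum(order_score(x, y) for x, y in positives)
    neg_term = sum(negative_score(x, y, margin) for x, y in negatives)
    return pos_term + neg_term


# A pair with no violation scores 0; swapping its order yields a violation of 5.
no_violation = order_score([2.0, 3.0], [1.0, 1.0])
violation = order_score([1.0, 1.0], [2.0, 3.0])
```

A negative pair with zero violation is a hard negative in this setting: it incurs the full margin as loss, whereas a negative that already violates the order by more than the margin contributes nothing.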

Implementation details
Our generator for this task is just a linear fully connected softmax layer, taking an embedding vector from the discriminator as input and outputting a categorical distribution over the entity set. For the discriminator, we inherit all model settings from Vendrov et al. (2016): we use a 50-dimensional hidden state, a batch size of 1000, a learning rate of 0.01 and the Adam optimizer. For the generator, we use a batch size of 1000, a learning rate of 0.01 and the Adam optimizer. We apply weight decay with rate 0.1 and entropy loss regularization as described in Sec. 2.4. We handle false negatives as described in Sec. 2.5. Based on cross-validation, variance reduction and leveraging NCE samples do not greatly affect the order embedding task.

Knowledge Graph Embeddings
Knowledge graphs contain entity and relation data of the form (head entity, relation, tail entity), and the goal is to learn from observed positive entity relations and predict missing links (a.k.a. link prediction). There have been many works on knowledge graph embeddings, e.g. TransE (Bordes et al., 2013), TransR (Lin et al., 2015), TransH (Wang et al., 2014), TransD (Ji et al., 2015), Complex (Trouillon et al., 2016), DistMult (Yang et al., 2014) and ConvE (Dettmers et al., 2017). Many of them use a contrastive learning objective. Here we take TransD as an example, and modify its noise contrastive learning to ACE, and demonstrate significant improvement in sample efficiency and link prediction results.

Implementation details
Let a positive entity-relation-entity triplet be denoted by $\xi^+ = (h^+, r^+, t^+)$. A negative triplet could have either its head or its tail be a negative sample, i.e. $\xi^- = (h^-, r^+, t^+)$ or $\xi^- = (h^+, r^+, t^-)$. In either case, the general formulation in Sec. 2.1 still applies. The non-separable loss function takes on the form:
$$L = \max(0,\, \eta + s_\omega(\xi^+) - s_\omega(\xi^-)). \qquad (12)$$
The scoring rule is:
$$s = \lVert h_\perp + r - t_\perp \rVert, \qquad (13)$$
where $r$ is the embedding vector for the relation, and $h_\perp$ is the projection of the embedding of $h$ onto the space of $r$ by $h_\perp = h + r_p h_p^\top h$, where $r_p$ and $h_p$ are projection parameters of the model. $t_\perp$ is defined in a similar way through parameters $t$, $t_p$ and $r_p$. The form of the generator $g_\theta(t^- | r^+, h^+)$ is chosen to be $f_\theta(h_\perp, h_\perp + r)$, where $f_\theta$ is a feedforward neural net that concatenates its two input arguments, then propagates through two hidden layers, followed by a final softmax output layer. As a function of $(r^+, h^+)$, $g_\theta$ shares parameters with the discriminator, as the inputs to $f_\theta$ are the embedding vectors. During generator learning, only $\theta$ is updated and the TransD model embedding parameters are frozen.
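The TransD projection and score described above can be sketched as follows. The example assumes entity and relation embeddings of equal dimension and a Euclidean norm for the score; both are simplifying assumptions for illustration, and the function names are hypothetical.

```python
import math
from typing import List, Sequence

Vec = Sequence[float]


def project(e: Vec, e_p: Vec, r_p: Vec) -> List[float]:
    """TransD-style projection e_perp = e + r_p * (e_p . e)."""
    inner = sum(a * b for a, b in zip(e_p, e))
    return [ei + rpi * inner for ei, rpi in zip(e, r_p)]


def transd_score(h: Vec, h_p: Vec, r: Vec, r_p: Vec, t: Vec, t_p: Vec) -> float:
    """Distance || h_perp + r - t_perp ||; lower means a more plausible triplet."""
    h_perp = project(h, h_p, r_p)
    t_perp = project(t, t_p, r_p)
    return math.sqrt(sum((a + b - c) ** 2 for a, b, c in zip(h_perp, r, t_perp)))


def triplet_loss(score_pos: float, score_neg: float, margin: float) -> float:
    """Non-separable margin loss max(0, margin + s(pos) - s(neg))."""
    return max(0.0, margin + score_pos - score_neg)


# Toy example: with zero projection vectors TransD reduces to h + r - t.
zero = [0.0, 0.0]
s_pos = transd_score([1.0, 0.0], zero, [0.0, 1.0], zero, [1.0, 1.0], zero)
s_neg = transd_score([1.0, 0.0], zero, [0.0, 1.0], zero, [4.0, 5.0], zero)
loss = triplet_loss(s_pos, s_neg, margin=1.0)
```

A corrupted tail close to the projected `h + r` would produce a small `s_neg` and a positive loss, which is the kind of hard negative the generator is trained to propose.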

Experiments
We evaluate ACE with experiments on word embeddings, order embeddings, and knowledge graph embeddings tasks. In short, whenever the original learning objective is contrastive (all tasks except Glove fine-tuning), our results consistently show that ACE improves over NCE. In some cases, we include additional comparisons to the state-of-the-art results on the task to put the significance of such improvements in context: the generic ACE can often make a reasonable baseline competitive with SOTA methods that are optimized for the task.
For word embeddings, we evaluate models trained from scratch as well as fine-tuned Glove models (Pennington et al., 2014) on word similarity tasks that consist of computing the similarity between word pairs, where the ground truth is an average of human scores. We choose the Rare word dataset (Luong et al., 2013) and WordSim-353 (Finkelstein et al., 2001) because we hypothesize that ACE learns better representations for both rare and frequent words. We also qualitatively evaluate ACE word embeddings by inspecting the nearest neighbors of selected words. For the hypernym prediction task, following Vendrov et al. (2016), hypernym pairs are created from the WordNet hierarchy's transitive closure. We use the released random development split and test split from Vendrov et al. (2016), which both contain 4000 edges.
For knowledge graph embeddings, we use TransD (Ji et al., 2015) as our base model, perform an ablation study to analyze the behavior of ACE with various add-on features, and confirm that entropy regularization is crucial for good performance in ACE. We also obtain link prediction results that are competitive with or superior to the state of the art on the WN18 dataset (Bordes et al., 2014).

Training Word Embeddings from scratch
In this experiment, we empirically observe that training word embeddings using ACE converges significantly faster than NCE after one epoch. As shown in Fig. 3, both ACE (a mixture of $p_{nce}$ and $g_\theta$) and just $g_\theta$ (denoted by ADV) significantly outperform the NCE baseline, with an absolute improvement of 73.1% and 58.5% respectively on RW score. We note similar results on the WordSim-353 dataset, where ACE and ADV outperform NCE by 40.4% and 45.7%. We also evaluate our model qualitatively by inspecting the nearest neighbors of selected words in Table 1. We first present the five nearest neighbors to each word to show that both NCE and ACE models learn sensible embeddings. We then show that ACE embeddings have much better semantic relevance in a larger neighborhood (nearest neighbor 45-50).

Finetuning Word Embeddings
We take off-the-shelf pre-trained Glove embeddings which were trained using 6 billion tokens (Pennington et al., 2014) and fine-tune them using our algorithm. It is interesting to note that the original Glove objective does not fit into the contrastive learning framework, but nonetheless we find that they benefit from ACE. In fact, we observe that training such that 75% of the words appear as positive contexts is sufficient to beat the largest dimensionality pre-trained Glove model on word similarity tasks. We evaluate our performance on the Rare Word and WordSim353 data. As can be seen from our results in Table 2, ACE on RW is not always better and for the 100d and 300d Glove embeddings is marginally worse. However, on WordSim353 ACE does considerably better across the board to the point where 50d Glove embeddings outperform the 300d baseline Glove model.

Hypernym Prediction
As shown in Table 3, with ACE training, our method achieves a 1.5% improvement in accuracy over Vendrov et al. (2016) without tuning any of the discriminator's hyperparameters. Fig. 1 further reports the training curve, measured as the loss on randomly sampled pairs. We stress that in the ACE model, we train on random pairs and generator-generated pairs jointly; as shown in Fig. 2, hard negatives help the order embedding model converge faster.
For finetuned models we recomputed the scores based on the publicly available 6B tokens Glove models and we finetuned until roughly 75% of the vocabulary was seen.

Ablation Study and Improving TransD
To analyze different aspects of ACE, we perform an ablation study on the knowledge graph embedding task. As described in Sec. 4.3, the base model (discriminator) we apply ACE to is TransD (Ji et al., 2015). Fig. 5 shows validation performance as training progresses. All variants of ACE converge to better results than base NCE. Among ACE variants, all methods that include entropy regularization significantly outperform those without entropy regularization. Without the self-critical baseline variance reduction, learning could progress faster at the beginning but the final performance suffers slightly. The best performance is obtained without the additional off-policy learning of the generator. Table 4 shows the final test results on the WN18 link prediction task. It is interesting to note that ACE improves the MRR score more significantly than hit@10. As MRR is a lot more sensitive to the top rankings, i.e., how the correct configuration ranks among the competitive alternatives, this is consistent with the fact that ACE samples hard negatives and forces the base model to learn a more discriminative representation of the positive examples.
Table 3: Order Embedding Performance
Method                        Accuracy (%)
order-embeddings              90.6
order-embeddings + Our ACE    92.0
[...] (Trouillon et al., 2016), which achieves the SOTA on this dataset. Among all TransD based models (the best result in this group is underlined), ACE improves over basic NCE and another GAN-based approach, KBGAN. The gap on MRR is likely due to the difference between TransD and COMPLEX models.
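The different sensitivity of MRR and hit@10 to top rankings can be illustrated with a small computation. The rank lists below are made-up numbers chosen only to show the effect; they are not results from the experiments.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1 / rank of the correct entity over test triplets."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks: Sequence[int], k: int = 10) -> float:
    """Fraction of test triplets whose correct entity is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of the correct entity for four test triplets.
ranks_coarse = [8, 9, 10, 7]   # correct answer always in the top 10, never near the top
ranks_sharp = [1, 2, 1, 3]     # correct answer beats most competitive alternatives

# Both models have identical hits@10, but very different MRR.
same_hits = hits_at_k(ranks_coarse) == hits_at_k(ranks_sharp)
mrr_gap = mean_reciprocal_rank(ranks_sharp) - mean_reciprocal_rank(ranks_coarse)
```

A model that learns to separate the correct entity from its closest competitors therefore gains much more in MRR than in hit@10, which matches the pattern reported in the text.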

Hard Negative Analysis
To better understand the effect of the adversarial samples proposed by the generator, we plot the discriminator loss on both $p_{nce}$ and $g_\theta$ samples. In this context, a harder sample means a higher loss assigned by the discriminator. Fig. 4 shows that the discriminator loss for the word embedding task on $g_\theta$ samples is always higher than on $p_{nce}$ samples, confirming that the generator is indeed sampling harder negatives. For the Hypernym Prediction task, Fig. 2 shows the discriminator loss on negative pairs sampled from NCE and ACE respectively. The higher the loss, the harder the negative pair is. As indicated in the left plot, the loss on the ACE negative terms collapses faster than on the NCE negatives. After adding entropy regularization and weight decay, the generator works as expected.

Limitations
When the generator softmax is large, the current implementation of ACE training is computationally expensive. Although ACE converges faster per iteration, it may converge more slowly in wall-clock time depending on the cost of the softmax. However, embeddings are typically used as pre-trained building blocks for subsequent tasks. Thus, their learning is usually a pre-computation step for the more complex downstream models, and spending more time is justified, especially with GPU acceleration. We believe that the computational cost could potentially be reduced via existing techniques such as the "augment and reduce" variational inference of Ruiz et al. (2018), adaptive softmax (Grave et al., 2016), or the "sparsely-gated" softmax of Shazeer et al. (2017), but leave that to future work. Another limitation is on the theoretical front. As noted in Goodfellow (2014), GAN learning does not implement maximum likelihood estimation (MLE), while NCE has MLE as an asymptotic limit. To the best of our knowledge, more distant connections between GAN and MLE training are not known, and tools for analyzing the equilibrium of a min-max game where players are parametrized by deep neural nets are currently not available.

Conclusion
In this paper, we propose Adversarial Contrastive Estimation as a general technique for improving supervised learning problems that learn by contrasting observed and fictitious samples. Specifically, we use a generator network in a conditional-GAN-like setting to propose hard negative examples for our discriminator model. We find that a mixture of random negative sampling and an adaptive negative sampler leads to improved performance on a variety of embedding tasks. We validate our hypothesis that hard negative examples are critical to optimal learning and can be proposed via our ACE framework. Finally, we find that controlling the entropy of the generator through a regularization term and properly handling false negatives are crucial for successful training.