An Empirical Evaluation of Noise Contrastive Estimation for the Neural Network Joint Model of Translation

The neural network joint model of translation or NNJM (Devlin et al., 2014) combines source and target context to produce a powerful translation feature. However, its softmax layer necessitates a sum over the entire output vocabulary, which results in very slow maximum likelihood (MLE) training. This has led some groups to train using Noise Contrastive Estimation (NCE), which side-steps this sum. We carry out the first direct comparison of MLE and NCE training objectives for the NNJM, showing that NCE is significantly outperformed by MLE on large-scale Arabic-English and Chinese-English translation tasks. We also show that this drop can be avoided by using a recently proposed translation noise distribution.


Introduction
The Neural Network Joint Model of Translation, or NNJM (Devlin et al., 2014), is a strong feature for statistical machine translation. The NNJM uses both target and source tokens as context for a feedforward neural network language model (LM). Unfortunately, its softmax layer requires a sum over the entire output vocabulary, which slows the calculation of LM probabilities and the maximum likelihood estimation (MLE) of model parameters. Devlin et al. (2014) address this problem at runtime only with a self-normalized MLE objective. Others advocate the use of Noise Contrastive Estimation (NCE) to train NNJMs and similar monolingual LMs (Mnih and Teh, 2012; Vaswani et al., 2013; Baltescu and Blunsom, 2015; Zhang et al., 2015). NCE avoids the sum over the output vocabulary at both train- and run-time by wrapping the NNJM inside a classifier that attempts to separate real data from sampled noise, greatly improving training speed. The training efficiency of NCE is well-documented, and will not be evaluated here. However, the experimental evidence that NCE matches MLE in terms of resulting model quality is all on monolingual language modeling tasks (Mnih and Teh, 2012). Since cross-lingual contexts provide substantially stronger signals than monolingual ones, there is reason to suspect these results may not carry over to NNJMs.
To our knowledge there is no published work that directly compares MLE and NCE in the context of an NNJM; this paper fills that gap as its primary contribution. We measure model likelihood and translation quality in large-scale Arabic-to-English and Chinese-to-English translation tasks. We also test a recently-proposed translation noise distribution for NCE (Zhang et al., 2015), along with a mixture of noise distributions. Finally, we test a widely known, but apparently undocumented, technique for domain adaptation of NNJMs, demonstrating its utility, as well as its impact on the MLE-NCE comparison.

Methods
The NNJM adds a bilingual context window to the machinery of feed-forward neural network language models, or NNLMs (Bengio et al., 2003). An NNLM calculates the probability $p(e_i \mid e_{i-n+1}^{i-1})$ of a word $e_i$ given its $n-1$ preceding words, while an NNJM assumes access to a source sentence $F$ and an aligned source index $a_i$ that points to the most influential source word for the next translation choice. It calculates $p(e_i \mid e_{i-n+1}^{i-1}, f_{a_i-m}^{a_i+m})$, which accounts for $2m+1$ words of source context, centered around $f_{a_i}$. The two models differ only in their definition of the conditioning context, which we will generalize with the variable $c_i = e_{i-n+1}^{i-1}, f_{a_i-m}^{a_i+m}$. When unambiguous, we drop the subscript $i$ from $e$ and $c$.
The feed-forward neural network that powers both models takes a context sequence $c$ as input to its network, which includes an embedding layer, one or more hidden layers, and a top-level softmax layer that assigns probabilities to each word in the vocabulary $V$. Let $s_c(e)$ represent the unnormalized neural network score for the word $e$. The softmax layer first calculates $Z_c = \sum_{e' \in V} \exp s_c(e')$, which allows it to then normalize the score into a log probability $\log p(e \mid c) = s_c(e) - \log Z_c$. Given a training set of word-context pairs, MLE training of NNJMs minimizes the negative log likelihood $\sum_{e,c} -\log p(e \mid c)$.
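As a concrete illustration of the softmax normalization and the MLE objective described above, the following is a minimal sketch in plain Python. It assumes the network scores $s_c(e)$ have already been computed and are passed in as a list indexed by vocabulary position; the function names `log_softmax` and `negative_log_likelihood` are illustrative and do not come from the paper's codebase.

```python
import math

def log_softmax(scores):
    """Return (log_probs, log_z) for unnormalized scores s_c(e) over the vocabulary."""
    m = max(scores)  # shift by the max score for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores], log_z

def negative_log_likelihood(examples):
    """Sum -log p(e|c) over (scores, target_index) pairs."""
    total = 0.0
    for scores, target in examples:
        log_probs, _ = log_softmax(scores)
        total -= log_probs[target]
    return total

# Toy usage: a 4-word vocabulary with uniform scores gives p = 1/4 for every word.
toy_log_probs, toy_log_z = log_softmax([0.0, 0.0, 0.0, 0.0])
toy_nll = negative_log_likelihood([([0.0, 0.0, 0.0, 0.0], 2)])
```

The loop over `scores` inside `log_softmax` is the sum over the entire vocabulary that makes this objective expensive for a 32K-word output layer.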
The problem with this objective is that calculating $Z_c$ requires a sum over the entire vocabulary, which is very expensive. This problem has received much recent study, but Devlin et al. (2014) proposed a novel solution for their NNJM, which we refer to as self-normalization. Assume we are willing to incur the cost of calculating $Z_c$ during training, which might be mitigated by special-purpose hardware such as graphics processing units (GPUs). One can modify the MLE objective to encourage $\log Z_c$ to be small, so that the term can be safely dropped at run-time:
$$-\sum_{e,c} \left[ \log p(e \mid c) - \alpha \left( \log Z_c \right)^2 \right]$$
where $\alpha$ trades self-normalization against model likelihood. Devlin et al. (2014) have shown that self-normalization has minimal impact on model quality and a tremendous impact on run-time efficiency.
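The self-normalized objective can be sketched per training example as follows. This is a hedged sketch that assumes the penalty takes the squared form $\alpha (\log Z_c)^2$ used by Devlin et al. (2014); the default `alpha=0.1` matches the value reported for the MLE system in the experiments, and the helper names are illustrative only.

```python
import math

def log_z(scores):
    """log Z_c computed with the usual max-shift for numerical stability."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def self_normalized_loss(scores, target, alpha=0.1):
    """-log p(e|c) + alpha * (log Z_c)^2 for one training example (assumed form)."""
    lz = log_z(scores)
    return -(scores[target] - lz) + alpha * lz ** 2

def runtime_log_prob(scores, target):
    """Run-time approximation: treat log Z_c as 0 and return the raw score."""
    return scores[target]

# When the scores already sum to 1 after exponentiation, the penalty vanishes
# and the raw score equals the true log probability.
normalized_scores = [math.log(0.5), math.log(0.5)]
```

At run-time only `runtime_log_prob` is needed, which touches a single output unit instead of the whole vocabulary.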

Noise Contrastive Estimation
Introduced by Gutmann and Hyvärinen (2010) and first applied to language modeling by Mnih and Teh (2012), NCE allows one to train self-normalized models without calculating $Z$. It does so by defining a noise distribution $q$ over words in $V$, which is typically a unigram noise distribution $q_u$. It samples $k$ noise words $\hat{e}_1^k$ for each training word $e$, and wraps the NNJM inside a binary classifier that attempts to separate true data from noise. Let $D$ be a binary variable that is 1 for true data and 0 for noise. We know the joint noise probability $p(D=0, e \mid c) = \frac{k}{k+1} q(e)$, and we can approximate the joint data probability using our neural network: $p(D=1, e \mid c) \approx \frac{1}{k+1} p(e \mid c) \approx \frac{1}{k+1} \exp s_c(e)$. Note that the final approximation drops $Z_c$ from the calculation, improving efficiency and forcing the model to self-normalize. With these two terms in place, and a few manipulations of conditional probability, the NCE training objective can be given as:
$$-\sum_{e,c} \left[ \log p(D=1 \mid e, c) + \sum_{j=1}^{k} \log p(D=0 \mid \hat{e}_j, c) \right]$$
which measures the probability that data is recognized as data, and noise is recognized as noise.
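To make the classifier view concrete, the sketch below computes the NCE loss for one training word and its noise samples. It assumes the posteriors follow from the two joint probabilities above, that is $p(D=1 \mid e, c) = \exp s_c(e) / (\exp s_c(e) + k\,q(e))$ and $p(D=0 \mid e, c) = 1 - p(D=1 \mid e, c)$, which can be written as a sigmoid of $s_c(e) - \log(k\,q(e))$. The function names are illustrative, not the paper's implementation.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def nce_loss(score_true, q_true, noise_scores, noise_qs, k):
    """Negative log probability that data is labeled data and noise is labeled noise.

    score_true   -- s_c(e) for the observed word
    q_true       -- noise probability q(e) of the observed word
    noise_scores -- s_c(e_hat_j) for each of the k sampled noise words
    noise_qs     -- q(e_hat_j) for each sampled noise word
    """
    # log p(D=1 | e, c) = log sigmoid(s_c(e) - log(k q(e)))
    loss = -log_sigmoid(score_true - math.log(k * q_true))
    for s, q in zip(noise_scores, noise_qs):
        # log p(D=0 | e_hat, c) = log sigmoid(-(s_c(e_hat) - log(k q(e_hat))))
        loss -= log_sigmoid(-(s - math.log(k * q)))
    return loss
```

Only the scores of the observed word and the $k$ sampled words are needed, so no sum over the vocabulary appears anywhere in the loss.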
Note that $q$ ignores the context $c$. Previous work on monolingual language modeling indicates that a unigram proposal distribution is sufficient for NCE training (Mnih and Teh, 2012). But for bilingual NNJMs, Zhang et al. (2015) have shown that it is beneficial to have $q$ condition on source context. We experiment with a translation noise distribution $q_t(\hat{e} \mid f_{a_i})$. We estimate $q_t$ by relative frequency from our training corpus, which implicitly provides us with one $\langle e_i, f_{a_i} \rangle$ pair for each training point $\langle e_i, c_i \rangle$. Conditioning on $f_{a_i}$ drastically reduces the entropy of the noise distribution, focusing training on the task of differentiating between likely translation candidates.
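The relative-frequency estimate of the translation noise distribution can be sketched as below. The input format (a list of target-word and aligned-source-word pairs) and the names `estimate_translation_noise` and `sample_noise` are assumptions made for illustration; the toy word pairs are invented and are not drawn from the paper's data.

```python
import random
from collections import Counter, defaultdict

def estimate_translation_noise(pairs):
    """Estimate q_t(e | f) by relative frequency from (e, f) training pairs."""
    counts = defaultdict(Counter)
    for e, f in pairs:
        counts[f][e] += 1
    q_t = {}
    for f, target_counts in counts.items():
        total = sum(target_counts.values())
        q_t[f] = {e: n / total for e, n in target_counts.items()}
    return q_t

def sample_noise(q_t, f, k, rng):
    """Draw k noise words from q_t(. | f) using the supplied random.Random."""
    words = list(q_t[f])
    weights = [q_t[f][w] for w in words]
    return rng.choices(words, weights=weights, k=k)

# Toy usage with invented pairs.
toy_pairs = [("house", "maison"), ("home", "maison"), ("house", "maison"), ("cat", "chat")]
toy_q_t = estimate_translation_noise(toy_pairs)
toy_samples = sample_noise(toy_q_t, "maison", 5, random.Random(0))
```

Because each source word typically co-occurs with only a handful of target words, the sampled noise concentrates on plausible translation candidates, which is the low-entropy behaviour described above.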
As our experiments will show, under NCE with translation noise, the NNJM no longer provides meaningful scores for the entire vocabulary. Therefore, we also experiment with a novel mixture noise distribution, which mixes unigram noise into the translation noise.
We deviate from their configuration by using a single 512-node hidden layer, motivated by our internal development experiments. All NCE variants use k = 100 noise samples.
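One plausible reading of a mixture noise distribution is a convex combination of the unigram noise $q_u$ and the translation noise $q_t$. The sketch below illustrates that reading only: the mixing weight `lam` and its default of 0.5 are hypothetical, since the text here does not state the weights, and the toy distributions are invented.

```python
import random

def mixture_prob(e, f, q_u, q_t, lam=0.5):
    """lam * q_u(e) + (1 - lam) * q_t(e | f); lam is a hypothetical mixing weight."""
    return lam * q_u.get(e, 0.0) + (1.0 - lam) * q_t.get(f, {}).get(e, 0.0)

def sample_mixture(f, q_u, q_t, k, rng, lam=0.5):
    """Draw k noise words: pick a component for each sample, then sample from it."""
    samples = []
    for _ in range(k):
        dist = q_u if rng.random() < lam else q_t[f]
        words = list(dist)
        weights = [dist[w] for w in words]
        samples.append(rng.choices(words, weights=weights, k=1)[0])
    return samples

# Invented toy distributions over a 4-word vocabulary.
toy_q_u = {"house": 0.4, "home": 0.1, "cat": 0.3, "the": 0.2}
toy_q_t = {"maison": {"house": 0.75, "home": 0.25}}
toy_mix_samples = sample_mixture("maison", toy_q_u, toy_q_t, 10, random.Random(0))
```

Under such a mixture every vocabulary word keeps a non-zero noise probability, so the classifier still sees noise drawn from outside the set of likely translations.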
NNJM training data is pre-processed to limit vocabularies to 16K types for source or target inputs, and 32K types for target outputs. We build 400 deterministic word clusters for each corpus using mkcls (Och, 1999). Any word not among the 16K / 32K most frequent words is replaced with its cluster.
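The vocabulary-limiting step can be sketched as follows. The word-to-cluster mapping `cluster_of` stands in for the output of mkcls and its format is an assumption, as is the `<cluster_N>` token spelling; neither is specified in the text.

```python
from collections import Counter

def build_vocab(tokens, max_size):
    """Keep the max_size most frequent word types."""
    return {w for w, _ in Counter(tokens).most_common(max_size)}

def map_token(token, vocab, cluster_of):
    """Replace an out-of-vocabulary word with a token naming its word cluster."""
    if token in vocab:
        return token
    return f"<cluster_{cluster_of[token]}>"

# Toy usage: keep the 2 most frequent types, back off to clusters for the rest.
toy_tokens = ["a", "a", "a", "b", "b", "c"]
toy_vocab = build_vocab(toy_tokens, 2)
toy_mapped = [map_token(t, toy_vocab, {"c": 7}) for t in toy_tokens]
```

In the setting described above, `max_size` would be 16K for source or target inputs and 32K for target outputs, with 400 clusters per corpus.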
We train our models with mini-batch stochastic gradient descent, with a batch size of 128 words, and an initial learning rate of 0.3. We check our training objective on the development set every 20K batches, and if it fails to improve for two consecutive checks, the learning rate is halved. Training stops after 5 consecutive failed checks or after 60 checks. As NCE may take longer to converge than MLE, we occasionally let NCE models train to 90 checks, but this never resulted in improved performance. Finally, after training finishes on the complete training data, we use that model to initialize a second training run, on a smaller in-domain training set known to better match the test conditions. This in-domain pass uses a lower initial learning rate of 0.03.
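The learning-rate schedule and stopping rule can be sketched as a small function over the sequence of development-set objective values, one per check. Two details are assumptions of this sketch rather than statements from the text: a check counts as failed when it does not beat the best value seen so far, and the rate is halved at every second consecutive failure (so at the 2nd and 4th failures of a run).

```python
def run_schedule(dev_losses, lr=0.3, halve_after=2, stop_after=5, max_checks=60):
    """Return (final learning rate, number of checks performed)."""
    best = float("inf")
    fails = 0   # consecutive checks without improvement
    checks = 0
    for loss in dev_losses[:max_checks]:
        checks += 1
        if loss < best:
            best, fails = loss, 0
            continue
        fails += 1
        if fails >= stop_after:
            break          # 5 consecutive failed checks: stop training
        if fails % halve_after == 0:
            lr /= 2        # every second consecutive failure: halve the rate
    return lr, checks

# Toy trace: two failures, one recovery, then five failures in a row.
toy_losses = [1.0, 0.9, 0.95, 0.96, 0.8, 0.85, 0.86, 0.87, 0.88, 0.89, 0.5]
toy_final_lr, toy_checks = run_schedule(toy_losses)
```

For the in-domain pass the same function could be called with `lr=0.03`, the lower initial rate mentioned above.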
Our translation system is a multi-stack phrase-based decoder that is quite similar to Moses (Koehn et al., 2007). Its features include standard phrase table probabilities, KN-smoothed language models including a 6-gram model trained on the English Gigaword and a 4-gram model trained on the target side of the parallel training data, domain-adapted phrase tables and language models (Foster and Kuhn, 2007), a hierarchical lexicalized reordering model (Galley and Manning, 2008), and sparse features drawn from Hopkins and May (2011) and Cherry (2013). It is tuned with a batch-lattice variant of hope-fear MIRA (Chiang et al., 2008; Cherry and Foster, 2012).

Experiments
We test two translation scenarios drawn from the recent BOLT evaluations: Arabic-to-English and Chinese-to-English. The vital statistics for our corpora are given in  NIST data with BOLT-specific informal genres. The development and test sets are focused specifically on the web-forum genre, as is the in-domain subset of the training data (In-dom). The Arabic was segmented with MADA-ARZ (Habash et al., 2013), while the Chinese was segmented with a lexicon-based approach. All data was word-aligned with IBM-4 in GIZA++ (Och and Ney, 2003), with grow-diag-final-and symmetrization (Koehn et al., 2003).

Comparing Training Objectives
Our main experiment is designed to answer two questions: (1) does training NNJMs with NCE impact translation quality? and (2) can any reduction be mitigated through alternate noise distributions? To this end, we train four NNJMs:
• MLE: Maximum likelihood training with self-normalization, α = 0.1
• NCE-U: NCE with unigram noise
• NCE-T: NCE with translation noise
• NCE-M: NCE with mixture noise
and compare their performance to that of a system with no NNJM. Each NNJM was trained as described in Section 3, varying only the learning objective. To measure intrinsic NNJM quality, we report average negative log likelihoods (NLL) and average $|\log Z|$, both calculated on Dev. Lower NLL scores indicate better prediction accuracy, while lower $|\log Z|$ values indicate more effective self-normalization. We also provide average BLEU scores and standard deviations for Test1 and Test2, each calculated over 5 random tuning replications. Statistical significance is calculated with MultEval (Clark et al., 2011).
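The two intrinsic measures can be computed together in one pass over the development set, as in the sketch below. It assumes that NLL is measured with fully normalized probabilities, which the text does not state explicitly, and the function name `dev_metrics` is illustrative.

```python
import math

def dev_metrics(examples):
    """Return (average NLL, average |log Z|) over (scores, target_index) pairs."""
    nll_sum = 0.0
    abs_log_z_sum = 0.0
    for scores, target in examples:
        m = max(scores)  # max-shift for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        nll_sum -= scores[target] - log_z   # -log p(e|c), assumed normalized
        abs_log_z_sum += abs(log_z)         # distance from perfect self-normalization
    n = len(examples)
    return nll_sum / n, abs_log_z_sum / n

# Toy usage: one unnormalized context (log Z = log 2) and one self-normalized one.
toy_examples = [([0.0, 0.0], 0), ([math.log(0.5), math.log(0.5)], 1)]
toy_avg_nll, toy_avg_abs_log_z = dev_metrics(toy_examples)
```

A model can score well on one measure and poorly on the other, which is why the two are reported separately.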
Our results are shown in Table 2.
Though NCE-T performs very well as a translation feature, it is relatively poor as a language model, with abnormally large values for both NLL and $|\log Z|$. This indicates that NCE-T is only good at predicting the next word from a pool of reasonable translation candidates. Scores for words drawn from the larger vocabulary are less accurate. However, the BLEU results for NCE-T show that this does not matter for translation performance. If model likelihoods over the complete vocabulary are needed, one can repair these estimates by mixing in unigram noise, as shown by NCE-M, which achieves the same or better likelihoods than NCE-U, with comparable BLEU scores to those of NCE-T.
Devlin et al. (2014) suggest that one drawback of NCE with respect to self-normalized MLE is NCE's lack of an α hyper-parameter to control the objective's emphasis on self-normalization. However, the $|\log Z|$ values for NCE-U are only slightly lower than those of MLE, and are larger than those of the superior NCE-M. This suggests that we could not have improved NCE-U's performance by adjusting its emphasis on self-normalization.

Impact of the Domain Adaptation Pass
We began this project with the hypothesis that NCE may harm NNJM performance. But NCE-U performed worse than we expected. In particular, the differences between NCE-U and NCE-T are larger than those reported by Zhang et al. (2015). This led us to investigate the domain adaptation pass, which was used in our experiments but not those of Zhang et al. This step refines the model with a second training pass on an in-domain subset of the training data. We repeated our comparison for Arabic without domain adaptation, reporting BLEU averaged over two test sets and across 5 tuning replications. We also report each system's BLEU differential ∆ with respect to MLE. The results are shown under General in Table 3, while Adapted summarizes our results from Table 2 in the same format. The domain adaptation step magnifies the differences between training objectives, perhaps because it increases performance overall. The spread between the worst and best NNJM is only 0.3 BLEU under General, while it is 0.8 BLEU under Adapted. Therefore, groups training unadapted models may not see as large drops from NCE-U as we have reported above. Note that we experimented with several configurations that account specifically for this domain-adaptation pass (noise distributions based on general versus in-domain corpora, alternate stopping criteria), so that NCE-U would be presented in the most positive possible light. Perhaps most importantly, Table 3 shows that the domain adaptation pass is quite effective, producing large improvements for all NNJMs.

Impact on Speed
MLE and NCE both produce self-normalized models, so they both have the same impact on decoding speed. With the optimizations described by Devlin et al. (2014), the impact of any single-hidden-layer NNJM is negligible.
For training, the main benefit of NCE is that it reduces the cost of the network's output layer, replacing a term that was linear in the vocabulary size with one that is linear in the sample size. In our experiments, this is a reduction from 32K to 100. The actual benefit from this reduction is highly implementation- and architecture-dependent. It is difficult to get a substantial speedup from NCE using Theano on GPU hardware, as both reward dense matrix operations, and NCE demands sparse vector operations (Jean et al., 2015). Therefore, our decision to implement all methods in a shared codebase, which ensured a fair comparison of model quality, also prevented us from providing a meaningful evaluation of training speed, as the code and architecture were implicitly optimized to favour the most demanding method (MLE). Fortunately, there is ample evidence that NCE can provide large improvements to per-batch training speeds for NNLMs, ranging from a 2× speed-up for 20K-word vocabularies on a GPU (Chen et al., 2015) to more than 10× for 70K-word vocabularies on a CPU (Vaswani et al., 2013). Meanwhile, our experiments show that 1.2M batches are sufficient for MLE, NCE-T and NCE-M to achieve very high quality; that is, none of these methods made use of early stopping during their main training pass. This indicates that per-batch speed is the most important factor when comparing the training times of these NNJMs.
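The size of the output-layer reduction can be worked through with a back-of-the-envelope count of multiply-adds per training word. The sketch assumes a 512-node hidden layer feeding the output layer, reads 32K as 32,000 output words, and counts the observed word plus 100 noise samples for NCE; it is an idealized operation count and says nothing about wall-clock time.

```python
def output_layer_madds(hidden_size, num_outputs):
    """Multiply-adds needed to score num_outputs words from one hidden vector."""
    return hidden_size * num_outputs

HIDDEN = 512            # single hidden layer size used in the experiments
VOCAB = 32_000          # assumed reading of "32K" output types
NOISE_SAMPLES = 100     # k noise samples per training word

mle_cost = output_layer_madds(HIDDEN, VOCAB)
nce_cost = output_layer_madds(HIDDEN, NOISE_SAMPLES + 1)  # +1 for the observed word
ideal_ratio = mle_cost / nce_cost
```

The idealized ratio is roughly 300-fold, far larger than the 2× to 10× per-batch speed-ups cited above, which is consistent with the point that the realized benefit depends heavily on implementation and hardware.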

Conclusions
We have shown that NCE training with a unigram noise distribution does reduce NNJM performance with respect to MLE training, both in terms of model likelihoods and downstream translation quality. This performance drop can be avoided if NCE uses a translation-aware noise distribution. We have emphasized the importance of a domain-specific training pass, and we have shown that this pass magnifies the differences between the various NNJM training objectives. In a few cases, NCE with translation noise actually outperformed MLE. This suggests that there is value in only considering plausible translation candidates during training. It would be interesting to explore methods to improve MLE with this intuition.