A Binarized Neural Network Joint Model for Machine Translation

The neural network joint model (NNJM), which augments the neural network language model (NNLM) with an m-word source context window, has achieved large gains in machine translation accuracy, but also has problems with high normalization cost when using large vocabularies. Training the NNJM with noise-contrastive estimation (NCE), instead of standard maximum likelihood estimation (MLE), can reduce computation cost. In this paper, we propose an alternative to NCE, the binarized NNJM (BNNJM), which learns a binary classifier that takes both the context and target words as input, and can be efficiently trained using MLE. We compare the BNNJM and NNJM trained by NCE on various translation tasks.

Notably, Devlin et al. (2014) proposed a neural network joint model (NNJM), which augments the n-gram neural network language model (NNLM) with an m-word source context window, as shown in Figure 1a. While this model is effective, its computation cost in a large-vocabulary SMT task is high, as probabilities need to be normalized over the entire vocabulary. To solve this problem, Devlin et al. (2014) presented a technique to train the NNJM to be self-normalized, which avoids the expensive normalization cost during decoding. However, they also note that this self-normalization technique sacrifices neural network accuracy, and that training the self-normalized neural network is very slow, as it is with standard maximum likelihood estimation (MLE).
To remedy the problem of long training times in the context of NNLMs, Vaswani et al. (2013) used a method called noise-contrastive estimation (NCE). Unlike MLE, NCE does not require repeated summations over the whole vocabulary; instead, it performs nonlinear logistic regression to discriminate between the observed data and artificially generated noise.
This paper proposes an alternative framework of binarized NNJMs (BNNJM), which are similar to the NNJM, but use the current target word not as the output, but as the input of the neural network, estimating whether the target word under examination is correct or not, as shown in Figure 1b. Because the BNNJM uses the current target word as input, the information about the current target word can be combined with the context word information and processed in the hidden layers.
The BNNJM learns a simple binary classifier given the context and target words, so it can be trained by MLE very efficiently. "Incorrect" target words for the BNNJM can be generated in the same way as NCE generates noise for the NNJM. We present a novel noise distribution based on translation probabilities to train the NNJM and the BNNJM efficiently.

Neural Network Joint Model
Let $T = t_1^{|T|}$ be a translation of $S = s_1^{|S|}$. The NNJM (Devlin et al., 2014) defines the following probability,

$$P(T \mid S) = \prod_{i=1}^{|T|} P\left(t_i \mid s_{a_i-(m-1)/2}^{a_i+(m-1)/2},\; t_{i-n+1}^{i-1}\right) \qquad (1)$$

where target word $t_i$ is affiliated with source word $s_{a_i}$. Affiliation $a_i$ is derived from the word alignments using heuristics. To estimate these probabilities, the NNJM uses $m$ source context words and $n-1$ target history words as input to a neural network and estimates unnormalized probabilities $p(t_i \mid C)$ before normalizing over all words in the target vocabulary $V$,

$$P(t_i \mid C) = \frac{p(t_i \mid C)}{\sum_{t_i' \in V} p(t_i' \mid C)} \qquad (2)$$

where $C$ stands for the source and target context words as in Equation 1. The NNJM can be trained on a word-aligned parallel corpus using standard MLE, but the cost of normalizing over the entire vocabulary to calculate the denominator in Equation 2 is quite large. The self-normalization technique of Devlin et al. (2014) can avoid the normalization cost during decoding, but not during training.
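The input and normalization steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `<pad>` token used outside sentence boundaries, and the dictionary of unnormalized scores are hypothetical and are not taken from the paper.

```python
def nnjm_context(src, tgt, aff, i, m=7, n=5, pad="<pad>"):
    """Collect the NNJM input for target position i.

    src, tgt: lists of source / target words.
    aff: aff[i] is the index of the source word affiliated with tgt[i].
    Returns the m-word source window centred on src[aff[i]] and the
    n-1 target history words preceding tgt[i]. Padding with `pad` at
    sentence boundaries is an assumption of this sketch.
    """
    half = (m - 1) // 2
    a = aff[i]
    window = [src[j] if 0 <= j < len(src) else pad
              for j in range(a - half, a + half + 1)]
    history = [tgt[j] if j >= 0 else pad
               for j in range(i - n + 1, i)]
    return window, history


def normalize(scores):
    """Turn unnormalized scores p(t|C) into probabilities P(t|C).

    The sum runs over every word in the vocabulary, which is the
    expensive step when the vocabulary is large.
    """
    z = sum(scores.values())
    return {t: s / z for t, s in scores.items()}
```

The cost of `normalize` grows linearly with the vocabulary size for every training example, which is the overhead that NCE and the BNNJM are designed to avoid.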
NCE can be used to train NNLM-style models (Vaswani et al., 2013) to reduce training times. NCE creates a noise distribution $q(t_i)$, selects $k$ noise samples $t_{i1}, \ldots, t_{ik}$ for each $t_i$, and introduces a random variable $v$ which is 1 for training examples and 0 for noise samples,

$$P(v=1, t_i \mid C) = \frac{1}{1+k}\, P(t_i \mid C), \qquad P(v=0, t_i \mid C) = \frac{k}{1+k}\, q(t_i).$$

NCE trains the model to distinguish training data from noise by maximizing the conditional likelihood,

$$L = \log P(v=1 \mid C, t_i) + \sum_{j=1}^{k} \log P(v=0 \mid C, t_{ij}).$$

The normalization cost can be avoided by using $p(t_i \mid C)$ as an approximation of $P(t_i \mid C)$.
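As a rough illustration of the NCE objective in its standard form, the sketch below computes the posterior that a word came from the training data rather than from the noise distribution, and the resulting log-likelihood for one training word with its k noise samples. The function names are hypothetical; the model scores and noise probabilities are assumed to be given as plain floats.

```python
import math


def nce_posterior(p_model, q_noise, k):
    """P(v=1 | C, t): probability that word t is real data, not noise.

    p_model: unnormalized model score p(t|C), used in place of P(t|C).
    q_noise: noise probability q(t).
    k: number of noise samples drawn per training word.
    """
    return p_model / (p_model + k * q_noise)


def nce_objective(p_true, q_true, noise, k):
    """Conditional log-likelihood for one training word and its noise.

    noise: list of (p_model, q_noise) pairs, one per sampled noise word.
    """
    ll = math.log(nce_posterior(p_true, q_true, k))
    for p, q in noise:
        ll += math.log(1.0 - nce_posterior(p, q, k))
    return ll
```

Only the scores of the observed word and the k sampled words are needed, so no summation over the vocabulary appears anywhere in the objective.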

Binarized NNJM
In this paper, we propose a new framework, the binarized NNJM (BNNJM), which is similar to the NNJM but, instead of predicting the next word given the context, solves a binary classification problem by adding a variable $v \in \{0, 1\}$ that indicates whether the current target word $t_i$ is correctly or wrongly produced given the source context words $s_{a_i-(m-1)/2}^{a_i+(m-1)/2}$ and the target history words $t_{i-n+1}^{i-1}$,

$$P\left(v \mid s_{a_i-(m-1)/2}^{a_i+(m-1)/2},\; t_{i-n+1}^{i-1},\; t_i\right).$$

Because the BNNJM uses the current target word as input, the information about the current target word can be combined with the context word information and processed in the hidden layers. Thus, the hidden layers can be used to learn the difference between correct target words and noise in the BNNJM. In the NNJM, by contrast, the hidden layers contain information only about the context words, and only the output layer can be used to discriminate between the training data and noise. This gives the BNNJM more power to learn the classification problem.
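To make the architectural difference concrete, here is a minimal feed-forward sketch in which the current target word is embedded and concatenated with the context words before the hidden layer. Several details are assumptions of this sketch rather than the paper's specification: the `tanh` hidden activation, the single sigmoid output, the random initialization, and the toy layer sizes (the experiments reported below use 100 hidden nodes and 50-dimensional embeddings).

```python
import math
import random


def make_bnnjm(vocab, n_inputs, emb_dim=4, hidden=8, seed=0):
    """Build randomly initialized parameters for a toy BNNJM."""
    rng = random.Random(seed)

    def rand():
        return rng.uniform(-0.1, 0.1)

    return {
        "emb": {w: [rand() for _ in range(emb_dim)] for w in vocab},
        "W": [[rand() for _ in range(n_inputs * emb_dim)]
              for _ in range(hidden)],
        "b": [0.0] * hidden,
        "u": [rand() for _ in range(hidden)],
        "c": 0.0,
    }


def bnnjm_prob(model, source_window, target_history, target_word):
    """Return P(v=1 | source window, target history, target word)."""
    # The candidate target word is part of the input, so the hidden
    # layer sees it together with the context words.
    words = source_window + target_history + [target_word]
    x = [v for w in words for v in model["emb"][w]]
    h = [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + b)
         for row, b in zip(model["W"], model["b"])]
    z = sum(ui * hi for ui, hi in zip(model["u"], h)) + model["c"]
    return 1.0 / (1.0 + math.exp(-z))
```

Scoring one candidate requires a full forward pass but no sum over the vocabulary, which is why a single example is cheap while each additional noise sample costs more than it does for the NNJM.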
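We can use the BNNJM probability in translation as an approximation for the NNJM as below,

$$P\left(t_i \mid s_{a_i-(m-1)/2}^{a_i+(m-1)/2},\; t_{i-n+1}^{i-1}\right) \approx P\left(v=1 \mid s_{a_i-(m-1)/2}^{a_i+(m-1)/2},\; t_{i-n+1}^{i-1},\; t_i\right).$$

As a binary classifier, the gradient for a single example in the BNNJM can be calculated efficiently by MLE without calculating the softmax over the full vocabulary. On the other hand, we need to create "positive" and "negative" examples for the classifier. Positive examples can be extracted directly from the word-aligned parallel corpus as $\langle s_{a_i-(m-1)/2}^{a_i+(m-1)/2},\, t_{i-n+1}^{i-1},\, t_i \rangle$. Negative examples can be generated for each positive example in the same way that NCE generates noise data, as $\langle s_{a_i-(m-1)/2}^{a_i+(m-1)/2},\, t_{i-n+1}^{i-1},\, t_i' \rangle$, where $t_i' \in V \setminus \{t_i\}$.

Vaswani et al. (2013) used a unigram probability distribution (UPD) to sample noise for training NNLMs with NCE,

$$q(t_i') = \frac{occur(t_i')}{\sum_{t_i'' \in V} occur(t_i'')},$$

where $occur(t_i')$ stands for how many times $t_i'$ occurs in the training corpus.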
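The following sketch shows one way negative examples could be drawn when noise words are sampled in proportion to their corpus frequency, together with the binary log-likelihood of one positive/negative pair. The helper names are hypothetical, and excluding the positive word by filtering the candidate list (rather than, say, rejection sampling) is an implementation choice of this sketch.

```python
import math
import random
from collections import Counter


def unigram_counts(target_corpus):
    """Count how often each target word occurs in the training corpus."""
    return Counter(w for sent in target_corpus for w in sent)


def sample_negative(counts, positive, rng):
    """Draw one noise word in proportion to frequency, never the positive."""
    cands = [w for w in counts if w != positive]
    if not cands:
        raise ValueError("no alternative word available as noise")
    weights = [counts[w] for w in cands]
    return rng.choices(cands, weights=weights, k=1)[0]


def binary_log_likelihood(p_pos, p_neg):
    """Log-likelihood of one positive and one negative example.

    p_pos, p_neg: classifier outputs P(v=1 | ...) for the correct word
    and for the sampled noise word, respectively.
    """
    return math.log(p_pos) + math.log(1.0 - p_neg)
```

Calling `sample_negative` again in every epoch gives each positive example a fresh negative, in line with the per-epoch resampling used in the experiments.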

Translation Model Noise
In this paper, we propose a noise distribution specialized for translation models, such as the NNJM or BNNJM. Figure 2 gives a Chinese-to-English parallel sentence pair with word alignments to demonstrate the intuition behind our method. Focusing on $s_{a_i}$ = "安排", this is translated into $t_i$ = "arrange". For this positive example, UPD is allowed to sample any arbitrary noise, such as $t_i'$ = "banana". However, in this case, the noise $t_i'$ = "banana" is not useful for model training, as constraints on possible translations given by the phrase table ensure that "安排" will never be translated into "banana". On the other hand, the noise words $t_i'$ = "arranges" and "arrangement" are both possible translations of "安排" and are therefore useful training data that we would like our model to penalize.
Based on this intuition, we propose the use of another noise distribution that only uses $t_i'$ that are possible translations of $s_{a_i}$, i.e., $t_i' \in U(s_{a_i}) \setminus \{t_i\}$, where $U(s_{a_i})$ contains all target words aligned to $s_{a_i}$ in the parallel corpus.
Because $U(s_{a_i})$ may be quite large and contain many wrong translations caused by wrong alignments, "banana" may actually be included in $U$("安排"). To mitigate the effect of uncommon examples, we use a translation probability distribution (TPD) to sample noise $t_i'$ from $U(s_{a_i}) \setminus \{t_i\}$ as follows,

$$q(t_i' \mid s_{a_i}) = \frac{align(s_{a_i}, t_i')}{\sum_{t_i'' \in U(s_{a_i})} align(s_{a_i}, t_i'')},$$

where $align(s_{a_i}, t_i')$ is how many times $t_i'$ is aligned to $s_{a_i}$ in the parallel corpus.
Note that t i could be unaligned, in which case we assume that it is aligned to a special null word. Noise for unaligned words is sampled according to the TPD of the null word. If several target/source words are aligned to one source/target word, we choose to combine these target/source words as a new target/source word. 3
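A minimal sketch of TPD sampling, assuming the word alignments are available as a flat list of (source word, target word) links. The function names are hypothetical, and two choices are assumptions of this sketch rather than statements from the paper: the weights are renormalized over the candidates left after removing the correct word, and `None` is returned when a source word has no alternative translation, leaving the fallback to the caller.

```python
import random
from collections import Counter, defaultdict


def build_tpd(aligned_pairs):
    """Count how often each target word is aligned to each source word.

    aligned_pairs: iterable of (source_word, target_word) links taken
    from the word-aligned parallel corpus. Unaligned target words are
    expected to be paired with a special null source word beforehand.
    """
    table = defaultdict(Counter)
    for s, t in aligned_pairs:
        table[s][t] += 1
    return table


def sample_tpd_noise(table, source_word, positive, rng):
    """Sample a noise word among the other translations of source_word."""
    counts = table.get(source_word, {})
    cands = [t for t in counts if t != positive]
    if not cands:
        return None  # no alternative translation was observed
    weights = [counts[t] for t in cands]
    return rng.choices(cands, weights=weights, k=1)[0]
```

With the example from this section, "arrangement" would be drawn far more often than a rare misalignment such as "banana", because its alignment count is higher.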

Setting
We evaluated the effectiveness of the proposed approach for Chinese-to-English (CE), Japanese-to-English (JE) and French-to-English (FE) translation tasks. The datasets officially provided for the patent machine translation task at NTCIR-9 (Goto et al., 2011) were used for the CE and JE tasks. The development and test sets were both provided for the CE task while only the test set was provided for the JE task. Therefore, we used the sentences from the NTCIR-8 JE test set as the development set. Word segmentation was done by BaseSeg (Zhao et al., 2006) for Chinese and Mecab 4 for Japanese. For the FE language pair, we used standard data for the WMT 2014 translation task. The training sets for CE, JE and FE tasks contain 1M, 3M and 2M sentence pairs, respectively.
For each translation task, a recent version of the Moses HPB decoder (Koehn et al., 2007) with its training scripts was used as the baseline (Base). We used the default parameters for Moses, and a 5-gram language model was trained on the target side of the training corpus using the IRSTLM Toolkit 5 with improved Kneser-Ney smoothing. Feature weights were tuned by MERT (Och, 2003).
The word-aligned training set was used to learn the NNJM and the BNNJM. 6 For both NNJM and BNNJM, we set m = 7 and n = 5. The NNJM was trained by NCE using UPD and TPD as noise distributions. The BNNJM was trained by standard MLE using UPD and TPD to generate negative examples.
The number of noise samples for NCE was set to be 100. For the BNNJM, we used only one negative example for each positive example in each training epoch, as the BNNJM needs to calculate the whole neural network (not just the output layer like the NNJM) for each noise sample and thus noise computation is more expensive. However, for different epochs, we resampled the negative example for each positive example, so the BNNJM can make use of different negative examples. Table 1 shows how many epochs these two models needed and the training time for each epoch on a 10-core 3.47GHz Xeon X5690 machine. Translation results are shown in Table 2.

Table 2: Translation results. The symbol + and * represent significant differences at the p < 0.01 level against Base and NNJM+UPD, respectively. Significance tests were conducted using bootstrap resampling (Koehn, 2004).

3 The processing for multiple alignments helps sample more useful negative examples for TPD, and had little effect on the translation performance when UPD was used as the noise distribution for the NNJM and the BNNJM in our preliminary experiments.
4 http://sourceforge.net/projects/mecab/files/
5 http://hlt.fbk.eu/en/irstlm
6 Both the NNJM and the BNNJM had one hidden layer, 100 hidden nodes, input embedding dimension 50, output embedding dimension 50. A small set of training data was used as validation data. The training process was stopped when validation likelihood stopped increasing.

Results and Discussion
We can see that using TPD instead of UPD as a noise distribution for the NNJM trained by NCE can speed up the training process significantly, with a small improvement in performance. But for the BNNJM, using different noise distributions affects translation performance significantly. The BNNJM with UPD does not improve over the baseline system, likely due to the small number of noise samples used in training the BNNJM, while the BNNJM with TPD achieves good performance, even better than the NNJM with TPD on the Chinese-to-English and French-to-English translation tasks.
From Table 2, the NNJM does not improve translation performance significantly on the FE task. Note that the baseline BLEU for the FE task is lower than for the CE and JE tasks, indicating that learning is harder for the FE task than for the CE and JE tasks. The validation perplexities of the NNJM with UPD for the CE, JE and FE tasks are 4.03, 3.49 and 8.37, respectively. Despite these difficult learning circumstances and the lack of large gains for the NNJM, the BNNJM improves translations significantly for the FE task, suggesting that the BNNJM is more robust to difficult translation tasks that are hard for the NNJM.

Table 3 gives Chinese-to-English translation examples to demonstrate how the BNNJM (with TPD) helps to improve translations over the NNJM (with TPD). In this case, the BNNJM helps to translate the phrase "该 移动 持续 到" better. Table 4 gives translation scores for these two translations calculated by the NNJM and the BNNJM. Context words are used for predictions but not shown in the table.

S: 该(this) 移动(movement) 持续(continued) 到(until) 寄生虫(parasite) 由(by) 两(two) 个 舌(tongues) 部 21 彼此(each other) 接触(contact) 时(where) 的 点(point) 接触(touched) 。
R: this movement is continued until the parasite is touched by the point where the two tongues 21 contact each other .
T1: the mobile continues to the parasite from the two tongue 21 contacts the points of contact with each other .
T2: this movement is continued until the parasite by two tongue 21 contact points of contact with each other .
As can be seen, the BNNJM prefers T2 while the NNJM prefers T1. Among these predictions, the NNJM and the BNNJM differ most in how they score the translation of "到". The NNJM clearly favors translating "到" into "to" rather than "until" in this case, likely because this example rarely occurs in the training corpus. The BNNJM, however, prefers "until" over "to", which demonstrates the BNNJM's robustness to less frequent examples.

Analysis for JE Translation Results
Finally, we examine the translation results to explore why the BNNJM with TPD did not outperform the NNJM with TPD for the JE translation task, as it did for the other translation tasks. We found that using the BNNJM instead of the NNJM on the JE task did improve translation quality significantly for infrequent words, but not for frequent words.
First, we describe how we estimate translation quality for infrequent words. Suppose we have a test set $S$, a reference set $R$ and a translation set $T$ with $I$ sentences, $S_i$, $R_i$ and $T_i$ ($1 \le i \le I$). $T_i$ contains $J_i$ distinct words $W_{ij}$; $T_o(W_{ij})$ is how many times $W_{ij}$ occurs in $T_i$ and $R_o(W_{ij})$ is how many times $W_{ij}$ occurs in $R_i$. The general 1-gram translation accuracy (Papineni et al., 2002) is calculated as,

$$P_g = \frac{\sum_{i=1}^{I} \sum_{j=1}^{J_i} \min\left(T_o(W_{ij}),\, R_o(W_{ij})\right)}{\sum_{i=1}^{I} \sum_{j=1}^{J_i} T_o(W_{ij})}.$$

This general 1-gram translation accuracy does not distinguish word frequency.

We use a modified 1-gram translation accuracy that weights infrequent words more heavily,

$$P_c = \frac{\sum_{i=1}^{I} \sum_{j=1}^{J_i} \min\left(T_o(W_{ij}),\, R_o(W_{ij})\right) \cdot \frac{1}{Occur(W_{ij})}}{\sum_{i=1}^{I} \sum_{j=1}^{J_i} T_o(W_{ij})},$$

where $Occur(W_{ij})$ is how many times $W_{ij}$ occurs in the whole reference set. Note that $P_c$ will not be 1 even in the case of completely accurate translations, but it can approximately reflect infrequent word translation accuracy, since correct frequent word translations contribute less to $P_c$.

Table 5 shows $P_g$ and $P_c$ for different translation tasks. It can be seen that the BNNJM improves infrequent word translation quality similarly for all translation tasks, but improves general translation quality less for the JE task than for the other translation tasks. We conjecture that the BNNJM is less useful for frequent word translations on the JE task because the JE parallel corpus has less accurate function word alignments than the other language pairs, as the grammatical features of Japanese and English are quite different. Wrong function word alignments make noise sampling less effective and therefore lower the BNNJM performance for function word translations. Although wrong word alignments also make noise sampling less effective for the NNJM, the BNNJM uses only one noise sample for each positive example, so wrong word alignments affect the BNNJM more than the NNJM.
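The two accuracy measures can be illustrated with a short sketch. It assumes clipped per-sentence counts as in the unigram precision of Papineni et al. (2002) and a weight of 1/Occur(w) on each matched word; the function name and the tokenized-list input format are hypothetical.

```python
from collections import Counter


def unigram_accuracies(translations, references):
    """Return (P_g, P_c) for tokenized translations and references.

    translations, references: parallel lists of token lists.
    P_g is the clipped 1-gram accuracy; P_c down-weights each match by
    how often the word occurs in the whole reference set.
    """
    occur = Counter(w for ref in references for w in ref)
    matched = weighted = total = 0.0
    for hyp, ref in zip(translations, references):
        hyp_counts, ref_counts = Counter(hyp), Counter(ref)
        for w, c in hyp_counts.items():
            clip = min(c, ref_counts[w])
            matched += clip
            if clip:
                # clip > 0 implies w is in a reference, so occur[w] >= 1
                weighted += clip / occur[w]
            total += c
    if total == 0:
        raise ValueError("translations contain no words")
    return matched / total, weighted / total
```

Because every weight is at most 1, the second value never exceeds the first, and it stays below 1 even for a perfect translation whenever some word occurs more than once in the references.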

Related Work
Xu et al. (2011) proposed a method to use binary classifiers to learn NNLMs. But they also used the current target word in the output, similarly to NCE. The BNNJM uses the current target word as input, so the information about the current target word can be combined with the context word information and processed in hidden layers. Mauser et al. (2009) presented discriminative lexicon models to predict target words. They train a separate classifier for each target word, as these lexicon models use discrete representations of words and different classifiers do not share features. In contrast, the BNNJM uses real-valued vector representations of words and shares features, so we train one classifier and can use the similarity information between words.

Conclusion
This paper proposes an alternative to the NNJM, the BNNJM, which learns a binary classifier that takes both the context and target words as input and combines all useful information in the hidden layers. We also present a novel noise distribution based on translation probabilities to train the BNNJM efficiently. With the improved noise sampling method, the BNNJM can achieve comparable performance with the NNJM and even improve the translation results over the NNJM on Chinese-to-English and French-to-English translations.