Deep Neural Language Models for Machine Translation

Neural language models (NLMs) have been able to improve machine translation (MT) thanks to their ability to generalize well to long contexts. Despite recent successes of deep neural networks in speech and vision, the general practice in MT is to incorporate NLMs with only one or two hidden layers, and there have not been clear results on whether having more layers helps. In this paper, we demonstrate that deep NLMs with three or four layers outperform those with fewer layers in terms of both perplexity and translation quality. We combine various techniques to successfully train deep NLMs that jointly condition on both the source and target contexts. When reranking n-best lists of a strong web-forum baseline, our deep models yield an average boost of 0.5 TER / 0.5 BLEU points compared to using a shallow NLM. Additionally, we adapt our models to a new sms-chat domain and obtain a similar gain of 1.0 TER / 0.5 BLEU points.


Introduction
Deep neural networks (DNNs) have been successful in learning more complex functions than shallow ones (Bengio, 2009) and have excelled in many challenging tasks such as speech and vision (Krizhevsky et al., 2012). These results have sparked interest in applying DNNs to natural language processing problems as well. Specifically, in machine translation (MT), there has been an active body of recent work on utilizing neural language models (NLMs) to improve translation quality. However, to the best of our knowledge, work in this direction only makes use of NLMs with either one or two hidden layers. For example, Schwenk (2010, 2012) and Son et al. (2012) used shallow NLMs with a single hidden layer for reranking. Vaswani et al. (2013) considered two-layer NLMs for decoding but provided no comparison among models of various depths. Devlin et al. (2014) reported only a small gain when decoding with a two-layer NLM over a single-layer one. There have not been clear results on whether adding more layers to NLMs helps.
In this paper, we demonstrate that deep NLMs with three or four layers are better than those with fewer layers in terms of both perplexity and translation quality. We detail how we combine various techniques from past work to successfully train deep NLMs that condition on both the source and target contexts. When reranking n-best lists of a strong web-forum MT baseline, our deep models achieve an additional improvement of 0.5 TER / 0.5 BLEU compared to using a shallow NLM. Furthermore, by fine-tuning general in-domain NLMs with out-of-domain data, we obtain a similar boost of 1.0 TER / 0.5 BLEU points over a strong domain-adapted sms-chat baseline compared to utilizing a shallow NLM.

Neural Language Models
We briefly describe the NLM architecture and training objective used in this work and compare our approach to related work.
Architecture. Neural language models are fundamentally feed-forward networks as described in (Bengio et al., 2003), but are not necessarily limited to a single hidden layer. Like any other language model, NLMs specify a distribution, $p(w|c)$, to predict the next word $w$ given a context $c$. The first step is to look up embeddings for the words in the context and concatenate them to form an input, $h^{(0)}$, to the first hidden layer. We then repeatedly build up hidden representations as follows, for $l = 1, \ldots, n$:
$$h^{(l)} = f\left(W^{(l)} h^{(l-1)} + b^{(l)}\right) \qquad (1)$$
where $f$ is a non-linear function such as tanh. The predictive distribution, $p(w|c)$, is then derived using the standard softmax:
$$s = W^{(sm)} h^{(n)} + b^{(sm)}, \qquad p(w|c) = \frac{\exp(s_w)}{\sum_{w'} \exp(s_{w'})} \qquad (2)$$
Objective. The typical way of training NLMs is to maximize the training data likelihood, or equivalently, to minimize the cross-entropy objective of the following form: $\sum_{(c,w) \in T} -\log p(w|c)$. Training NLMs can be prohibitively slow due to the computationally expensive softmax layer. As a result, past work has tried to use more efficient versions of the softmax such as the hierarchical softmax (Morin, 2005; Mnih and Hinton, 2007; Mnih and Hinton, 2009) or the class-based one (Mikolov et al., 2010; Mikolov et al., 2011). Recently, the noise-contrastive estimation (NCE) technique (Gutmann and Hyvärinen, 2012) has been applied to train NLMs in (Mnih and Teh, 2012; Vaswani et al., 2013) to avoid explicitly computing the normalization factors. Devlin et al. (2014) used a modified version of the cross-entropy objective, the self-normalized one. The idea is to not only improve the prediction, $p(w|c)$, but also to push the normalization factor per context, $Z_c = \sum_{w'} \exp(s_{w'})$, close to 1:
$$J = \sum_{(c,w) \in T} \left[ -\log p(w|c) + \alpha \log^2 Z_c \right] \qquad (3)$$
While self-normalization does not speed up training, it allows trained models to be applied efficiently at test time without computing the normalization factors. This is similar in flavor to NCE but allows for flexibility (through $\alpha$) in how hard we want to "squeeze" the normalization factors.
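To make the architecture and objective concrete, the following is a minimal pure-Python sketch of a feed-forward NLM forward pass and a self-normalized loss (cross-entropy plus α times the squared log normalizer, in the spirit of Devlin et al. (2014)). It is an illustration under stated assumptions, not the implementation used in this work: the function names, the parameter dictionary layout, and the toy sizes are all hypothetical, and the rectified linear unit is used as the non-linearity.

```python
import math
import random


def relu(x):
    """Rectified linear function max{0, x}, applied element-wise."""
    return [max(0.0, v) for v in x]


def affine(W, b, h):
    """Return W h + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, h)) + bias for row, bias in zip(W, b)]


def init_params(vocab_size, emb_dim, context_len, hidden_sizes, seed=0):
    """Hypothetical parameter layout; all values drawn uniformly from [-0.01, 0.01]."""
    rng = random.Random(seed)

    def vec(n):
        return [rng.uniform(-0.01, 0.01) for _ in range(n)]

    def mat(rows, cols):
        return [vec(cols) for _ in range(rows)]

    sizes = [emb_dim * context_len] + list(hidden_sizes)
    hidden = [(mat(out, inp), vec(out)) for inp, out in zip(sizes, sizes[1:])]
    return {
        "emb": mat(vocab_size, emb_dim),
        "hidden": hidden,
        "W_sm": mat(vocab_size, sizes[-1]),
        "b_sm": vec(vocab_size),
    }


def forward(context_ids, params):
    """Return (log p(.|c), log Z_c) for one context of word ids."""
    # h^(0): concatenation of the context word embeddings.
    h = [v for i in context_ids for v in params["emb"][i]]
    # Hidden layers: h^(l) = f(W^(l) h^(l-1) + b^(l)).
    for W, b in params["hidden"]:
        h = relu(affine(W, b, h))
    # Softmax layer, computed in log space for numerical stability.
    s = affine(params["W_sm"], params["b_sm"], h)
    m = max(s)
    log_z = m + math.log(sum(math.exp(v - m) for v in s))
    return [v - log_z for v in s], log_z


def self_normalized_loss(batch, params, alpha=0.1):
    """Sum over (context, word) pairs of -log p(w|c) + alpha * (log Z_c)^2."""
    total = 0.0
    for context_ids, word in batch:
        log_probs, log_z = forward(context_ids, params)
        total += -log_probs[word] + alpha * log_z ** 2
    return total
```

For example, `init_params(vocab_size=10, emb_dim=4, context_len=3, hidden_sizes=[8, 8, 8])` builds a toy three-layer model, and `self_normalized_loss([([1, 2, 3], 4)], params)` evaluates the loss on a single (context, word) pair. Setting `alpha=0.0` recovers the plain cross-entropy.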
Training deep NLMs. We follow (Devlin et al., 2014) to train self-normalized NLMs, conditioning on both the source and target contexts. Unlike (Devlin et al., 2014), we found that using the rectified linear function, max{0, x}, proposed in (Nair and Hinton, 2010), works better than tanh. The rectified linear function was used in (Vaswani et al., 2013) as well. Furthermore, while these works use a fixed learning rate throughout, we found that having a simple learning rate schedule is useful in training well-performing deep NLMs. This has also been demonstrated in (Sutskever et al., 2014; Luong et al., 2015) and is detailed in Section 3. We do not perform any gradient clipping and notice that learning is more stable when short sentences of length less than or equal to 2 are removed. Bias terms are used for all hidden layers as well as the softmax layer as described earlier, which is slightly different from other work such as (Vaswani et al., 2013). All these details contribute to our success in training deep NLMs. For simplicity, the same vocabulary is used for both the embedding and the softmax matrices. In addition, we adopt the standard softmax to take advantage of GPUs in performing large matrix multiplications. All hyperparameters are given later.
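As an illustration of what conditioning on both the source and target contexts can look like, here is a minimal sketch that assembles one joint context. The default window sizes (5 source words on each side of a center position, and a 4-word target history) follow the training settings reported in this paper. Everything else is an assumption made for the example: the function name, the padding symbols, and in particular the `aligned_src_pos` argument, which stands in for whatever alignment heuristic selects the source center word.

```python
def joint_context(src, tgt, tgt_pos, aligned_src_pos,
                  src_window=5, tgt_history=4,
                  src_pad="<src_pad>", tgt_bos="<s>"):
    """Build the context used to predict tgt[tgt_pos].

    The context is a source window centered at aligned_src_pos
    (2 * src_window + 1 words) followed by the tgt_history preceding
    target words. Positions outside a sentence are filled with
    hypothetical padding symbols.
    """
    src_ctx = [
        src[i] if 0 <= i < len(src) else src_pad
        for i in range(aligned_src_pos - src_window,
                       aligned_src_pos + src_window + 1)
    ]
    tgt_ctx = [
        tgt[i] if i >= 0 else tgt_bos
        for i in range(tgt_pos - tgt_history, tgt_pos)
    ]
    return src_ctx + tgt_ctx
```

With the default settings, each context contains 11 source words and 4 target words, so the input to the first hidden layer is the concatenation of 15 embeddings.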

Data
We use the Chinese-English bitext in the DARPA BOLT (Broad Operational Language Translation) program, with 11.1M parallel sentences (281M Chinese words and 307M English words). We reserve 585 sentences for validation, i.e., choosing hyperparameters, and 1124 sentences for testing.

NLM Training
We train our NLMs described in Section 2 with SGD, using: (a) a source window of size 5, i.e., 11-gram source context, (b) a 4-word target history, i.e., 5-gram target LM, (c) a self-normalized weight α = 0.1, (d) a mini-batch of size 128, and (e) a learning rate of 0.1 (training costs are normalized by the mini-batch size). All weights are uniformly initialized in [−0.01, 0.01]. We train our models for 4 epochs (after 2 epochs, the learning rate is halved every 0.5 epoch). The vocabularies are limited to the top 40K frequent words for both Chinese and English. All words not in   Table 1. With more layers, the model succeeds in learning more complex functions; the prediction, hence, becomes more accurate as evidenced by smaller perplexities for both the validation and test sets. Interestingly, we observe that deeper nets can learn self-normalized NLMs better: the mean log normalization factor, |log Z| in Eq. (3), is driven towards 0 as the depth increases. 5
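The learning rate schedule stated above (a rate of 0.1, held for 2 epochs and then halved every 0.5 epoch) can be sketched as a small function. The text does not say whether the first halving takes effect exactly at the 2-epoch mark or half an epoch later; this sketch assumes the former. The function and parameter names are hypothetical.

```python
import math


def learning_rate(epochs_done, base_lr=0.1, start_decay=2.0, halve_every=0.5):
    """Learning rate after `epochs_done` (possibly fractional) epochs.

    Assumption: the first halving is applied as soon as `start_decay`
    epochs are completed, and again after each further `halve_every` epochs.
    """
    if epochs_done < start_decay:
        return base_lr
    num_halvings = math.floor((epochs_done - start_decay) / halve_every) + 1
    return base_lr * 0.5 ** num_halvings
```

Under this assumption, a 4-epoch run uses four decayed rates, one for each half epoch between epochs 2 and 4.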

MT Reranking with NLMs
Our MT models are built using the Phrasal MT toolkit (Cer et al., 2010). In addition to the standard dense feature set 6, we include a variety of sparse features for rules, word pairs, and word classes, as described in (Green et al., 2014). Our decoder uses three language models. 7 We use a tuning set of 396K words in the newswire and web domains and tune our systems using online expected error rate training as in (Green et al., 2014). Our tuning metric is (BLEU-TER)/2.
We run a discriminative reranker on the 1000-best output of a decoder with MERT. The features used in reranking include all the dense features, an aggregate decoder score, and an NLM score. We learn the reranker weights on a second tuning set, different from the decoder tuning set, to make the reranker less biased towards the dense features. This second tuning set consists of 33K words of web-forum text and is important to obtain good improvements with reranking.
5 As a reference point, though not directly comparable, Devlin et al. (2014) achieved 0.68 for |log Z| on a different test set with the same self-normalized constant α = 0.1.
6 Consisting of forward and backward translation models, lexical weighting, linear distortion, word penalty, phrase penalty and language model.
7 One is trained on the English side of the bitext, one is trained on a 16.3-billion-word monolingual corpus taken from various domains, and one is a class-based language model trained on the same large monolingual corpus.
Table 2: Web-forum Results - TER (T) and BLEU (B) scores on both the dev set (dev10wb dev), used to tune reranking weights, and the test sets (dev10wb syscomtune and p1r6 dev accordingly). Relative improvements between the best system and the baseline as well as the 1-layer model are bolded. † marks improvements that are statistically significant (p < 0.05).
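The reranking step described in this section can be sketched as follows, assuming the reranker scores each hypothesis with a weighted sum of its features (dense features, an aggregate decoder score, and an NLM score) and returns the highest-scoring one. The linear form, the feature names, and the data layout are assumptions made for illustration. The sketch does not show how the weights are learned.

```python
def rerank(nbest, weights):
    """Return the hypothesis in `nbest` with the highest weighted feature sum.

    nbest:   list of dicts, each with a "text" string and a "features" dict
             mapping feature names to values (hypothetical layout).
    weights: dict mapping feature names to reranker weights; features
             without a weight contribute nothing.
    """
    def score(hyp):
        return sum(weights.get(name, 0.0) * value
                   for name, value in hyp["features"].items())

    return max(nbest, key=score)


# Toy 2-best list with an aggregate decoder score and an NLM log-probability.
nbest = [
    {"text": "hypothesis a", "features": {"decoder": -10.0, "nlm": -5.0}},
    {"text": "hypothesis b", "features": {"decoder": -10.5, "nlm": -3.0}},
]
best = rerank(nbest, {"decoder": 1.0, "nlm": 0.5})
```

In this toy example the decoder alone prefers the first hypothesis, while adding the NLM score with weight 0.5 moves the second hypothesis to the top.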

Results
As shown in Table 2, it is not obvious whether the 2-layer model is better than the single-layer one; these are the two depths that past work used. In contrast, reranking with deep NLMs of three or four layers is clearly better, yielding average improvements of 1.0 TER / 1.0 BLEU points over the baseline and 0.5 TER / 0.5 BLEU points over the system reranked with the 1-layer model, all of which are statistically significant according to the test described in (Riezler and Maxwell, 2005). It is interesting that the improvements in terms of the intrinsic metrics listed in Table 1 do translate into gains in translation quality. This reinforces the trend reported in (Luong et al., 2015) that better source-conditioned perplexities lead to better translation scores. This phenomenon is a useful result because, in the past, many intrinsic metrics, e.g., alignment error rate, did not necessarily correlate with MT quality metrics.

Domain Adaptation
For the sms-chat domain, we use a tuning set of 260K words in the newswire, web, and sms-chat domains to tune the decoder weights and a sepa-   Table 3 shows that on the test set, our deep NLM with three layers yields a significant gain of 2.1 TER / 3.1 BLEU points over the baseline and 1.0 TER / 0.5 BLEU points over the 1-layer reranked system. It is worth pointing out that for such a small amount of out-domain training data, depth becomes less effective, as exhibited by the insignificant BLEU gain in test and a drop in dev when comparing the 1- and 3-layer models. We exclude the 4-layer NLM as it seems to have overfitted the training data. Nevertheless, we still achieve decent gains in using NLMs for MT domain adaptation.

NLM Training
We show in Figure 1 the learning curves for various NLMs, demonstrating that deep nets are better than the shallow NLM with a single hidden layer. From minibatch 20K onwards, the ranking is generally maintained: deeper NLMs have better cross-entropies. The gaps become less discernible from minibatch 30K onwards, but numerically, as the model becomes deeper, the average gaps, in perplexities, are consistently 40.1, 1.1, and 2.0.
8 Our sms-chat corpus consists of 146K sentences (1.6M Chinese and 1.9M English words). We randomly select 3000 sentences for validation and 3000 sentences for test. Models are trained for 8 iterations with the same hyperparameters.

Reranking Settings
In Table 4, we compare reranking using all dense features (All) to conditions which use only dense LM features (LM) and optionally, include a word penalty (WP) feature. All these settings include an NLM score and an aggregate decoder score. As shown, it is best to include all dense features at reranking time.

Related Work
It is worth mentioning another active line of research in building end-to-end neural MT systems (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015; Luong et al., 2015; Jean et al., 2015). These methods have not yet demonstrated success on challenging language pairs such as English-Chinese. Arsoy et al. (2012) have preliminarily examined deep NLMs for speech recognition. However, we believe this is the first work that puts deep NLMs into the context of MT.

Conclusion
In this paper, we have filled a gap left by past work by showing that neural language models with more than two layers can help improve translation quality. Our results confirm the trend reported in (Luong et al., 2015) that source-conditioned perplexity strongly correlates with MT performance. We have also demonstrated the use of deep NLMs to obtain decent gains in out-of-domain conditions.