Memory-enhanced Decoder for Neural Machine Translation

We propose to enhance the RNN decoder in a neural machine translator (NMT) with external memory, as a natural but powerful extension of the state in the decoding RNN. This memory-enhanced RNN decoder is called \textsc{MemDec}. At each time step during decoding, \textsc{MemDec} reads from and writes to this memory once, both with content-based addressing. Unlike the unbounded memory used in previous work (Bahdanau et al., 2014) to store the representation of the source sentence, the memory in \textsc{MemDec} is a matrix of pre-determined size, designed to better capture the information important to the decoding process at each time step. Our empirical study on Chinese-English translation shows that it improves by $4.8$ BLEU upon GroundHog and $5.3$ BLEU upon Moses, yielding the best performance achieved with the same training set.


Introduction
The introduction of external memory has greatly expanded the representational capability of neural network-based models for sequences (Graves et al., 2014), by providing flexible ways of storing and accessing information. More specifically, in neural machine translation, one great improvement came from using an array of vectors to represent the source in a sentence-level memory and dynamically accessing relevant segments of it (alignment) (Bahdanau et al., 2014) through content-based addressing (Graves et al., 2014). The success of RNNsearch demonstrated the advantage of saving an entire sentence of arbitrary length in an unbounded memory for operations at the next stage (e.g., decoding).
In this paper, we show that an external memory can be used to facilitate the decoding/generation process through a memory-enhanced RNN decoder, called MEMDEC. The memory in MEMDEC is a direct extension of the state in the decoder, and is therefore functionally closer to the memory cell in LSTM (Hochreiter and Schmidhuber, 1997). It takes the form of a matrix of pre-determined size; each column ("a memory cell") can be accessed by the decoding RNN with content-based addressing for both reading and writing during the decoding process. This memory is designed to provide a more flexible way to select, represent and synthesize the information of the source sentence and the previously generated target words relevant to the decoding. This is in contrast to the set of hidden states of the entire source sentence (which can be viewed as another form of memory) used for attentive read, but can be combined with it to greatly improve the performance of the neural machine translator. We apply our model to Chinese-English translation tasks, achieving performance superior to any published results, SMT or NMT, on the same training data (Xie et al., 2011; Meng et al., 2015; Tu et al., 2016; Hu et al., 2015). Our contributions are mainly two-fold:
• we propose a memory-enhanced decoder for neural machine translation which naturally extends the RNN with a vector state;
• our empirical study on Chinese-English translation tasks shows the efficacy of the proposed model.

Roadmap
In the remainder of this paper, we will first give a brief introduction to attention-based neural machine translation in Section 2, presented from the view of encoder-decoder, which treats the hidden states of source as an unbounded memory and the attention model as a content-based reading.
In Section 3, we will elaborate on the memory-enhanced decoder MEMDEC. In Section 4, we will apply NMT with MEMDEC to a Chinese-English task. Then in Section 5 and 6, we will give related work and conclude the paper.

Neural machine translation with attention
Our work is built on attention-based NMT (Bahdanau et al., 2014), which represents the source sentence as a sequence of vectors after processing by an RNN or bi-directional RNN, and then simultaneously conducts dynamic alignment and generation of the target sentence with another RNN. Attention-based NMT, with RNNsearch as its most popular representative, generalizes the conventional notion of encoder-decoder by using an unbounded memory for the intermediate representation of the source sentence and a content-based addressing read in decoding, as illustrated in Figure 1. More specifically, at time step $t$, RNNsearch first gets the context vector $c_t$ by reading from the source representation $M_S$, which is then used to update the state and generate the word $y_t$ (along with the current hidden state $s_t$ and the previously generated word $y_{t-1}$). Formally, given an input sequence $x = [x_1, x_2, \ldots, x_{T_x}]$ and the previously generated sequence $y_{<t} = [y_1, y_2, \ldots, y_{t-1}]$, the probability of the next word $y_t$ is
$$p(y_t \mid y_{<t}, x) = f(c_t, y_{t-1}, s_t),$$
where $s_t$ is the state of the decoder RNN at time step $t$, calculated as
$$s_t = g(s_{t-1}, y_{t-1}, c_t),$$
where $g(\cdot)$ can be any activation function; here we adopt a more sophisticated dynamic operator as in the Gated Recurrent Unit (GRU). In the remainder of the paper, we will also use GRU to stand for this operator. The reading $c_t$ is calculated as
$$c_t = \sum_{j=1}^{T_x} \alpha_{t,j} h_j,$$
where $h_j$ is the $j^{th}$ cell in memory $M_S$. More formally, $h_j$ is the annotation of $x_j$ and contains information about the whole input sequence with a strong focus on the parts surrounding $x_j$; it is computed by a bidirectional RNN. The weight $\alpha_{t,j}$ is computed by
$$\alpha_{t,j} = \frac{\exp(e_{t,j})}{\sum_{k=1}^{T_x} \exp(e_{t,k})},$$
where $e_{t,j} = v_a^\top \tanh(W_a s_{t-1} + U_a h_j)$ scores how well $s_{t-1}$ and the memory cell $h_j$ match. This is called automatic alignment (Bahdanau et al., 2014) or attention model (Luong et al., 2015), but it is essentially reading with content-based addressing as defined in (Graves et al., 2014). With this addressing strategy the decoder can attend to the source representation that is most relevant to the current stage of decoding.
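As a concrete sketch, the content-based read above can be written in a few lines of NumPy. All dimensions, variable names and random initializations here are illustrative, not the paper's actual configuration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_read(s_prev, M_S, W_a, U_a, v_a):
    """Content-based read from the source memory M_S (one row per cell h_j)."""
    # e_{t,j} = v_a^T tanh(W_a s_{t-1} + U_a h_j)
    scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in M_S])
    alpha = softmax(scores)        # alignment weights alpha_{t,j}
    c_t = alpha @ M_S              # context vector c_t = sum_j alpha_{t,j} h_j
    return c_t, alpha

rng = np.random.default_rng(0)
d, Tx = 4, 5
M_S = rng.normal(size=(Tx, d))     # source memory: Tx cells of dimension d
s_prev = rng.normal(size=d)
W_a, U_a = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v_a = rng.normal(size=d)

c_t, alpha = attention_read(s_prev, M_S, W_a, U_a, v_a)
```

The read is "content-based" in that the weights depend only on how well each cell matches the query state, not on the cell's position.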

Improved Attention Model
The alignment model $\alpha_{t,j}$ scores how well the output at position $t$ matches the inputs around position $j$, based on $s_{t-1}$ and $h_j$. It is intuitively beneficial to exploit the information of $y_{t-1}$ when reading from $M_S$, which is missing from the original implementation of attention-based NMT. In this work, we build a more effective alignment path by feeding both the previous hidden state $s_{t-1}$ and the context word $y_{t-1}$ to the attention model, inspired by a recent implementation of attention-based NMT. 1 Formally, the calculation of $e_{t,j}$ becomes
$$e_{t,j} = v_a^\top \tanh(W_a \tilde{s}_{t-1} + U_a h_j),$$
where
• $\tilde{s}_{t-1} = H(s_{t-1}, e_{y_{t-1}})$ is an intermediate state tailored for reading from $M_S$, with the information of $y_{t-1}$ (its word embedding being $e_{y_{t-1}}$) added;
• $H$ is a nonlinear function, which can be as simple as tanh or as complex as GRU. In our preliminary experiments, we found GRU works slightly better than the tanh function, but we chose the latter for simplicity.
1 github.com/nyu-dl/dl4mt-tutorial/tree/master/session2
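A minimal sketch of the intermediate state, using the simple tanh variant of $H$ actually adopted. The two weight matrices and all dimensions are illustrative assumptions, not the paper's parameterization:

```python
import numpy as np

def intermediate_state(s_prev, e_y_prev, W_h, U_h):
    """s~_{t-1} = H(s_{t-1}, e_{y_{t-1}}), with H chosen as the simple tanh variant."""
    return np.tanh(W_h @ s_prev + U_h @ e_y_prev)

rng = np.random.default_rng(1)
d = 4
s_tilde = intermediate_state(rng.normal(size=d), rng.normal(size=d),
                             rng.normal(size=(d, d)), rng.normal(size=(d, d)))
```

The resulting `s_tilde` simply replaces $s_{t-1}$ as the query in the attention scores, so the alignment can react to the word just emitted.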

Decoder with External Memory
In this section we will elaborate on the proposed memory-enhanced decoder MEMDEC. In addition to the source memory $M_S$, MEMDEC is equipped with a buffer memory $M_B$ as an extension to the conventional state vector. Figure 3 contrasts MEMDEC with the decoder in RNNsearch (Figure 1) at a high level. In the remainder of the paper, we will refer to the conventional state as vector-state (denoted $s_t$) and its memory extension as memory-state (denoted $M^B_t$). Both states are updated at each time step in an interweaving fashion, while the output symbol $y_t$ is predicted based solely on the vector-state $s_t$ (along with $c_t$ and $y_{t-1}$). The diagram of this memory-enhanced decoder is given in Figure 2.
Vector-State Update At time $t$, the vector-state update starts with reading $r_{t-1}$ from the memory-state $M^B_{t-1}$:
$$r_{t-1} = \text{read}(s_{t-1}, M^B_{t-1}), \quad (4)$$
which then meets the previous prediction $y_{t-1}$ to form an "intermediate" state-vector
$$\tilde{s}_t = H(r_{t-1}, e_{y_{t-1}}), \quad (5)$$
where $e_{y_{t-1}}$ is the word-embedding associated with the previous prediction $y_{t-1}$. This pre-state $\tilde{s}_t$ is used to read the source memory $M_S$:
$$c_t = \text{read}(\tilde{s}_t, M_S). \quad (6)$$
Both readings in Eq. (4) & (6) follow content-based addressing (Graves et al., 2014) (details later in Section 3.1). After that, $r_{t-1}$ is combined with the output symbol $y_{t-1}$ and $c_t$ to update the new vector-state
$$s_t = \text{GRU}(r_{t-1}, e_{y_{t-1}}, c_t). \quad (7)$$
The update of the vector-state is illustrated in Figure 4.
Memory-State Update As illustrated in Figure 5, the update for the memory-state is simple after the update of the vector-state: with the vector-state $s_{t+1}$ the updated memory-state will be
$$M^B_{t+1} = \text{write}(s_{t+1}, M^B_t). \quad (8)$$
The writing to the memory-state is also content-based, with the same forgetting mechanism suggested in (Graves et al., 2014), which we will elaborate in more detail later in this section.
Prediction As illustrated in Figure 6, the prediction model is the same as in RNNsearch, where the score for word $y$ is given by
$$\text{score}(y) = \omega_y^\top \phi(s_t, c_t, e_{y_{t-1}}), \quad (9)$$
where $\omega_y$ is the parameter vector associated with word $y$. The probability of generating word $y$ at time $t$ is then given by a softmax over the scores:
$$p(y \mid s_t, c_t, y_{t-1}) = \frac{\exp(\text{score}(y))}{\sum_{y'} \exp(\text{score}(y'))}.$$
Figure 6: Prediction at time t.
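The interleaved reads and updates above can be sketched end-to-end in NumPy. This is an illustrative simplification under loud assumptions: all parameter names and dimensions are invented for the example, both reads share one generic content-based addressing routine, and the GRU of Eq. (7) is replaced by a plain tanh recurrence to keep the sketch short:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def content_read(query, M, W, U, v):
    """Generic content-based addressing: score every cell against the query, then average."""
    w = softmax(np.array([v @ np.tanh(W @ query + U @ cell) for cell in M]))
    return w @ M

def decode_step(s_prev, y_prev_emb, M_B, M_S, P):
    r = content_read(s_prev, M_B, P['Wr'], P['Ur'], P['vr'])     # read memory-state (Eq. 4)
    s_tilde = np.tanh(P['Ws'] @ r + P['Wy'] @ y_prev_emb)        # intermediate pre-state
    c = content_read(s_tilde, M_S, P['Wa'], P['Ua'], P['va'])    # read source memory (Eq. 6)
    # the model uses a GRU here; a plain tanh recurrence keeps the sketch short
    s_t = np.tanh(P['U'] @ r + P['V'] @ y_prev_emb + P['C'] @ c)
    return s_t

rng = np.random.default_rng(2)
d, n, Tx = 4, 8, 5
P = {k: rng.normal(size=(d, d)) for k in ['Wr', 'Ur', 'Ws', 'Wy', 'Wa', 'Ua', 'U', 'V', 'C']}
P['vr'], P['va'] = rng.normal(size=d), rng.normal(size=d)
s_t = decode_step(rng.normal(size=d), rng.normal(size=d),
                  rng.normal(size=(n, d)), rng.normal(size=(Tx, d)), P)
```

The point of the sketch is the data flow: the memory-state is consulted before the source memory, and only the final vector-state feeds the word prediction.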

Reading Memory-State
Formally $M^B_t \in \mathbb{R}^{n \times m}$ is the memory-state at time $t$ after the memory-state update, where $n$ is the number of memory cells and $m$ is the dimension of the vector in each cell. Before the vector-state update at time $t$, the output of reading, $r_t$, is given by
$$r_t = \sum_{i=1}^{n} w^R_t(i)\, M^B_t(i),$$
where $w^R_t \in \mathbb{R}^n$ specifies the normalized weights assigned to the cells in $M^B_t$. Similar to the reading from $M_S$ (a.k.a. the attention model), we use content-based addressing to determine $w^R_t$. More specifically, $w^R_t$ is also updated from the previous weights $w^R_{t-1}$ as
$$w^R_t = g^R_t\, w^R_{t-1} + (1 - g^R_t)\, \tilde{w}^R_t, \quad (10)$$
where
• $g^R_t = \sigma(w^{R\top}_g s_t)$ is the gate function, with parameters $w^R_g \in \mathbb{R}^m$; (11)
• $\tilde{w}^R_t$ gives the contribution based on the current vector-state $s_t$,
$$\tilde{w}^R_t = \text{softmax}(a^R_t), \quad a^R_t(i) = v^\top \tanh(W^R_a s_t + U^R_a M^B_{t-1}(i)), \quad (12)$$
with parameters $W^R_a, U^R_a \in \mathbb{R}^{m \times m}$ and $v \in \mathbb{R}^m$.
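A minimal NumPy sketch of the gated read-weight update of Eq. (10)-(12). Parameter shapes and names are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def read_weights(w_prev, s, M_B, W_a, U_a, v, w_g):
    """Gated update of the normalized read weights over the n memory cells."""
    g = 1.0 / (1.0 + np.exp(-(w_g @ s)))            # scalar interpolation gate (Eq. 11)
    a = np.array([v @ np.tanh(W_a @ s + U_a @ cell) for cell in M_B])
    w_tilde = softmax(a)                            # content-based proposal (Eq. 12)
    return g * w_prev + (1.0 - g) * w_tilde         # interpolate with previous weights (Eq. 10)

rng = np.random.default_rng(3)
d, n = 4, 8
w_prev = np.full(n, 1.0 / n)                        # start from uniform weights
w_t = read_weights(w_prev, rng.normal(size=d), rng.normal(size=(n, d)),
                   rng.normal(size=(d, d)), rng.normal(size=(d, d)),
                   rng.normal(size=d), rng.normal(size=d))
```

Because the output is a convex combination of two probability vectors, the result remains a valid distribution over cells, so the gate smoothly trades off sticking with the previous focus versus re-addressing by content.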

Writing to Memory-State
There are two types of operation for writing to the memory-state: ERASE and ADD. Erasing is similar to the forget gate in LSTM or GRU, which determines the content to be removed from memory cells. More specifically, the vector $\mu^{ERS}_t \in \mathbb{R}^m$ specifies the values to be removed on each dimension of the memory cells, which is then assigned to each cell through the normalized weights $w^W_t$. Formally, the memory-state after ERASE is given by
$$\tilde{M}^B_t(i) = M^B_{t-1}(i)\,\big(1 - w^W_t(i)\, \mu^{ERS}_t\big), \quad i = 1, \ldots, n, \quad (13)$$
where
• $\mu^{ERS}_t = \sigma(W^{ERS} s_t)$;
• $w^W_t(i)$ specifies the weight associated with the $i^{th}$ cell, in the same parametric form as in Eq. (10)-(12) with generally different parameters.
The ADD operation is similar to the update gate in LSTM or GRU, deciding how much current information should be written to the memory:
$$M^B_t(i) = \tilde{M}^B_t(i) + w^W_t(i)\, \mu^{ADD}_t, \quad (14)$$
where $\mu^{ADD}_t = \sigma(W^{ADD} s_t) \in \mathbb{R}^m$ and $W^{ADD} \in \mathbb{R}^{m \times m}$. In our experiments, we have a peculiar but interesting observation: it is often beneficial to use the same weights for both reading (i.e., $w^R_t$) and writing (i.e., $w^W_t$).
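The ERASE-then-ADD write can be sketched as follows. The choice of a sigmoid for both $\mu^{ERS}_t$ and $\mu^{ADD}_t$, and all shapes and names, are assumptions made for the example:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def write_memory(M_B, s_t, w_w, W_ers, W_add):
    """ERASE-then-ADD write, weighted per cell by the normalized weights w_w."""
    mu_ers = sigmoid(W_ers @ s_t)              # what to remove on each dimension
    mu_add = sigmoid(W_add @ s_t)              # what to add on each dimension
    M = M_B * (1.0 - np.outer(w_w, mu_ers))    # ERASE step, Eq. (13)
    return M + np.outer(w_w, mu_add)           # ADD step, Eq. (14)

rng = np.random.default_rng(4)
d, n = 4, 8
w_w = np.full(n, 1.0 / n)                      # uniform write weights for the demo
M_new = write_memory(rng.normal(size=(n, d)), rng.normal(size=d), w_w,
                     rng.normal(size=(d, d)), rng.normal(size=(d, d)))
```

Cells with near-zero write weight are left essentially untouched, which is what allows long-term content to survive many decoding steps.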

Some Analysis
The writing operation in Eq. (13) at time $t$ can be viewed as a nonlinear way to combine the previous memory-state $M^B_{t-1}$ and the newly updated vector-state $s_t$, where the nonlinearity comes from both the content-based addressing and the gating. This is in a way similar to the update of states in a regular RNN, while we conjecture that the addressing strategy in MEMDEC makes it easier to selectively update some content (e.g., the relatively short-term content) while keeping other content less modified (e.g., the relatively long-term content).
The reading operation in Eq. (10) can "extract" the content from $M^B_t$ relevant to the alignment (reading from $M_S$) and the prediction task at time $t$. This is in contrast to the regular RNN decoder, including its gated variants, which uses the entire state vector for this purpose. As one advantage, although only part of the information in $M^B_t$ is used at time $t$, the entire memory-state, which may store other information useful later, will be carried over to time $t+1$ for the memory-state update (writing).

Experiments on Chinese-English Translation
We test the memory-enhanced decoder on the task of Chinese-to-English translation, where MEMDEC is put on top of the same encoder as in RNNsearch.

Datasets and Evaluation metrics
Our training data for the translation task consists of 1.25M sentence pairs extracted from LDC corpora 2 , with 27.9M Chinese words and 34.5M English words respectively. We choose the NIST 2002 (MT02) dataset as our development set, and the NIST 2003 (MT03), 2004 (MT04), 2005 (MT05) and 2006 (MT06) datasets as our test sets. We use the case-insensitive 4-gram NIST BLEU score as our evaluation metric (Papineni et al., 2002).

Experiment settings
Hyper parameters In training the neural networks, we limit the source and target vocabularies to the most frequent 30K words in both Chinese and English, covering approximately 97.7% and 99.3% of the two corpora respectively. The dimension of the word embeddings is 512 and the size of the hidden layer is 1024. The dimension of each cell in $M_B$ is set to 1024 and the number of cells $n$ is set to 8.
Training details We initialize the recurrent weight matrices as random orthogonal matrices. All the bias vectors are initialized to zero. For other parameters, we initialize them by sampling each element from the Gaussian distribution with mean 0 and variance $0.01^2$. Parameter optimization is performed using stochastic gradient descent; Adadelta (Zeiler, 2012) is used to automatically adapt the learning rate of each parameter ($\epsilon = 10^{-6}$ and $\rho = 0.95$). To avoid gradient explosion, gradients of the cost function whose $\ell_2$ norm exceeded a predefined threshold of 1.0 were rescaled to the threshold (Pascanu et al., 2013). Each SGD update uses a mini-batch of 80 sentences. We train our NMT model on sentences of up to 50 words in the training data, while for the Moses system we use the full training data.
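The norm-threshold rescaling described above (Pascanu et al., 2013) can be sketched as follows; the helper name and demo gradients are invented for the example:

```python
import numpy as np

def clip_by_global_norm(grads, threshold=1.0):
    """Rescale the whole gradient list when its global L2 norm exceeds the threshold."""
    norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]
    return grads

grads = [np.ones((3, 3)), 2.0 * np.ones(5)]        # global norm sqrt(9 + 20) > 1
clipped = clip_by_global_norm(grads, threshold=1.0)
new_norm = np.sqrt(sum(float(np.sum(g * g)) for g in clipped))
```

Rescaling all gradients jointly (rather than per parameter) preserves the direction of the update while bounding its magnitude.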
Memory Initialization Each memory cell is initialized from the source sentence hidden states, computed as
$$M^B_0(i) = \mathbf{m} + \nu_i, \quad \mathbf{m} = \sigma\Big(W^{INI}\, \frac{1}{T_x}\sum_{j=1}^{T_x} h_j\Big),$$
where $W^{INI} \in \mathbb{R}^{m \times 2m}$; $\sigma$ is the tanh function. $\mathbf{m}$ makes a nonlinear transformation of the source sentence information. $\nu_i$ is a random vector sampled from $\mathcal{N}(0, 0.1)$.
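A sketch of this initialization, assuming each cell is the tanh-transformed average of the bidirectional annotations plus its own Gaussian offset; the function name and dimensions are illustrative:

```python
import numpy as np

def init_memory(H_src, W_ini, n_cells, noise_std=0.1, seed=0):
    """Initialize each of the n memory cells from a summary of the source annotations."""
    rng = np.random.default_rng(seed)
    m_vec = np.tanh(W_ini @ H_src.mean(axis=0))    # nonlinear summary of the source
    # each cell gets the same summary plus its own random offset nu_i
    return np.stack([m_vec + noise_std * rng.normal(size=m_vec.shape)
                     for _ in range(n_cells)])

rng = np.random.default_rng(5)
m_dim, Tx = 4, 6
H_src = rng.normal(size=(Tx, 2 * m_dim))           # bidirectional annotations (2m dims)
M_B0 = init_memory(H_src, rng.normal(size=(m_dim, 2 * m_dim)), n_cells=8)
```

The per-cell noise breaks the symmetry between cells, so content-based addressing can learn to specialize them.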
Dropout We also use dropout for our NMT baseline model and MEMDEC to avoid over-fitting (Hinton et al., 2012). The key idea is to randomly drop units (along with their connections) from the neural network during training, which prevents units from co-adapting too much. In the simplest case, each unit is omitted with a fixed probability $p$, namely the dropout rate. In our experiments, dropout is applied only on the output layer, with the dropout rate set to 0.5. We also tried other strategies, such as dropout on the word embeddings or RNN hidden states, but failed to get further improvements.
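The mechanism can be sketched as follows. The inverted-dropout formulation (rescaling surviving units at training time) is a common implementation choice assumed here, not necessarily the one used in the paper:

```python
import numpy as np

def dropout(x, p=0.5, train=True, seed=0):
    """Inverted dropout: zero each unit with probability p, rescale the survivors by 1/(1-p)."""
    if not train:
        return x                                  # no-op at test time
    rng = np.random.default_rng(seed)
    mask = (rng.random(x.shape) >= p).astype(x.dtype)
    return x * mask / (1.0 - p)

out = dropout(np.ones(1000), p=0.5)
kept = float((out > 0).mean())                    # fraction of surviving units
```

The 1/(1-p) rescaling keeps the expected activation unchanged, so no correction is needed at test time.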
Pre-training For MEMDEC, the objective function is a highly non-convex function of the parameters, with a more complicated landscape than that of a decoder without external memory, rendering direct optimization over all the parameters rather difficult. Inspired by the effort on easing the training of very deep architectures (Hinton and Salakhutdinov, 2006), we propose a simple pre-training strategy. First we train a regular attention-based NMT model without external memory. Then we use the trained NMT model to initialize the parameters of the encoder and the parameters of MEMDEC, except those related to the memory-state (i.e., $\{W^R_a, U^R_a, v, w^R_g, W^{ERS}, W^{ADD}\}$). After that, we fine-tune all the parameters of NMT with the MEMDEC decoder, including the parameters initialized by pre-training and those associated with accessing the memory-state.

Comparison systems
We compare our method with three state-of-the-art systems:
• Moses: an open source phrase-based translation system 3 , with default configuration and a 4-gram language model trained on the target portion of the training data.
• RNNSearch: an attention-based NMT model with default settings. We use the open source system GroundHog as our NMT baseline 4 .
• Coverage model: a state-of-the-art variant of attention-based NMT model (Tu et al., 2016) which improves the attention mechanism through modelling a soft coverage on the source representation.

Results
The main results of the different models are given in Table 1. For MEMDEC the number of cells is set to 8. The strongest comparison system is the coverage model (Tu et al., 2016), which achieves the best published result on this training set.
We also compare MEMDEC with the COVERAGE mechanism (Tu et al., 2016); our implementation of the latter is about 2 BLEU higher than the published result after adding fast attention and dropout. In this comparison MEMDEC wins with a big margin (+1.46 BLEU). Pre-training plays an important role in optimizing the memory model. As can be seen in Table 2, pre-training improves upon our baseline by +1.11 BLEU on average, but even without pre-training our model still gains +1.04 BLEU on average. Our model is rather robust to the memory size: with merely four cells, our model is over 2 BLEU higher than RNNsearch. This further verifies our conjecture that the external memory is mostly used to store part of the source and the history of the target sentence.

Case study
We show in Table 5 sample translations from Chinese to English, comparing mainly MEMDEC and the RNNsearch model used for its pre-training. It is appealing to observe that MEMDEC can produce more fluent translation results and better grasp the semantic information of the sentence.

Related Work
There is a long thread of work aiming to improve the ability of RNN in remembering long sequences, with the long short-term memory RNN (LSTM) (Hochreiter and Schmidhuber, 1997) being the most salient example and GRU being the most recent one. Those works focus on designing the dynamics of the RNN through new dynamic operators and appropriate gating, while still keeping the vector form of RNN states. MEMDEC, on top of the gated RNN, explicitly adds matrix-form memory equipped with content-based addressing to the system, hence greatly improving the power of the decoder RNN in representing the information important for the translation task.

Table 5 (excerpt), sample translations:
MEMDEC: The delegation told the US today that the Bush administration has approved the US delegation's visit to north Korea from 6 to 10 January.
base: The delegation told the US that the Bush administration has approved the US to begin his visit to north Korea from 6 to 10 January.

MEMDEC is obviously related to the recent effort on attaching an external memory to neural networks, with the two most salient examples being the Neural Turing Machine (NTM) (Graves et al., 2014) and Memory Network (Weston et al., 2014). In fact MEMDEC can be viewed as a special case of NTM, with specifically designed reading (from two different types of memory) and writing mechanisms for the translation task. Quite remarkably, MEMDEC is among the rare instances of NTM which significantly improve upon the state of the art on a real-world NLP task with a large training corpus.
Our work is also related to the recent work on machine reading (Cheng et al., 2016), in which the machine reader is equipped with a memory tape, enabling the model to directly read all the previous hidden states with an attention mechanism. Different from their work, we use an external bounded memory and make an abstraction of the previous information. In (Meng et al., 2015), Meng et al. also proposed a deep architecture for sequence-to-sequence learning with stacked layers of memory to store intermediate representations, while our external memory is applied within a sequence.

Conclusion
We propose to enhance the RNN decoder in a neural machine translator (NMT) with external memory. Our empirical study on Chinese-English translation shows that it can significantly improve the performance of NMT.