Plan, Attend, Generate: Character-Level Neural Machine Translation with Planning

We investigate the integration of a planning mechanism into an encoder-decoder architecture with attention. We develop a model that can plan ahead when computing alignments between the source and target sequences, not only for a single time-step but for the next k time-steps as well, by constructing a matrix of proposed future alignments and a commitment vector that governs whether to follow or recompute the plan. This mechanism is inspired by the strategic attentive reader and writer (STRAW) model, a recent neural architecture for planning with hierarchical reinforcement learning that can also learn higher-level temporal abstractions. Our proposed model is end-to-end trainable with differentiable operations. We show that it outperforms strong baselines on a character-level translation task from WMT'15 with fewer parameters, and that it computes alignments that are qualitatively intuitive.


Introduction
Character-level neural machine translation (NMT) is an attractive research problem (Lee et al., 2016; Chung et al., 2016; Luong and Manning, 2016) because it addresses important issues encountered in word-level NMT. Word-level NMT systems can suffer from problems with rare words (Gulcehre et al., 2016) or data sparsity, and the existence of compound words without explicit segmentation in certain language pairs can make learning alignments and translations more difficult. Character-level neural machine translation mitigates these issues.
In this work we propose integrating a planning algorithm with the standard encoder-decoder architecture for character-level NMT, using planning specifically to improve the alignment between source and target sequences. We cast alignment (also called attention) as a planning problem, whereas it has traditionally been treated as a search problem.
The model we propose creates an explicit plan of source-target alignments to use at future time-steps, based on its current observation and a summary of its past actions; it may modify this plan as needed. The planning mechanism itself is inspired by the strategic attentive reader and writer (STRAW) of Vezhnevets et al. (2016).
Our work is motivated by the intuition that, although natural language (speech and writing) is generated sequentially because of human physiological constraints, it is almost certainly not conceived word-by-word.
Planning, i.e., choosing some goal along with candidate macro-actions to arrive at it, is one way to induce coherence in natural language. Learning to generate long coherent sequences, or to form alignments over long source contexts, is difficult for existing models. In the case of machine translation, the performance of encoder-decoder models with attention deteriorates as sequence length increases (Cho et al., 2014; Sutskever et al., 2014). This effect can be more pronounced in character-level NMT, because sequences in character-level translation can be much longer than in word-level translation. A planning mechanism could make the decoder's search for alignments more tractable and scalable.
Our model is based on the well-known encoder-decoder framework for NMT. Its encoder is a recurrent neural network (RNN) that reads the source (a sequence of byte pairs representing text in some language) and encodes it as a sequence of vector representations; the decoder is a second RNN that generates the target translation character-by-character in the target language. The decoder uses an attention mechanism to align its internal state to vectors in the source encoding that are relevant to the current generation step (see Bahdanau et al. (2015) for the original description). To plan ahead explicitly rather than focusing primarily on what is relevant at the present time, our model's internal state is augmented with (i) an action plan matrix and (ii) a commitment plan vector. The action plan matrix is a template of alignments that the model intends to follow at future time-steps, specifically a sequence of probability distributions over source tokens. The commitment plan vector governs whether to recompute the action plan or to continue following it, and as such models discrete decisions.
Because of computational constraints we here apply planning only on the input sequence, via searching for alignments. We find this alignment-based planning to be helpful in the translation task. For other NLP tasks, however, planning could be applied explicitly for generation as well. Recent work by Bahdanau et al. (2016) on actor-critic methods for sequence prediction, for example, can be seen as this kind of generative planning. We evaluate our model and report results on character-level translation tasks from WMT'15 for English to German, English to Finnish, and English to Czech language pairs. On almost all pairs we observe improvements over a baseline that represents the state-of-the-art in neural character-level translation. In our NMT experiments, our model outperforms the baseline despite using significantly fewer parameters and converges faster in training.

Planning for Character-level Neural Machine Translation
We now describe how to integrate a planning mechanism into a sequence-to-sequence architecture with attention (Bahdanau et al., 2015). Our model first creates a plan, then computes a soft alignment based on the plan, and generates at each time-step in the decoder. We refer to our model as PAG (Plan-Attend-Generate).

Notation and Encoder
As input our model receives a sequence of tokens, X = (x_0, ..., x_{|X|}), where |X| denotes the length of X. It processes these with the encoder, a bidirectional RNN. At each input position i we obtain the annotation vector h_i = [h_i^fwd ; h_i^bwd], the concatenation of the forward and backward encoder states, where h_i^fwd denotes the hidden state of the encoder's forward RNN and h_i^bwd denotes the hidden state of the encoder's backward RNN.
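As a concrete illustration, the annotation step can be sketched with a toy numpy RNN; this is a minimal sketch, not the GRU encoder the model actually uses, and `rnn_states` and `encode` are hypothetical helper names:

```python
import numpy as np

def rnn_states(X, W, U, b):
    """Run a minimal tanh RNN over a sequence of input vectors and
    return one hidden state per position (a toy stand-in for a GRU)."""
    h = np.zeros(U.shape[0])
    states = []
    for x in X:
        h = np.tanh(W @ x + U @ h + b)
        states.append(h)
    return states

def encode(X, fwd_params, bwd_params):
    """Annotation h_i = [h_i^fwd ; h_i^bwd]: concatenate the forward RNN
    state at position i with the backward RNN state at position i."""
    fwd = rnn_states(X, *fwd_params)
    bwd = rnn_states(X[::-1], *bwd_params)[::-1]  # read right-to-left, re-align
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(0)
d_in, d_h, T = 4, 3, 5

def make_params():
    return (rng.normal(size=(d_h, d_in)) * 0.1,
            rng.normal(size=(d_h, d_h)) * 0.1,
            np.zeros(d_h))

X = [rng.normal(size=d_in) for _ in range(T)]
H = encode(X, make_params(), make_params())  # one 2*d_h annotation per token
```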
Through the decoder the model predicts a sequence of output tokens, Y = (y_1, ..., y_{|Y|}). We denote by s_t the hidden state of the decoder RNN generating the target output token at time-step t.

Alignment and Decoder
Our goal is a mechanism that plans which parts of the input sequence to focus on for the next k time-steps of decoding. For this purpose, our model computes an alignment-plan matrix A_t ∈ R^{k×|X|} and a commitment-plan vector c_t ∈ R^k at each time-step. Matrix A_t stores the alignments for the current and the next k−1 time-steps; it is conditioned on the current input, i.e. the token y_t predicted at the previous time-step, and the current context ψ_t, which is computed from the input annotations h_i. The recurrent decoder function, f_dec-rnn(·), receives s_{t−1}, y_t, and ψ_t as inputs and computes the hidden state vector

s_t = f_dec-rnn(s_{t−1}, y_t, ψ_t). (1)

Context ψ_t is obtained as a weighted sum of the encoder annotations,

ψ_t = Σ_i α_{ti} h_i,

where the soft-alignment vector α_t = softmax(A_t[0]) ∈ R^{|X|} is a function of the first row of the alignment-plan matrix. At each time-step, we compute a candidate alignment-plan matrix Ā_t whose entry at the i-th row is

Ā_t[i] = f_align(s_{t−1}, h_j, β_t^i, y_t),

where f_align(·) is an MLP and β_t^i denotes a summary of the alignment matrix's i-th row at time t−1. The summary is computed using an MLP, f_r(·), operating row-wise on A_{t−1}: β_t^i = f_r(A_{t−1}[i]).

The commitment-plan vector c_t governs whether to follow the existing alignment plan, by shifting it forward from t−1, or to recompute it. Thus, c_t represents a discrete decision. For the model to operate discretely, we use the recently proposed Gumbel-Softmax trick (Jang et al., 2016; Maddison et al., 2016) in conjunction with the straight-through estimator to backpropagate through c_t. The model further learns the temperature for the Gumbel-Softmax as proposed in Gulcehre et al. (2017). Both the commitment vector and the alignment-plan matrix are initialized with ones; this initialization is not modified through training.

Figure 1: Our planning mechanism in a sequence-to-sequence model that learns to plan and execute alignments. Distinct from a standard sequence-to-sequence model with attention, rather than using a simple MLP to predict alignments our model makes a plan of future alignments using its alignment-plan matrix, and decides when to follow the plan by learning a separate commitment vector. We illustrate the model for a decoder with two layers: s'_t for the first layer and s_t for the second layer of the decoder. The planning mechanism is conditioned on the first layer of the decoder (s'_t).
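The soft-alignment and context computation described above (α_t = softmax(A_t[0]), ψ_t as a weighted sum of annotations) can be sketched in numpy; `attend` is a hypothetical helper name, and random values stand in for the learned plan and annotations:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def attend(A_t, H):
    """Soft alignment from the plan: the first row of the alignment-plan
    matrix A_t (shape k x |X|) gives the current step's scores over source
    positions; the context is the weighted sum of encoder annotations H."""
    alpha = softmax(A_t[0])  # alpha_t = softmax(A_t[0])
    psi = alpha @ H          # psi_t = sum_i alpha_ti * h_i
    return alpha, psi

rng = np.random.default_rng(1)
k, src_len, d = 3, 6, 4
A_t = rng.normal(size=(k, src_len))   # stand-in alignment-plan matrix
H = rng.normal(size=(src_len, d))     # stand-in encoder annotations
alpha, psi = attend(A_t, H)
```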
Alignment-plan update. Our decoder updates its alignment plan as governed by the commitment plan. Denote by g_t the first element of the discretized commitment plan c̄_t; in more detail, g_t = c̄_t[0], where the discretized commitment plan c̄_t is obtained by setting c_t's largest element to 1 and all other elements to 0. Thus, g_t is a binary indicator variable; we refer to it as the commitment switch. When g_t = 0, the decoder simply advances the time index by shifting the alignment-plan matrix A_{t−1} forward via the shift function ρ(·). When g_t = 1, the controller reads the alignment-plan matrix to produce the summary of the plan, β_t^i. We then compute the updated alignment plan by interpolating the previous alignment-plan matrix A_{t−1} with the candidate alignment-plan matrix Ā_t. The mixing ratio is determined by a learned update gate u_t ∈ R^{k×|X|}, whose elements u_{ti} correspond to tokens in the input sequence and are computed by an MLP with sigmoid activation, f_up(·):

u_{ti} = f_up(h_i, s_{t−1}),
A_t[:, i] = (1 − u_{ti}) ⊙ A_{t−1}[:, i] + u_{ti} ⊙ Ā_t[:, i].

To reiterate, the model only updates its alignment plan when the current commitment switch g_t is active. Otherwise it uses the alignments planned and committed at previous time-steps.

Commitment-plan update. The commitment plan also updates when g_t becomes 1. If g_t is 0, the shift function ρ(·) shifts the commitment vector forward and appends a 0-element. If g_t is 1, the model recomputes c_t using a single-layer MLP, f_c(·), followed by a Gumbel-Softmax, and c̄_t is recomputed by discretizing c_t as a one-hot vector: c̄_t = one_hot(c_t).
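A minimal numpy sketch of the alignment-plan update, assuming the commitment switch g, the candidate plan, and the update gate are given (in the model they come from the Gumbel-Softmax commitment vector and learned MLPs; `shift` and `update_alignment_plan` are hypothetical names):

```python
import numpy as np

def shift(M):
    """rho(.): advance the plan one step by dropping the first row
    and appending zeros at the end."""
    out = np.zeros_like(M)
    out[:-1] = M[1:]
    return out

def update_alignment_plan(A_prev, A_cand, u, g):
    """If the commitment switch g is 0, follow the old plan (shifted
    forward); if g is 1, interpolate the old plan A_prev with the
    candidate plan A_cand using the elementwise update gate u in (0, 1)."""
    if g == 0:
        return shift(A_prev)
    return (1.0 - u) * A_prev + u * A_cand

rng = np.random.default_rng(2)
k, n = 3, 5
A_prev = rng.normal(size=(k, n))
A_cand = rng.normal(size=(k, n))
u = 1.0 / (1.0 + np.exp(-rng.normal(size=(k, n))))  # stand-in sigmoid gate
followed = update_alignment_plan(A_prev, A_cand, u, g=0)    # keep the plan
recomputed = update_alignment_plan(A_prev, A_cand, u, g=1)  # revise the plan
```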
We provide pseudocode for the algorithm to compute the commitment plan vector and the action plan matrix in Algorithm 2. An overview of the model is depicted in Figure 1.

Alignment Repeat
In order to reduce the model's computational cost, we also propose an alternative approach to computing the candidate alignment-plan matrix at every step. Specifically, we propose a model variant that reuses the alignment from the previous time-step until the commitment switch activates, at which time the model computes a new alignment. We call this variant repeat, plan, attend, and generate (rPAG). rPAG can be viewed as learning an explicit segmentation with an implicit planning mechanism in an unsupervised fashion. Repetition can reduce the computational complexity of the alignment mechanism drastically; it also eliminates the need for an explicit alignment-plan matrix, which reduces the model's memory consumption as well. We provide pseudocode for rPAG in Algorithm 2.
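The reuse behavior of rPAG can be sketched as a simple loop, assuming per-step alignment scores and commitment switches are given (`rpag_alignments` is a hypothetical name; in the model both quantities are produced by learned networks):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rpag_alignments(scores_per_step, switches):
    """rPAG sketch: reuse the previous soft alignment until the
    commitment switch fires, then recompute it from fresh scores.
    scores_per_step[t] are stand-in alignment scores for step t;
    switches[t] is the commitment switch g_t."""
    alphas, alpha = [], None
    for scores, g in zip(scores_per_step, switches):
        if alpha is None or g == 1:  # recompute only on commitment
            alpha = softmax(scores)
        alphas.append(alpha)
    return alphas

rng = np.random.default_rng(3)
scores = [rng.normal(size=6) for _ in range(5)]
switches = [1, 0, 0, 1, 0]  # recompute at steps 0 and 3, reuse elsewhere
alphas = rpag_alignments(scores, switches)
```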

Training
We use a deep output layer (Pascanu et al., 2013) to compute the conditional distribution over output tokens,

p(y_t | y_{<t}, x) ∝ exp(W_o f_o(s_t, y_{t−1}, ψ_t)), (6)

where W_o is a matrix of learned parameters and we have omitted the bias for brevity. Function f_o is an MLP with tanh activation.
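A toy numpy sketch of the deep output layer in Eq. (6), with random weights standing in for learned parameters (`deep_output` is a hypothetical name):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def deep_output(s_t, y_prev, psi_t, params):
    """Deep output layer sketch: a tanh MLP f_o over the decoder state,
    previous-token embedding, and context, followed by W_o and a softmax
    over the target vocabulary (bias omitted, as in Eq. 6)."""
    W1, W_o = params
    f_o = np.tanh(W1 @ np.concatenate([s_t, y_prev, psi_t]))
    return softmax(W_o @ f_o)

rng = np.random.default_rng(4)
d_s, d_y, d_c, d_o, V = 4, 3, 4, 5, 10
params = (rng.normal(size=(d_o, d_s + d_y + d_c)) * 0.1,
          rng.normal(size=(V, d_o)) * 0.1)
p = deep_output(rng.normal(size=d_s), rng.normal(size=d_y),
                rng.normal(size=d_c), params)  # distribution over V tokens
```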
The full model, including both the encoder and decoder, is jointly trained to minimize the (conditional) negative log-likelihood

L(θ) = − Σ_n log p_θ(y^(n) | x^(n)),

where the training corpus is a set of (x^(n), y^(n)) pairs and θ denotes the set of all tunable parameters. As noted by Vezhnevets et al. (2016), the proposed model can learn to recompute very often, which decreases the utility of planning. In order to avoid this behavior, we introduce a loss proportional to how often the model commits to a new plan,

L_com = λ_com Σ_t g_t,

where λ_com is the commitment hyperparameter and k is the timescale over which plans operate.
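The training objective can be illustrated with a toy numpy sketch; `nll` and `commitment_penalty` are hypothetical names, and the penalty's exact form here (a charge per plan recomputation) is an assumption consistent with the description above:

```python
import numpy as np

def nll(token_probs):
    """Negative log-likelihood of one target sequence, given the model's
    probability for each gold token."""
    return -np.sum(np.log(token_probs))

def commitment_penalty(switches, lam_com):
    """Assumed commitment loss: charge lam_com for each time-step at which
    the plan is recomputed (g_t = 1), discouraging frequent commitment."""
    return lam_com * float(np.sum(switches))

# Stand-in per-token probabilities and commitment switches for one sequence.
probs = np.array([0.9, 0.8, 0.95])
loss = nll(probs) + commitment_penalty([1, 0, 0], lam_com=0.1)
```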

Experiments
In our NMT experiments we use byte pair encoding (BPE) (Sennrich et al., 2015) for the source sequence and characters as the target, the same setup described in Chung et al. (2016). We also use the same preprocessing as in that work. We present our experimental results in Table 2. Models were tested on the WMT'15 tasks for English to German (En→De), English to Czech (En→Cs), and English to Finnish (En→Fi) language pairs. The table shows that our planning mechanism improves translation performance over our baseline (which reproduces the results reported by Chung et al. (2016) to within a small margin). It does this with fewer updates and fewer parameters. We trained (r)PAG for 350K updates on the training set, while the baseline was trained for 680K updates. We used 600 units in (r)PAG's encoder and decoder, while the baseline used 512 units in the encoder and 1024 units in the decoder. In total our model has about 4M fewer parameters than the baseline. We tested all models with a beam size of 15.
As can be seen from Table 2, layer normalization (Ba et al., 2016) improves the performance of the PAG model significantly. However, according to our results on En→De, layer norm affects the performance of rPAG only marginally. Thus, we decided not to train rPAG with layer norm on other language pairs.
In Table 1, we present the results for PAG using the biscale decoder. In Figure 2, we show qualitatively that our model constructs smoother alignments. At each word that the baseline decoder generates, it aligns the first few characters to a word in the source sequence, but for the remaining characters it places the largest alignment weight on the last, empty token of the source sequence. This is because the baseline becomes confident of which word to generate after the first few characters, and it generates the remainder of the word mainly by relying on language-model predictions.

Figure 2: We visualize the alignments learned by PAG in (a) and by the biscale baseline model in (b). As depicted, the alignments learned by PAG look intuitively more accurate and appear smoother than those of the baseline. The baseline tends to focus too much attention on the last word of the sequence, which is sensible to do on average because of German's structure, whereas our model places higher weight on the last word mainly when it generates a space token.

Table 2: Results of different models on the WMT'15 English to German, English to Czech, and English to Finnish language pairs. We report the BLEU score of each model computed via the multi-bleu.perl script. The best score of each model for each language pair appears in boldface. We use newstest2013 as our development set, newstest2014 as our "Test 2014" set, and newstest2015 as our "Test 2015" set. † denotes the results of the baseline that we trained using the hyperparameters reported by Chung et al. (2016) and the code provided with that paper. For our baseline we report only the median result; we do not have multiple runs of our models.
We observe that (r)PAG converges faster with the help of the improved alignments, as illustrated by the learning curves in Figure 3.

Conclusions and Future Work
In this work we addressed a fundamental issue in neural generation of long sequences by integrating planning into the alignment mechanism of sequence-to-sequence architectures. We proposed two different planning mechanisms: PAG, which constructs explicit plans in the form of stored matrices, and rPAG, which plans implicitly and is computationally cheaper. The (r)PAG approach empirically improves alignments over long input sequences. We demonstrated our models' capabilities through results on character-level machine translation, an algorithmic task, and question generation. In machine translation, models with planning outperform a state-of-the-art baseline on almost all language pairs while using fewer parameters. In future work, we plan to apply our planning mechanism to the model's outputs as well, and to test it on other sequence-to-sequence tasks.