Towards Two-Dimensional Sequence to Sequence Model in Neural Machine Translation

This work investigates an alternative model for neural machine translation (NMT) and proposes a novel architecture in which a multi-dimensional long short-term memory (MDLSTM) is employed for translation modelling. State-of-the-art methods treat source and target sentences as one-dimensional sequences over time, whereas we view translation as a two-dimensional (2D) mapping and use an MDLSTM layer to define the correspondence between source and target words. We extend the current sequence to sequence backbone of NMT models to a 2D structure in which the source and target sentences are aligned with each other in a 2D grid. Our proposed topology shows consistent improvements over the attention-based sequence to sequence model on two WMT 2017 tasks, German<->English.


Introduction
The widely used state-of-the-art neural machine translation (NMT) systems are based on an encoder-decoder architecture equipped with attention layer(s).
The encoder and the decoder can be constructed using recurrent neural networks (RNNs), especially long short-term memory (LSTM) (Bahdanau et al., 2014; Wu et al., 2016), convolutional neural networks (CNNs) (Gehring et al., 2017), self-attention units (Vaswani et al., 2017), or a combination of them (Chen et al., 2018). In all these architectures, source and target sentences are handled separately as one-dimensional sequences over time. Then, an attention mechanism (additive, multiplicative or multi-head) is incorporated into the decoder to selectively focus on individual parts of the source sentence.
One of the weaknesses of such models is that the encoder states are computed only once at the beginning and are left untouched with respect to the target histories. In this case, at every decoding step, the same set of vectors is read repeatedly. Hence, the attention mechanism is limited in its ability to effectively model the coverage of the source sentence. By providing the encoder states with a greater capacity to remember what has been generated and what needs to be translated, we believe that we can alleviate coverage problems such as over- and under-translation.
One solution is to assimilate the context from both source and target sentences jointly and to align them in a two-dimensional grid. Two-dimensional LSTM (2DLSTM) is able to process data with complex interdependencies in a 2D space (Graves, 2012).
Following this idea, we propose a novel architecture based on the 2DLSTM unit, which enables the computation of the encoding of the source sentence as a function of the previously generated target words. We treat translation as a 2D mapping. One dimension processes the source sentence, and the other dimension generates the target words. Each time a target word is generated, its representation is used to compute a hidden state sequence that models the source sentence encoding. In principle, by updating the encoder states across the second dimension using the target history, the 2DLSTM captures the coverage concepts internally through its cell states.

Related Works
MDLSTM (Graves, 2008, 2012) has been successfully used in handwriting recognition (HWR) to automatically extract features from raw images, which are inherently two-dimensional (Graves and Schmidhuber, 2008; Leifert et al., 2016a; Voigtlaender et al., 2016). Voigtlaender et al. (2016) explore a larger MDLSTM for deeper and wider architectures using an implementation for the graphical processing unit (GPU). It has also been applied to automatic speech recognition (ASR), where a 2DLSTM scans the input over both time and frequency jointly (Sainath and Li, 2016). As an alternative architecture to the concept of MDLSTM, Kalchbrenner et al. (2015) propose a grid LSTM, a network of LSTM cells arranged in a multidimensional grid in which the cells communicate between layers as well as across time recurrences. Li et al. (2017) also apply the grid LSTM architecture to the endpoint detection task in ASR.
This work, for the first time, presents an end-to-end 2D neural model where we process the source and the target words jointly by a 2DLSTM layer.

Two-Dimensional LSTM
A 2DLSTM unit (Leifert et al., 2016b) processes 2D sequential data $x \in \mathbb{R}^{J \times I}$ of arbitrary lengths $J$ and $I$. At time step $(j, i)$, the computation of its cell depends on both the vertical hidden state $s_{j,i-1}$ and the horizontal hidden state $s_{j-1,i}$ (see Equations (1)-(5)). Similar to the LSTM cell, it maintains some state information in an internal cell state $c_{j,i}$. Besides the input gate $i_{j,i}$, the forget gate $f_{j,i}$ and the output gate $o_{j,i}$, which all control information flows, the 2DLSTM employs an extra lambda gate $\lambda_{j,i}$. As written in Equation 5, its activation is computed analogously to the other gates. The lambda gate is used to weight the two predecessor cells $c_{j-1,i}$ and $c_{j,i-1}$ before passing them through the forget gate (Equation 6). $g$ and $\sigma$ are the tanh and the sigmoid functions. The $V$s, $W$s and $U$s are the weight matrices.
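One formulation consistent with the description above is given below, with equation numbers chosen to match the references in the text. The per-gate weight indices ($W_1, \ldots, W_5$ and likewise for $U$ and $V$), the candidate $\tilde{c}_{j,i}$, the absence of bias terms and the hidden-state Equation (7) are assumptions made for this sketch.

```latex
\begin{align}
i_{j,i} &= \sigma\left(W_1 x_{j,i} + U_1 s_{j-1,i} + V_1 s_{j,i-1}\right) \tag{1}\\
f_{j,i} &= \sigma\left(W_2 x_{j,i} + U_2 s_{j-1,i} + V_2 s_{j,i-1}\right) \tag{2}\\
o_{j,i} &= \sigma\left(W_3 x_{j,i} + U_3 s_{j-1,i} + V_3 s_{j,i-1}\right) \tag{3}\\
\tilde{c}_{j,i} &= g\left(W_4 x_{j,i} + U_4 s_{j-1,i} + V_4 s_{j,i-1}\right) \tag{4}\\
\lambda_{j,i} &= \sigma\left(W_5 x_{j,i} + U_5 s_{j-1,i} + V_5 s_{j,i-1}\right) \tag{5}\\
c_{j,i} &= f_{j,i} \circ \left[\lambda_{j,i} \circ c_{j-1,i} + (1 - \lambda_{j,i}) \circ c_{j,i-1}\right] + \tilde{c}_{j,i} \circ i_{j,i} \tag{6}\\
s_{j,i} &= g\left(c_{j,i}\right) \circ o_{j,i} \tag{7}
\end{align}
```

Here $\circ$ denotes element-wise multiplication. With $\lambda_{j,i}$ close to 1 the cell relies mostly on the horizontal predecessor $c_{j-1,i}$, and with $\lambda_{j,i}$ close to 0 mostly on the vertical predecessor $c_{j,i-1}$.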
In order to train a 2DLSTM unit, backpropagation through time (BPTT) is performed over two dimensions (Graves, 2008, 2012). Thus, the gradient is passed backwards from the time step $(J, I)$ to the origin $(1, 1)$. More details, as well as the derivations of the gradients, can be found in (Graves, 2008).

Two-Dimensional Sequence to Sequence Model
We aim to apply a 2DLSTM to map the source and the target sequences into a 2D space as shown in Figure 2. We call this architecture the two-dimensional sequence to sequence (2D-seq2seq) model. Given a source sequence $x_1^J = x_1, \ldots, x_J$ and a target sequence $y_1^I = y_1, \ldots, y_I$, we scan the source sequence from left to right and the target sequence from bottom to top. In the 2D-seq2seq model, one dimension of the 2DLSTM (the horizontal axis in the figure) serves as the encoder and the other (the vertical axis) plays the role of the decoder. As a pre-step before the 2DLSTM, in order to have the whole source context, a bidirectional LSTM scans the input words once from left to right and once from right to left to compute a sequence of encoder states $h_1^J = h_1, \ldots, h_J$. At time step $(j, i)$, the 2DLSTM receives both the encoder state $h_j$ and the last target embedding vector $y_{i-1}$ as input. It repeatedly updates the source information $h_1^J$ while generating a new target word $y_i$. The state of the 2DLSTM is computed as follows:

$$s_{j,i} = \psi\left(W [h_j; y_{i-1}],\; U s_{j-1,i},\; V s_{j,i-1}\right)$$

where $\psi$ stands for the 2DLSTM as a function. At each decoder step, once the whole source sequence is processed from 1 to $J$, the last hidden state of the 2DLSTM, $s_{J,i}$, is used as the context vector. That is, at time step $i$, $t_i = s_{J,i}$. In order to generate the next target word $y_i$, a transformation followed by a softmax operation is applied. Therefore:

$$p_i\left(y_i = w \mid y_1^{i-1}, x_1^J\right) = \frac{\exp\left(W_o t_i\right)_w}{\sum_{v=1}^{|V_t|} \exp\left(W_o t_i\right)_v}$$

where $W_o$ and $|V_t|$ are the weight matrix and the target vocabulary size respectively.
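The grid recurrence and output layer described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' CUDA implementation: the stacked weight layout, the zero boundary states, the absence of biases and all function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(input_dim, hidden_dim, vocab_size, seed=0):
    """Random toy parameters (assumed layout: five stacked blocks per matrix)."""
    rng = np.random.default_rng(seed)
    return {
        "W": rng.normal(0, 0.1, (5 * hidden_dim, input_dim)),
        "U": rng.normal(0, 0.1, (5 * hidden_dim, hidden_dim)),
        "V": rng.normal(0, 0.1, (5 * hidden_dim, hidden_dim)),
        "W_o": rng.normal(0, 0.1, (vocab_size, hidden_dim)),
    }

def cell_step(p, x, s_left, c_left, s_below, c_below):
    """One 2DLSTM update at (j, i) from the (j-1, i) and (j, i-1) predecessors."""
    z = p["W"] @ x + p["U"] @ s_left + p["V"] @ s_below
    z_i, z_f, z_o, z_lam, z_c = np.split(z, 5)
    i_g, f_g, o_g, lam = sigmoid(z_i), sigmoid(z_f), sigmoid(z_o), sigmoid(z_lam)
    c_tilde = np.tanh(z_c)
    # The lambda gate mixes the two predecessor cells before the forget gate.
    c = f_g * (lam * c_left + (1.0 - lam) * c_below) + c_tilde * i_g
    s = np.tanh(c) * o_g
    return s, c

def forward(p, h, y_prev):
    """h: (J, d_h) encoder states; y_prev: (I, d_y) embeddings of y_0..y_{I-1}.

    Returns an (I, vocab_size) array of next-word distributions.
    """
    J, I = h.shape[0], y_prev.shape[0]
    n = p["U"].shape[1]
    # Index 0 along each axis holds the zero boundary state.
    s = np.zeros((J + 1, I + 1, n))
    c = np.zeros((J + 1, I + 1, n))
    probs = []
    for i in range(1, I + 1):          # target (decoder) dimension
        for j in range(1, J + 1):      # source (encoder) dimension
            x = np.concatenate([h[j - 1], y_prev[i - 1]])
            s[j, i], c[j, i] = cell_step(
                p, x, s[j - 1, i], c[j - 1, i], s[j, i - 1], c[j, i - 1]
            )
        t_i = s[J, i]                  # context vector: last state of row i
        logits = p["W_o"] @ t_i
        e = np.exp(logits - logits.max())
        probs.append(e / e.sum())
    return np.array(probs)

params = init_params(input_dim=6 + 4, hidden_dim=8, vocab_size=11)
rng = np.random.default_rng(1)
probs = forward(params, rng.normal(size=(5, 6)), rng.normal(size=(3, 4)))
print(probs.shape)  # (3, 11)
```

Every row of the grid re-reads all $J$ encoder states together with the current target history, which is what lets the source representation change from one target step to the next.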

Training versus Decoding
One practical concern is the difference between training and decoding. Since the whole target sequence is known during training, all states of the 2DLSTM can be computed once at the beginning. Slices of it can then be used during the forward and backward training passes. In theory, the complexity of training is $O(JI)$. In practice, however, the training computation can be optimally parallelized to take linear time (Voigtlaender et al., 2016). During decoding, only the already generated target words are available. Thus, either all 2DLSTM states have to be recomputed, or the grid has to be extended by an additional row at every time step $i$, which increases the complexity.
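One common way to obtain this kind of parallelism, assumed here for illustration rather than taken from the paper, is a wavefront schedule: cell $(j, i)$ only depends on $(j-1, i)$ and $(j, i-1)$, so all cells with the same $j + i$ can be updated at once. The function name below is hypothetical.

```python
def antidiagonals(J, I):
    """Group 1-based grid cells (j, i) by j + i.

    Cells within one group have no dependencies on each other, so each
    group can be processed as a single parallel step.
    """
    waves = []
    for d in range(2, J + I + 1):
        j_min, j_max = max(1, d - I), min(J, d - 1)
        waves.append([(j, d - j) for j in range(j_min, j_max + 1)])
    return waves

waves = antidiagonals(J=4, I=3)
print(len(waves))                       # 6 sequential steps (J + I - 1)
print(sum(len(w) for w in waves))       # 12 cell updates in total (J * I)
```

With enough parallel units, the number of sequential steps grows as $J + I - 1$ instead of $J \cdot I$.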

Experiments
We have conducted the experiments on the WMT 2017 German→English and English→German news tasks, consisting of 4.6M training samples collected from the well-known data sets Europarl-v7, News-Commentary-v10 and Common-Crawl. We use newstest2015 as our development set and newstest2016 and newstest2017 as our test sets, which contain 2169, 2999 and 3004 sentences respectively. No synthetic data and no additional features are used. Our goal is to keep the baseline model simple and standard in order to compare methods rather than to advance the state-of-the-art systems.
After tokenization and true-casing using the Moses toolkit (Koehn et al., 2007), byte pair encoding (BPE) is used jointly with 20k merge operations. We remove sentences longer than 50 subwords and batch the remaining ones together with a batch size of 50. All models are trained from scratch with the Adam optimizer (Kingma and Ba, 2014) and a dropout of 30% (Srivastava et al., 2014), and the norm of the gradient is clipped with a threshold of 1. The final models are the average of the 4 best checkpoints of a single run based on the perplexity on the development set (Junczys-Dowmunt et al., 2016). Decoding is performed using beam search of size 12, without an ensemble of various networks.
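The checkpoint averaging step can be sketched as follows. This is a minimal illustration that assumes a parameter-wise arithmetic mean and checkpoints stored as dictionaries of arrays; the function name and data layout are hypothetical.

```python
import numpy as np

def average_best_checkpoints(checkpoints, dev_perplexities, k=4):
    """Average the k checkpoints with the lowest development perplexity.

    checkpoints: list of dicts mapping parameter name -> np.ndarray
    dev_perplexities: one perplexity value per checkpoint
    """
    best_ids = np.argsort(dev_perplexities)[:k]
    best = [checkpoints[idx] for idx in best_ids]
    return {
        name: np.mean([ckpt[name] for ckpt in best], axis=0)
        for name in best[0]
    }

ckpts = [{"w": np.full(3, float(step))} for step in range(5)]
averaged = average_best_checkpoints(ckpts, [10.0, 9.0, 8.0, 7.0, 20.0])
print(averaged["w"])  # [1.5 1.5 1.5]
```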
We have used our in-house implementation of the NMT system, which relies on Theano (Bastien et al., 2012) and Blocks (Merriënboer et al., 2015). Our implementation of the 2DLSTM is based on CUDA code adapted from (Voigtlaender et al., 2016; Zeyer et al., 2018), which provides some speedup.
The models are evaluated using case-sensitive BLEU (Papineni et al., 2002) computed by mteval-v13a and case-sensitive TER (Snover et al., 2006) using tercom. We also report perplexities on the development set.
Attention Model: the attention-based sequence to sequence model (Bahdanau et al., 2014) is selected as our baseline. The model consists of a one-layer bidirectional encoder and a unidirectional decoder with an additive attention mechanism. All words are projected into a 500-dimensional embedding on both sides. To explore the performance of the models with respect to hidden size, we try LSTMs (Hochreiter and Schmidhuber, 1997) with both 500 and 1000 nodes.
2D-Seq2Seq Model: we apply the same embedding size as that of the attention model. The 2DLSTM, as well as the bidirectional LSTM [...] Table 1 in rows 1 and 2. As can be seen, for hidden size n = 500, the 2D-seq2seq model outperforms the standard attention model on average by 0.7% BLEU and 0.6% TER on De→En, and by 0.4% BLEU with no improvement in TER on En→De. The model is also superior for the larger hidden size (n = 1000), on average by 0.5% BLEU and 0.3% TER on De→En, and by 0.9% BLEU and 1.0% TER on En→De. In both cases, the perplexity of the 2D-seq2seq model is lower than that of the attention model.
The 2D-seq2seq topology is analogous to the bidirectional encoder-decoder model without attention. To examine whether the 2DLSTM reduces the need for attention, in the second set of experiments we equip our model with a weighted sum of 2DLSTM states over the $j$ positions to dynamically select the most relevant information. In other words, the context vector $t_i$ is computed as the sum of the 2DLSTM states $s_{j,i}$ weighted by $\gamma_{j,i}$, where $\gamma_{j,i}$ is the normalized weight over source positions and $W$ and $v$ are the weight matrices of the weighting layer. As the results in Table 1, rows 2 and 3, show, adding a weighting layer on top of the 2DLSTM layer does not help in terms of BLEU and rarely helps in TER.
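The weighting layer described above can be sketched as follows. The scoring form $v^\top \tanh(W s_{j,i})$ is an assumption borrowed from additive attention; the text only states that $W$ and $v$ are weights and that $\gamma_{j,i}$ is normalized over source positions. The function name is hypothetical.

```python
import numpy as np

def weighted_context(S, W, v):
    """Weighted sum of the 2DLSTM states of one target step.

    S: (J, n) states s_{1,i}, ..., s_{J,i}
    W: (a, n) and v: (a,) weights of the assumed scoring function
    Returns the context vector t_i (n,) and the weights gamma (J,).
    """
    scores = np.tanh(S @ W.T) @ v              # one score per source position
    gamma = np.exp(scores - scores.max())
    gamma /= gamma.sum()                       # normalize over j
    t_i = gamma @ S
    return t_i, gamma

rng = np.random.default_rng(0)
S = rng.normal(size=(5, 8))
t_i, gamma = weighted_context(S, rng.normal(size=(4, 8)), rng.normal(size=4))
print(t_i.shape, gamma.shape)  # (8,) (5,)
```

Without this layer, the model simply uses $t_i = s_{J,i}$, which corresponds to putting all weight on the last source position.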
By updating the encoder states across the second dimension with respect to the target history, the 2D-seq2seq model can internally indicate which source words have already been translated and where it should focus next. Therefore, it reduces the risk of over- and under-translation. To examine this assumption, we compare the 2D-seq2seq model with two NMT models in which concepts such as fertility and coverage have been addressed (Tu et al., 2016; Cohn et al., 2016).
Coverage Model: in the coverage model, we feed back the last alignments from time step $i-1$ to compute the attention weight at time step $i$. Therefore, we redefine the attention weight $\alpha_{i,j}$ as:

$$\alpha_{i,j} = a\left(s_{i-1}, h_j, \alpha_{i-1,j}\right)$$

where $a$ is an attention function followed by the softmax, and $h_j$ and $s_{i-1}$ are the encoder and the previous decoder states respectively. In our experiments, we use additive attention similar to (Bahdanau et al., 2014).
Fertility Model: in the fertility model, we feed back the sum of the alignments over the past decoder steps to indicate how much attention has been given to the source position $j$ up to step $i$, and divide it by the fertility of the source word at position $j$. This term depends on the encoder states and varies if the word is used in a different context (Tu et al., 2016):

$$\beta_{i,j} = \frac{1}{N \cdot \sigma\left(\upsilon_\phi \cdot h_j\right)} \sum_{k=1}^{i-1} \alpha_{k,j}, \qquad \alpha_{i,j} = a\left(s_{i-1}, h_j, \beta_{i,j}\right)$$

where $N$ specifies the maximum value for the fertility, which is set to 2 in our experiments, and $\upsilon_\phi$ is a weight vector.
As can be seen in Table 1, rows 2, 4 and 5, our proposed model is 0.3% BLEU ahead of and 0.3% TER worse than the fertility approach, and slightly better than the coverage one. Note that the fertility and coverage models were trained using an embedding size of 620.
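The coverage and fertility feedback described above can be sketched as follows. This is an illustration under assumptions, not the exact baseline implementation: the fertility is taken as $N \cdot \sigma(\upsilon_\phi \cdot h_j)$ in the style of Tu et al. (2016), the feedback enters the additive attention energy through an extra weight vector `w_f`, and all names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def additive_scores(s_prev, H, feedback, P):
    """Additive attention energies with one scalar feedback feature per position."""
    hidden = np.tanh(
        H @ P["W_h"].T + s_prev @ P["W_s"].T + np.outer(feedback, P["w_f"])
    )
    return hidden @ P["v"]

def coverage_weights(s_prev, H, alpha_prev, P):
    """Coverage model: feed back the alignments of step i - 1."""
    return softmax(additive_scores(s_prev, H, alpha_prev, P))

def fertility_weights(s_prev, H, alpha_history, P, N=2.0):
    """Fertility model: feed back accumulated alignments divided by fertility."""
    fertility = N * sigmoid(H @ P["v_phi"])         # (J,), values in (0, N)
    beta = alpha_history.sum(axis=0) / fertility    # (J,)
    return softmax(additive_scores(s_prev, H, beta, P))

rng = np.random.default_rng(0)
J, d_h, d_s, a = 5, 6, 7, 4
P = {
    "W_h": rng.normal(size=(a, d_h)), "W_s": rng.normal(size=(a, d_s)),
    "w_f": rng.normal(size=a), "v": rng.normal(size=a),
    "v_phi": rng.normal(size=d_h),
}
H, s_prev = rng.normal(size=(J, d_h)), rng.normal(size=d_s)
alpha_1 = coverage_weights(s_prev, H, np.full(J, 1.0 / J), P)
alpha_2 = fertility_weights(s_prev, H, alpha_1[None, :], P)
print(alpha_1.shape, alpha_2.shape)  # (5,) (5,)
```

Both baselines keep the encoder states $h_j$ fixed and pass the translation history to the attention function explicitly, whereas the 2D-seq2seq model carries it in the 2DLSTM cell states.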
We have also qualitatively examined the coverage issue in Table 2 by showing an example from the test set. Without the knowledge of which source words have already been translated, the attention layer is at risk of attending to the same positions multiple times. This could lead to over-translation. Similarly, under-translation could occur when the attention model rarely focuses on the corresponding source positions. As shown in the example, the 2DLSTM can internally track which source positions have already contributed to the target generation.
Speed: we have also compared the models in terms of speed when training on a single GPU. The training and decoding speeds of the 2D-seq2seq model are 791 and 0.7 words/s respectively, compared to 2944 and 48 words/s for the standard attention model. The computation of the added weighting mechanism is negligible in this case. This is still an initial architecture, and the gap indicates the necessity of multi-GPU usage. We also expect to speed up the decoding phase by avoiding the unnecessary recomputation of previous 2DLSTM states. In the current implementation, at each target step, we recompute the 2DLSTM states from time step 0 to $i-1$, while we only need to store the states from the last step $i-1$. This does not influence our results, as it is purely an implementation issue, not an algorithmic one. However, decoding will still be slower than training. One suggestion for further speeding up the training phase is to apply truncated BPTT in both directions to reduce the number of updates.
The 2DLSTM can be simply combined with self-attention layers (Vaswani et al., 2017) in the encoder and the decoder for better context representation, as well as with RNMT+ (Chen et al., 2018), which is composed of standard LSTMs. We believe that the 2D-seq2seq model can potentially be applied to other applications where sequence to sequence modeling is helpful.
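The cost of recomputing earlier rows during decoding can be made concrete with a small count of 2DLSTM cell updates. This is a back-of-the-envelope sketch that ignores beam search and the bidirectional encoder; the function names are hypothetical.

```python
def cell_updates_recompute(J, I):
    """Step i rebuilds rows 1..i from scratch: J * (1 + 2 + ... + I) updates."""
    return J * I * (I + 1) // 2

def cell_updates_cached(J, I):
    """Step i reuses the stored row i - 1 and adds one new row of J cells."""
    return J * I

print(cell_updates_recompute(30, 30))  # 13950
print(cell_updates_cached(30, 30))     # 900
```

For a sentence pair with 30 source and 30 target positions, caching the previous row would cut the number of cell updates by a factor of $(I + 1) / 2 = 15.5$ under these assumptions.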

Conclusion and Future Works
We have introduced a novel 2D sequence to sequence model (2D-seq2seq), a network that applies a 2DLSTM unit to read both the source and the target sentences jointly. Hence, in each decoding step, the network implicitly updates the source representation conditioned on the target words generated so far. The experimental results show that we outperform the attention model on two WMT 2017 translation tasks. We have also shown that our model implicitly handles the coverage issue.
As future work, we aim to develop a bidirectional 2DLSTM and consider stacking 2DLSTMs for a deeper model. We consider the results promising and plan to try more language pairs and fine-tune the hyperparameters.