Deep Architectures for Neural Machine Translation

It has been shown that increasing model depth improves the quality of neural machine translation. However, different architectural variants to increase model depth have been proposed, and so far, there has been no thorough comparative study. In this work, we describe and evaluate several existing approaches to introduce depth in neural machine translation. Additionally, we explore novel architectural variants, including deep transition RNNs, and we vary how attention is used in the deep decoder. We introduce a novel"BiDeep"RNN architecture that combines deep transition RNNs and stacked RNNs. Our evaluation is carried out on the English to German WMT news translation dataset, using a single-GPU machine for both training and inference. We find that several of our proposed architectures improve upon existing approaches in terms of speed and translation quality. We obtain best improvements with a BiDeep RNN of combined depth 8, obtaining an average improvement of 1.5 BLEU over a strong shallow baseline. We release our code for ease of adoption.


Introduction
Neural machine translation (NMT) is a wellestablished approach that yields the best results on most language pairs (Bojar et al., 2016;Cettolo et al., 2016). Most systems are based on the sequence-to-sequence model with attention (Bahdanau et al., 2015) which employs single-layer recurrent neural networks both in the encoder and in the decoder.
Unlike feed-forward networks where depth is straightforwardly defined as the number of noninput layers, recurrent neural network architectures with multiple layers allow different connection schemes (Pascanu et al., 2014) that give rise to different, orthogonal, definitions of depth  which can affect the model performance depending on a given task. This is further complicated in sequence-to-sequence models as they contain multiple sub-networks, recurrent or feed-forward, each of which can be deep in different ways, giving rise to a large number of possible configurations.
In this work we focus on stacked and deep transition recurrent architectures as defined by Pascanu et al. (2014). Different types of stacked architectures have been successfully used for NMT (Zhou et al., 2016;. However, there is a lack of empirical comparisons of different deep architectures. Deep transition architectures have been successfully used for language modeling (Zilly et al., 2016), but not for NMT so far. We evaluate these architectures, both alone and in combination, varying the connection scheme between the different components and their depth over the different dimensions, measuring the performance of the different configurations on the WMT news translation task. 1 Related work includes that of Britz et al. (2017), who have performed an exploration of NMT architectures in parallel to our work. Their experiments, which are largely orthogonal to ours, focus on embedding size, RNN cell type (GRU vs. LSTM), network depth (defined according to the architecture of ), attention mechanism and beam size. Gehring et al. (2017) recently proposed a NMT architecture based on convolutions over fixed-sized windows rather than RNNs, and they reported results for different model depths and attention mechanism configurations. A similar feedforward architecture which uses multiple pervasive attention mechanisms rather than convolutions was proposed by Vaswani et al. (2017), who also report results for different model depths.

NMT Architectures
All the architectures that we consider in this work are GRU (Cho et al., 2014a) sequence-to-sequence transducers (Sutskever et al., 2014;Cho et al., 2014b) with attention (Bahdanau et al., 2015). In this section we describe the baseline system and the variants that we evaluated.

Baseline Architecture
As our baseline, we use the NMT architecture implemented in Nematus, which is described in more depth by Sennrich et al. (2017b). We augment it with layer normalization (Ba et al., 2016), which we have found to both improve translation quality and make training considerably faster.
For our discussion, it is relevant that the baseline architecture already exhibits two types of depth: • recurrence transition depth in the decoder RNN which consists of two GRU transitions per output word with an attention mechanism in between, as described in Firat and Cho (2016).
• feed-forward depth in the attention network that computes the alignment scores and in the output network that predicts the target words. Both these networks are multi-layer perceptrons with one tanh hidden layer.

Deep Transition Architectures
In a deep transition RNN (DT-RNN), at each time step the next state is computed by the sequential application of multiple transition layers, effectively using a feed-forward network embedded inside the recurrent cell. In our experiments, these layers are GRU transition blocks with independently trainable parameters, connected such that the "state" output of one of them is used as the "state" input of the next one. Note that each of these GRU transition is not individually recurrent, recurrence only occurs at the level of the whole multi-layer cell, as the "state" output of the last  GRU transition for the current time step is carried over as the "state" input of the first GRU transition for the next time step. Applying this architecture to NMT is a novel contribution.

Deep Transition Encoder
As in a baseline shallow Nematus system, the encoder is a bidirectional recurrent neural network. Let L s be the encoder recurrence depth, then for the i-th source word in the forward direction the forward source word state where the input to the first GRU transition is the word embedding x i , while the other GRU transitions have no external inputs. Recurrence occurs as the previous word state − → h i−1,Ls enters the computation in the first GRU transition for the current word. The reverse source word states are computed similarly and concatenated to the forward ones to form the bidirectional source word states

Deep Transition Decoder
The deep transition decoder is obtained by extending the baseline decoder in a similar way. Recall that the baseline decoder of Nematus already has a transition depth of two, with the first GRU transition receiving as input the embedding of the previous target word and the second GRU transition receiving as input a context vector computed by the attention mechanism. We extend this decoder architecture to an arbitrary transition depth L t as follows: s j,1 = GRU 1 (y j−1 , s j−1,Lt ) s j,2 = GRU 2 (ATT(C, s j,1 ), s j,1 ) where y j−1 is the embedding of the previous target word and ATT(C, s i,1 ) is the context vector computed by the attention mechanism. GRU transitions other than the first two do not have external inputs. The target word state vector s j ≡ s j,Lt is then used by the feed-forward output network to predict the current target word. A diagram of this architecture is shown in Figure 1. The output network can be also made deeper by adding more feed-forward hidden layers.

Stacked architectures
A stacked RNN is obtained by having multiple RNNs (GRUs in our experiments) run for the same number of time steps, connected such that at each step the bottom RNN takes "external" inputs from the outside, while each of the higher RNN takes as its "external" input the "state" output of the one below it. Residual connections between states at different depth (He et al., 2016) are also used to improve information flow. Note that unlike deep transition GRUs, here each GRU transition block constitutes a cell that is individually recurrent, as it has its own state that is carried over between time steps.

Stacked Encoder
In this work we consider two types of bidirectional stacked encoders: an architecture similar to Zhou et al. (2016) which we denote here as alternating encoder (Figure 2), and one similar to  which we denote as biunidirectional encoder ( Figure 3).
Our contribution is the empirical comparison of these architectures, both in isolation and in combination with the deep transition architecture.

Alternating Stacked Encoder
The forward part of the encoder consists of a stack of GRU recurrent neural networks, the first one processing words in  the forward direction, the second one in the backward direction, and so on, in alternating directions. For an encoder stack depth D s , and a source sentence length N , the forward source word state where we assume that − → h 0,k and − → h N +1,k are zero vectors. Note the residual connections: at each level above the first one, the word state of the previous level − → w i,j−1 is added to the recurrent state of the GRU cell − → h i,j to compute the the word state for the current level − → w i,j .
The backward part of the encoder has the same structure, except that the first level of the stack processes the words in the backward direction and the subsequent levels alternate directions.
The forward and backward word states are then concatenated to form bidirectional word states Figure 2.
Biunidirectional Stacked Encoder In this encoder the forward and backward parts are shallow, as in the baseline architecture. Their word states are concatenated to form shallow bidirectional word states ] that are then used as inputs for subsequent stacked GRUs which operate only in the forward sentence direction, hence the name "biunidirectional". Since residual connections are also present, the higher depth   GRUs have a state size twice that of the base ones. This architecture has shorter maximum information propagation paths than the alternating encoder, suggesting that it may be less expressive, but it has the advantage of enabling implementations with higher model parallelism. A diagram of this architecture is shown in Figure 3.
In principle, alternating and biunidirectional stacked encoders can be combined by having D sa alternating layers followed by D sb unidirectional layers.

Stacked Decoder
A stacked decoder can be obtained by stacking RNNs which operate in the forward sentence direction. A diagram of this architecture is shown in Figure 4.
Note that the base RNN is always a conditional GRU (cGRU, Firat and Cho, 2016) which has transition depth at least two due to the way that the context vectors generated by the attention mechanism are used in Nematus. This opens up the possibility of several architectural variants which we evaluated in this work: Stacked GRU The higher RNNs are simple GRUs which receive as input the state from the previous level of the stack, with residual connec-tions between the levels. s j,1,1 = GRU 1,1 (y j−1 , s j−1,1,2 ) c j,1 = ATT(C, s j,1,1 ) s j,1,2 = GRU 1,2 (c j,1 , s j,1,1 ) r j,1 = s j,1,2 s j,k,1 = GRU k (r j,k−1 , s j−1,k,1 ) Note that the higher levels have transition depth one, unlike the base level which has two.
Stacked rGRU The higher RNNs are GRUs whose "external" input is the concatenation of the state below and the context vector from the base RNN. Formally, the states s j,k,1 of the higher RNNs are computed as: This is similar to the deep decoder by .
Stacked cGRU The higher RNNs are conditional GRUs, each with an independent attention mechanism. Each level has two GRU transitions per step j, with a new context vector c j,k computed in between: s j,k,1 = GRU k,1 (r j,k−1 , s j−1,k,1 ) c j,k = ATT(C, s j,k,1 ) s j,k,2 = GRU k,2 (c j,k , s j,1,1 ) for 1 < k ≤ D t Note that unlike the stacked GRU and rGRU, the higher levels have transition depth two.
Stacked crGRU The higher RNNs are conditional GRUs but they reuse the context vectors from the base RNN. Like the cGRU there are two GRU transition per step, but they reuse the context vector c j,1 computed at the first level of the stack: s j,k,1 = GRU k,1 (r j,k−1 , s j−1,k,1 ) s j,k,2 = GRU k,2 (c j,1 , s j,1,1 ) for 1 < k ≤ D t

BiDeep architectures
We introduce the BiDeep RNN, a novel architecture obtained by combining deep transitions with stacking. A BiDeep encoder is obtained by replacing the D s individually recurrent GRU cells of a stacked encoder with multi-layer deep transition cells each composed by L s GRU transition blocks.
For instance, the BiDeep alternating encoder is defined as follows: It is also possible to have different transition depths at each stacking level.

Experiments
All experiments were performed with Nematus (Sennrich et al., 2017b), following Sennrich et al. (2017a) in their choice of preprocessing and hyperparameters. For experiments with deep models, we increase the depth by a factor of 4 compared to the baseline for most experiments; in preliminary experiments, we observed diminishing returns for deeper models.
We trained on the parallel English-German training data of WMT-2017 news translation task, using newstest2013 as validation set. We used early-stopping on the validation cross-entropy and selected the best model based on validation BLEU.
We report cross-entropy (CE) on newstest2013, training speed (on a single Titan X (Pascal) GPU), and the number of parameters. For translation quality, we report case-sensitive, detokenized BLEU, measured with mteval-v13a.pl, on new-stest2014, newstest2015, and newstest2016.
We release the code under an open source license, including it in the official Nematus repository. 2 The configuration files needed to replicate our experiments are available in a separate repository. 3

Layer Normalization
Our first experiment is concerned with layer normalization. We are interested to see how essential layer normalization is for our deep architectures, and compare the effect of layer normalization on a baseline system, and a system with an alternating encoder with stacked depth 4. Results are shown in Table 1. We find that layer normalization is similarly effective for both the shallow baseline model and the deep encoder, yielding an average improvement of 0.8-1 BLEU, and reducing training time substantially. Therefore we use it for all the subsequent experiments.

Deep Encoders
In Table 2 we report experimental results for different architectures of deep encoders, while the decoder is kept shallow.
We find that all the deep encoders perform substantially better than baseline (+0.5-+1.2 BLEU), with no consistent quality differences between each other. In terms of number of parameters and training speed, the deep transition encoder performs best, followed by the alternating stacked encoder and finally the biunidirectional encoder (note that we trained on a single GPU, the biunidirectional encoder may be comparatively faster on multiple GPUs due to its higher model parallelism).   has already some depth), stacked GRU performs similarly to the baseline (-0.1-+0.2 BLEU) and stacked rGRU possibly slightly better (+0.1-+0.2 BLEU).

Deep Decoders
Other deep RNN decoders achieve higher gains. The best results (+0.6 BLEU on average) are achieved by the stacked conditional GRU with independent multi-step attention (cGRU). This decoder, however, is the slowest one and has the most parameters.
The deep transition decoder performs well (+0.5 BLEU on average) in terms of quality and is the fastest and smallest of all the deep decoders that have shown quality improvements.
The stacked conditional GRU with reused attention (crGRU) achieves smaller improvements (+0.3 BLEU on average) and has speed and size intermediate between the deep transition and stacked cGRU decoders. Table 4 shows results for models where both the encoder and the decoder are deep, in addition to the results of the best deep encoder (the deep transition encoder) + shallow decoder reported here for ease of comparison.

Deep Encoders and Decoders
Compared to deep transition encoder alone, we generally see improvements in cross-entropy, but not in BLEU. We evaluate architectures similar to Zhou et al. (2016) (alternating encoder + stacked GRU decoder) and ) (biunidirectional encoder + stacked rGRU decoder), though they are not straight replications since we used GRU cells rather than LSTMs and the implementation details are different. We find that the former architecture performs better in terms of BLEU scores, model size and training speed.
The other variants of alternating encoder + stacked or deep transition decoder perform similarly to alternating encoder + stacked rGRU decoder, but do not improve BLEU scores over the best deep encoder with shallow decoder. Applying the BiDeep architecture while keeping the total depth the same yields small improvements over the best deep encoder (+0.2 BLEU on average), while the improvement in cross-entropy is stronger. We conjecture that deep decoders may be better at handling subtle target-side linguistic phenomena that are not well captured by the 4-gram precision-based BLEU evaluation.
Finally, we evaluate a subset of architectures with a combined depth that is 8 times that of the baseline. Among the large models, the BiDeep model yields substantial improvements (average +0.6 BLEU over the best deep encoder, +1.5 BLEU over the shallow baseline), in addition to cross-entropy improvements. The stacked-only model, on the other hand, performs similarly to the smaller models, despite having even more parameters than the BiDeep model. This shows that it is useful to combine deep transitions with stacking, as they provide two orthogonal kinds of depth that are both beneficial for neural machine translation.

Error Analysis
One theoretical difference between a stacked RNN and a deep transition RNN is that the distance in the computation graph between timesteps is increased for deep transition RNNs. While this allows for arguably more expressive computations to be represented, in principle it could reduce the ability to remember information over long dis-   tances, since each layer may lose information during forward computation or backpropagation. This may not be a significant issue in the encoder, as the attention mechanism provides short paths from any source word state to the decoder, but the decoder contains no such shortcuts between its states, therefore it might be possible that this negatively affects its ability to model long-distance relationships in the target text, such as subject-verb agreement.
Here, we seek to answer this question by testing our models on Lingeval97 (Sennrich, 2017), a test set which provides contrastive translation pairs for different types of errors. For the example of subject-verb agreement, contrastive translations are created from a reference translation by changing the grammatical number of the verb, and we can measure how often the NMT model prefers the correct reference over the contrastive variant.
In Figure 5, we show accuracy as a function of the distance between subject and verb. We find that information is successfully passed over long distances by the deep recurrent transition network. Even for decisions that require information to be carried over 16 or more words, or at least 128 GRU transitions 5 , the deep recurrent transition network 5 some decisions may not require the information to be passed on the target side because the decisions may be possi- achieves an accuracy of over 92.5% (N = 560), higher than the shallow decoder (91.6%), and similar to the stacked GRU (92.7%). The highest accuracy (94.3%) is achieved by the BiDeep network.

Conclusions
In this work we presented and evaluated multiple architectures to increase the model depth of neural machine translation systems.
We showed that alternating stacked encoders (Zhou et al., 2016) outperform biunidirectional ble based on source-side information.
stacked encoders , both in accuracy and (single-GPU) speed. We showed that deep transition architectures, which we first applied to NMT, perform comparably to the stacked ones in terms of accuracy (BLEU, cross-entropy and long-distance syntactic agreement), and better in terms of speed and number of parameters.
We found that depth improves BLEU scores especially in the encoder. Decoder depth, however, still improves cross-entropy if not strongly BLEU scores.
The best results are obtained by our BiDeep architecture which combines both stacked depth and transition depth in both the (alternating) encoder and the decoder, yielding better accuracy for the same number of parameters than systems with only one kind of depth.
We recommend to use combined architectures when maximum accuracy is the goal, or use deep transition architectures when speed or model size are a concern, as deep transition performs very positively in the quality/speed and quality/size trade-off.
While this paper only reports results for one translation direction, the effectiveness of the presented architectures across different data conditions and language pairs was confirmed in followup work. For the shared news translation task of this year's Conference on Machine Translation (WMT17), we built deep models for 12 translation directions, using a deep transition architecture or a stacked architecture (alternating encoder and rGRU decoder), and observe improvements for the majority of translation directions (Sennrich et al., 2017a).