Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention

The general trend in NLP is towards increasing model capacity and performance via deeper neural networks. However, simply stacking more layers of the popular Transformer architecture for machine translation results in poor convergence and high computational overhead. Our empirical analysis suggests that convergence is poor due to gradient vanishing caused by the interaction between residual connection and layer normalization. We propose depth-scaled initialization (DS-Init), which decreases parameter variance at the initialization stage, and reduces output variance of residual connections so as to ease gradient back-propagation through normalization layers. To address computational cost, we propose a merged attention sublayer (MAtt) which combines a simplified average-based self-attention sublayer and the encoder-decoder attention sublayer on the decoder side. Results on WMT and IWSLT translation tasks with five translation directions show that deep Transformers with DS-Init and MAtt can substantially outperform their base counterpart in terms of BLEU (+1.1 BLEU on average for 12-layer models), while matching the decoding speed of the baseline model thanks to the efficiency improvements of MAtt. Source code for reproduction will be released soon.


Introduction
The capability of deep neural models of handling complex dependencies has benefited various artificial intelligence tasks, such as image recognition where test error was reduced by scaling VGG nets (Simonyan and Zisserman, 2015) up to hundreds of convolutional layers (He et al., 2015). In NLP, deep self-attention networks have enabled large-scale pretrained language models such as BERT (Devlin et al., 2019) and GPT (Radford et al., 2018) to boost state-of-the-art (SOTA) performance on downstream applications. By contrast, though neural machine translation (NMT) gained encouraging improvement when shifting from a shallow architecture (Bahdanau et al., 2015) to deeper ones (Zhou et al., 2016; Wu et al., 2016), the Transformer (Vaswani et al., 2017), a currently SOTA architecture, achieves best results with merely 6 encoder and decoder layers, and no gains were reported by Vaswani et al. (2017) from further increasing its depth on standard datasets.
Figure 1: Gradient norm of each encoder layer (top) and decoder layer (bottom) in Transformer with respect to layer depth (x-axis). Gradients are estimated with ∼3k target tokens at the beginning of training. "DS-Init": the proposed depth-scaled initialization. "6L": 6 layers. Solid lines indicate the vanilla Transformer, and dashed lines denote our proposed method. During back-propagation, gradients in Transformer gradually vanish from high layers to low layers.
Footnote 1: Source code for reproduction is available at https://github.com/bzhangGo/zero
We start by analysing why the Transformer does not scale well to larger model depth. We find that the architecture suffers from gradient vanishing as shown in Figure 1, leading to poor convergence. An in-depth analysis reveals that the Transformer is not norm-preserving due to the involvement of and the interaction between residual connection (RC) (He et al., 2015) and layer normalization (LN) (Ba et al., 2016).
To address this issue, we propose depth-scaled initialization (DS-Init) to improve norm preservation. We ascribe the gradient vanishing to the large output variance of RC and resort to strategies that could reduce it without model structure adjustment. Concretely, DS-Init scales down the variance of parameters in the l-th layer with a discount factor of $1/\sqrt{l}$ at the initialization stage alone, where l denotes the layer depth starting from 1. The intuition is that parameters with small variance in upper layers would narrow the output variance of corresponding RCs, improving norm preservation as shown by the dashed lines in Figure 1. In this way, DS-Init enables the convergence of deep Transformer models to satisfactory local optima.
Another bottleneck for deep Transformers is the increase in computational cost for both training and decoding. To combat this, we propose a merged attention network (MAtt). MAtt simplifies the decoder by replacing the separate self-attention and encoder-decoder attention sublayers with a new sublayer that combines an efficient variant of average-based self-attention (AAN) and the encoder-decoder attention. We simplify the AAN by reducing the number of linear transformations, reducing both the number of model parameters and computational cost. The merged sublayer benefits from parallel calculation of (average-based) self-attention and encoder-decoder attention, and reduces the depth of each decoder block.
We conduct extensive experiments on WMT and IWSLT translation tasks, covering five translation tasks with varying data conditions and translation directions. Our results show that deep Transformers with DS-Init and MAtt can substantially outperform their base counterpart in terms of BLEU (+1.1 BLEU on average for 12-layer models), while matching the decoding speed of the baseline model thanks to the efficiency improvements of MAtt.
Our contributions are summarized as follows:
• We analyze the vanishing gradient issue in the Transformer, and identify the interaction of residual connections and layer normalization as its source.
• To address this problem, we introduce depth-scaled initialization (DS-Init).
• To reduce the computational cost of training deep Transformers, we introduce a merged attention model (MAtt). MAtt combines a simplified average-attention model and the encoder-decoder attention into a single sublayer, allowing for parallel computation.
• We conduct extensive experiments and verify that deep Transformers with DS-Init and MAtt improve translation quality while preserving decoding efficiency.

Related Work
Our work aims at improving translation quality by increasing model depth. Compared with the single-layer NMT system (Bahdanau et al., 2015), deep NMT models are typically more capable of handling complex language variations and translation relationships via stacking multiple encoder and decoder layers (Zhou et al., 2016; Wu et al., 2016; Britz et al., 2017), and/or multiple attention layers. One common problem for the training of deep neural models is vanishing or exploding gradients. Existing methods mainly focus on developing novel network architectures so as to stabilize gradient back-propagation, such as the fast-forward connection (Zhou et al., 2016), the linear associative unit (Wang et al., 2017), or gated recurrent network variants (Hochreiter and Schmidhuber, 1997; Gers and Schmidhuber, 2001; Cho et al., 2014; Di Gangi and Federico, 2018). In contrast to the above recurrent network based NMT models, recent work focuses on feed-forward alternatives with smoother gradient flow, such as convolutional networks (Gehring et al., 2017) and self-attention networks (Vaswani et al., 2017).
The Transformer represents the current SOTA in NMT. It heavily relies on the combination of residual connections (He et al., 2015) and layer normalization (Ba et al., 2016) for convergence. Nevertheless, simply extending this model with more layers results in gradient vanishing due to the interaction of RC and LN (see Section 4). Recent work has proposed methods to train deeper Transformer models, including a rescheduling of RC and LN, the transparent attention model and the stochastic residual connection (Pham et al., 2019). In contrast to these works, we identify the large output variance of RC as the source of gradient vanishing, and employ scaled initialization to mitigate it without any structure adjustment. The effect of careful initialization on boosting convergence was also investigated and verified in previous work (Zhang et al., 2019; Child et al., 2019; Devlin et al., 2019; Radford et al., 2018).
The merged attention network falls into the category of simplifying the Transformer so as to shorten training and/or decoding time. Methods to improve the Transformer's running efficiency range from algorithmic improvements (Junczys-Dowmunt et al., 2018) and non-autoregressive translation (Gu et al., 2018; Ghazvininejad et al., 2019) to decoding dependency reduction such as the average attention network and blockwise parallel decoding (Stern et al., 2018). Our MAtt builds upon the AAN model, further simplifying the model by reducing the number of linear transformations, and combining it with the encoder-decoder attention. In work concurrent to ours, So et al. (2019) propose the evolved Transformer which, based on automatic architecture search, also discovered a parallel structure of self-attention and encoder-decoder attention.

Background: Transformer
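Given a source sequence $X = \{x_1, x_2, \ldots, x_n\} \in \mathbb{R}^{n \times d}$, the Transformer predicts a target sequence $Y = \{y_1, y_2, \ldots, y_m\}$ under the encoder-decoder framework. Both the encoder and the decoder in the Transformer are composed of attention networks, functioning as follows:
$$\mathrm{ATT}(Z_x, Z_y) = \left[\mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V\right] W_o, \qquad Q, K, V = Z_x W_q,\; Z_y W_k,\; Z_y W_v \tag{1}$$
where $Z_x \in \mathbb{R}^{I \times d}$ and $Z_y \in \mathbb{R}^{J \times d}$ are input sequence representations of length $I$ and $J$ respectively, and $W_* \in \mathbb{R}^{d \times d}$ denote weight parameters. The attention network can be further enhanced with multi-head attention (Vaswani et al., 2017).
Formally, the encoder stacks $L$ identical layers, each including a self-attention sublayer (Eq. 2) and a point-wise feed-forward sublayer (Eq. 3):
$$\bar{H}^l = \mathrm{LN}\left(\mathrm{RC}\left(H^{l-1}, \mathrm{ATT}(H^{l-1}, H^{l-1})\right)\right) \tag{2}$$
$$H^l = \mathrm{LN}\left(\mathrm{RC}\left(\bar{H}^l, \mathrm{FFN}(\bar{H}^l)\right)\right) \tag{3}$$
$H^l \in \mathbb{R}^{n \times d}$ denotes the sequence representation of the $l$-th encoder layer. Input to the first layer $H^0$ is the element-wise addition of the source word embedding $X$ and the corresponding positional encoding. $\mathrm{FFN}(\cdot)$ is a two-layer feed-forward network with a large intermediate representation and ReLU activation function. Each encoder sublayer is wrapped with a residual connection (Eq. 4), followed by layer normalization (Eq. 5):
$$\mathrm{RC}(z, z') = z + z' \tag{4}$$
$$\mathrm{LN}(z) = \frac{z - \mu}{\sigma} \odot g + b \tag{5}$$
where $z$ and $z'$ are input vectors, and $\odot$ indicates element-wise multiplication. $\mu$ and $\sigma$ denote the mean and standard deviation statistics of vector $z$. The normalized $z$ is then re-scaled and re-centered by trainable parameters $g$ and $b$ respectively.
The decoder also consists of $L$ identical layers, each of which extends the encoder sublayers with an encoder-decoder attention sublayer (Eq. 7) to capture translation alignment from target words to relevant source words:
$$\bar{S}^l = \mathrm{LN}\left(\mathrm{RC}\left(S^{l-1}, \mathrm{ATT}(S^{l-1}, S^{l-1})\right)\right) \tag{6}$$
$$\tilde{S}^l = \mathrm{LN}\left(\mathrm{RC}\left(\bar{S}^l, \mathrm{ATT}(\bar{S}^l, H^L)\right)\right) \tag{7}$$
$$S^l = \mathrm{LN}\left(\mathrm{RC}\left(\tilde{S}^l, \mathrm{FFN}(\tilde{S}^l)\right)\right) \tag{8}$$
$S^l \in \mathbb{R}^{m \times d}$ is the sequence representation of the $l$-th decoder layer. Input $S^0$ is defined similarly to $H^0$. To ensure auto-regressive decoding, the attention weights in Eq. 6 are masked to prevent attention to future target tokens.
The Transformer's parameters are typically initialized by sampling from a uniform distribution:
$$W \in \mathbb{R}^{d_i \times d_o} \sim \mathcal{U}(-\gamma, \gamma), \qquad \gamma = \sqrt{\frac{6}{d_i + d_o}} \tag{9}$$
where $d_i$ and $d_o$ indicate the input and output dimension respectively. This initialization has the advantage of maintaining the variance of activations and of back-propagated gradients, and can help train deep neural networks (Glorot and Bengio, 2010).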
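The building blocks described in this section (the residual connection, layer normalization and the uniform initialization of Glorot and Bengio, 2010) can be illustrated with a minimal NumPy sketch. It is not taken from the released code: the function names, the `eps` stabilizer and the toy linear map standing in for a sublayer are assumptions made for illustration only.

```python
import numpy as np

def glorot_uniform(d_in, d_out, rng):
    """Sample W ~ U(-gamma, gamma) with gamma = sqrt(6 / (d_in + d_out))."""
    gamma = np.sqrt(6.0 / (d_in + d_out))
    return rng.uniform(-gamma, gamma, size=(d_in, d_out))

def residual_connection(z, z_prime):
    """RC(z, z') = z + z'."""
    return z + z_prime

def layer_norm(z, g, b, eps=1e-6):
    """Normalize the last axis of z, then re-scale by g and re-center by b.

    eps is an assumed numerical stabilizer, not part of the formula in the text.
    """
    mu = z.mean(axis=-1, keepdims=True)
    sigma = z.std(axis=-1, keepdims=True)
    return (z - mu) / (sigma + eps) * g + b

def post_norm_sublayer(z, f, g, b):
    """A sublayer f wrapped as LN(RC(z, f(z)))."""
    return layer_norm(residual_connection(z, f(z)), g, b)

rng = np.random.default_rng(0)
d = 512
W = glorot_uniform(d, d, rng)            # toy stand-in for a sublayer's weights
z = rng.standard_normal((4, d))          # 4 positions, model size d
out = post_norm_sublayer(z, lambda x: x @ W, g=np.ones(d), b=np.zeros(d))

# The empirical parameter variance should be close to gamma^2 / 3 = 2 / (d_in + d_out).
print(out.shape, float(W.var()), 2.0 / (d + d))
```

With `g = 1` and `b = 0`, every output row has zero mean and unit standard deviation regardless of the scale of `z + f(z)`, which is the normalization step the gradient analysis refers to.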
One natural way to deepen Transformer is simply enlarging the layer number L. Unfortunately, Figure 1 shows that this would give rise to gradient vanishing on both the encoder and the decoder at the lower layers, and that the case on the decoder side is worse. We identified a structural problem in the Transformer architecture that gives rise to this issue, namely the interaction of RC and LN, which we will here discuss in more detail.
Given an input vector $z \in \mathbb{R}^d$, let us consider the general structure of RC followed by LN:
$$r = \mathrm{RC}(z, f(z)) \tag{10}$$
$$o = \mathrm{LN}(r) \tag{11}$$
where $f(\cdot)$ represents any neural network, such as a recurrent, convolutional or attention network. Suppose during back-propagation, the error signal at the output of LN is $\delta_o$. Contributions of RC and LN to the error signal are as follows:
$$\delta_r = \frac{\partial o}{\partial r}\,\delta_o = \mathrm{diag}\!\left(\frac{g}{\sigma_r}\right)\left(I - \frac{\mathbf{1}\mathbf{1}^{\top} + \bar{r}\bar{r}^{\top}}{d}\right)\delta_o \tag{12}$$
$$\delta_z = \frac{\partial r}{\partial z}\,\delta_r = \left(I + \frac{\partial f}{\partial z}\right)\delta_r \tag{13}$$
where $\bar{r}$ denotes the normalized input, $\sigma_r$ is the standard deviation of $r$, and $\mathbf{1}$ is the all-ones vector. $I$ is the identity matrix and $\mathrm{diag}(\cdot)$ establishes a diagonal matrix from its input. The resulting $\delta_r$ and $\delta_z$ are the error signals arriving at $r$ and $z$ respectively. We define the change of error signal as follows:
$$\beta_{\mathrm{LN}} = \frac{\|\delta_r\|_2}{\|\delta_o\|_2}, \qquad \beta_{\mathrm{RC}} = \frac{\|\delta_z\|_2}{\|\delta_r\|_2}, \qquad \beta = \beta_{\mathrm{LN}} \cdot \beta_{\mathrm{RC}} = \frac{\|\delta_z\|_2}{\|\delta_o\|_2} \tag{14}$$
where $\beta$ (or model ratio), $\beta_{\mathrm{LN}}$ (or LN ratio) and $\beta_{\mathrm{RC}}$ (or RC ratio) measure the gradient norm ratio of the whole residual block, the layer normalization and the residual connection respectively. Informally, a neural model should preserve the gradient norm between layers ($\beta \approx 1$) so as to allow training of very deep models (see Zaeemzadeh et al., 2018).
We resort to empirical evidence to analyze these ratios. Results in Table 1 show that LN weakens the error signal ($\beta_{\mathrm{LN}} < 1$) but RC strengthens it ($\beta_{\mathrm{RC}} > 1$). One explanation for LN's decay effect is the large output variance of RC ($\mathrm{Var}(r) > 1$), which negatively affects $\delta_r$ as shown in Eq. 12. By contrast, the short-cut in RC ensures that the error signal at the higher layer $\delta_r$ can always be safely carried on to the lower layer no matter how complex $\frac{\partial f}{\partial z}$ is (Eq. 13), increasing the ratio.
Table 1: Empirical measure of output variance Var(r) of RC and error signal change ratio β_LN, β_RC and β (Eq. 14) averaged over 12 layers. These values are estimated with ∼3k target tokens at the beginning of training using a 12-layer Transformer. "Base": the baseline Transformer. "Ours": the Transformer with DS-Init. Enc and Dec stand for encoder and decoder respectively. Self, Cross and FFN indicate the self-attention, encoder-decoder attention and the feed-forward sublayer respectively.
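To make the three ratios concrete, the following sketch measures them on a toy residual block. It is an illustration under stated assumptions, not the measurement procedure behind Table 1: the sublayer is a single random linear map f(z) = zW, the LN gain is fixed to 1, inputs and incoming error signals are standard Gaussian, and `gradient_ratios` and its `scale` argument are hypothetical names. Vectors are stored as rows, so the error signal through the residual connection is `delta_r + delta_r @ W.T`.

```python
import numpy as np

def layer_norm_backward(r, delta_o):
    """Back-propagate delta_o through LN (g = 1, b = 0) over the last axis of r."""
    mu = r.mean(axis=-1, keepdims=True)
    sigma = r.std(axis=-1, keepdims=True)
    r_bar = (r - mu) / sigma
    centered = delta_o - delta_o.mean(axis=-1, keepdims=True)
    projected = r_bar * (delta_o * r_bar).mean(axis=-1, keepdims=True)
    return (centered - projected) / sigma

def gradient_ratios(scale, d=512, batch=64, seed=0):
    """Estimate Var(r), beta_LN, beta_RC and beta for r = z + zW, o = LN(r).

    `scale` multiplies the Glorot-style uniform bound of W (assumed toy setup).
    """
    rng = np.random.default_rng(seed)
    gamma = scale * np.sqrt(6.0 / (2 * d))
    W = rng.uniform(-gamma, gamma, size=(d, d))
    z = rng.standard_normal((batch, d))
    delta_o = rng.standard_normal((batch, d))   # error signal at the LN output

    r = z + z @ W                                # residual connection
    delta_r = layer_norm_backward(r, delta_o)    # error signal at r
    delta_z = delta_r + delta_r @ W.T            # error signal at z

    def norm(x):
        return np.linalg.norm(x, axis=-1)

    return {
        "var_r": float(r.var(axis=-1).mean()),
        "beta_ln": float((norm(delta_r) / norm(delta_o)).mean()),
        "beta_rc": float((norm(delta_z) / norm(delta_r)).mean()),
        "beta": float((norm(delta_z) / norm(delta_o)).mean()),
    }

for scale in (1.0, 1.0 / np.sqrt(12)):
    print(round(float(scale), 3), gradient_ratios(scale))
```

In this toy setting a smaller parameter scale lowers Var(r) and moves the LN ratio closer to 1, which matches the direction of the argument in the text; the absolute numbers are not comparable to those of a trained or full Transformer.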

Depth-Scaled Initialization
Results on the model ratio show that self-attention sublayer has a (near) increasing effect (β > 1) that intensifies error signal, while feed-forward sublayer manifests a decreasing effect (β < 1).
In particular, though the encoder-decoder attention sublayer and the self-attention sublayer share the same attention formulation, the model ratio of the former is smaller. As shown in Eq. 7 and 1, part of the reason is that encoder-decoder attention can only back-propagate gradients to lower layers through the query representation Q, bypassing gradients at the key K and the value V to the encoder side. This negative effect explains why the decoder suffers from more severe gradient vanishing than the encoder in Figure 1. The gradient norm is preserved better through the self-attention layer than the encoder-decoder attention, which offers insights on the successful training of the deep Transformer in BERT (Devlin et al., 2019) and GPT (Radford et al., 2018), where encoder-decoder attention is not involved. However, results in Table 1 also suggest that the self-attention sublayer in the encoder is not strong enough to counteract the gradient loss in the feed-forward sublayer. That is why BERT and GPT adopt a much smaller standard deviation (0.02) for initialization, in a similar spirit to our solution.
We attribute the gradient vanishing issue to the large output variance of RC (Eq. 12). Considering that activation variance is positively correlated with parameter variance (Glorot and Bengio, 2010), we propose DS-Init and change the original initialization method in Eq. 9 as follows:
$$W \in \mathbb{R}^{d_i \times d_o} \sim \mathcal{U}\!\left(-\gamma \frac{\alpha}{\sqrt{l}},\; \gamma \frac{\alpha}{\sqrt{l}}\right) \tag{15}$$
where $\alpha$ is a hyperparameter in the range of $[0, 1]$ and $l$ denotes layer depth. Hyperparameter $\alpha$ improves the flexibility of our method. Compared with existing approaches, our solution does not require modifications in the model architecture and hence is easy to implement. According to the property of the uniform distribution, the variance of model parameters decreases from $\frac{\gamma^2}{3}$ to $\frac{\gamma^2 \alpha^2}{3l}$ after applying DS-Init. By doing so, a higher layer would have smaller output variance of RC so that more gradients can flow back. Results in Table 1 suggest that DS-Init narrows both the variance and the different ratios to be ∼1, ensuring the stability of gradient back-propagation. Evidence in Figure 1 also shows that DS-Init helps keep the gradient norm and slightly increases it on the encoder side. This is because DS-Init endows lower layers with parameters of larger variance and activations of larger norm. When error signals at different layers are of similar scale, the gradient norm at lower layers would be larger. Nevertheless, this increase does not hurt model training based on our empirical observation.
DS-Init is partially inspired by the Fixup initialization (Zhang et al., 2019). Both of them try to reduce the output variance of RC. The difference is that Fixup focuses on overcoming gradient explosion caused by consecutive RCs and seeks to enable training without LN, but at the cost of carefully handling parameter initialization of each matrix transformation, including manipulating initialization of different bias and scale terms. Instead, DS-Init aims at solving gradient vanishing in deep Transformers caused by the structure of RC followed by LN. We still employ LN to standardize layer activation and improve model convergence. The inclusion of LN ensures the stability and simplicity of DS-Init.
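As a rough illustration of the depth-scaled initialization described in this section, the sketch below samples a weight matrix from a uniform distribution whose bound is shrunk by α/√l, and compares the empirical variance with the value γ²α²/(3l) stated in the text. The helper names `ds_init_bound` and `ds_init` are hypothetical and NumPy is an assumed choice; this is not the released implementation.

```python
import numpy as np

def ds_init_bound(d_in, d_out, layer, alpha=1.0):
    """Uniform bound gamma * alpha / sqrt(l), with gamma = sqrt(6 / (d_in + d_out))."""
    if layer < 1:
        raise ValueError("layer depth starts from 1")
    gamma = np.sqrt(6.0 / (d_in + d_out))
    return gamma * alpha / np.sqrt(layer)

def ds_init(d_in, d_out, layer, alpha=1.0, rng=None):
    """Sample a (d_in, d_out) weight matrix for the given layer depth."""
    rng = np.random.default_rng() if rng is None else rng
    bound = ds_init_bound(d_in, d_out, layer, alpha)
    return rng.uniform(-bound, bound, size=(d_in, d_out))

rng = np.random.default_rng(0)
d = 512
for layer in (1, 6, 12):
    W = ds_init(d, d, layer, rng=rng)
    predicted = ds_init_bound(d, d, layer) ** 2 / 3.0   # gamma^2 alpha^2 / (3 l)
    print(f"layer {layer:2d}: empirical var {W.var():.6f}, predicted {predicted:.6f}")
```

With `layer=1` and `alpha=1.0` the bound reduces to the unscaled γ, so the first layer keeps the original initialization and only deeper layers are scaled down.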

Merged Attention Model
With large model depth, deep Transformer unavoidably introduces high computational overhead. This brings about significantly longer training and decoding time. To remedy this issue, we propose a merged attention model for the decoder that integrates a simplified average-based self-attention sublayer into the encoder-decoder attention sublayer. Figure 2 highlights the difference.
The AAN model (Figure 2(b)), as an alternative to the self-attention model (Figure 2(a)), accelerates Transformer decoding by allowing decoding in linear time, avoiding the $O(n^2)$ complexity of the self-attention mechanism. Unfortunately, the gating sublayer and the feed-forward sublayer inside AAN reduce the empirical performance improvement. We propose a simplified AAN by removing all matrix computation except for two linear projections:
$$\mathrm{SAAN}(S^{l-1}) = \left[M_a \left(S^{l-1} W_v\right)\right] W_o \tag{16}$$
where $M_a$ denotes the average mask matrix for parallel computation. This new model is then combined with the encoder-decoder attention as shown in Figure 2(c):
$$\bar{S}^l = \mathrm{LN}\left(\mathrm{RC}\left(S^{l-1}, \mathrm{SAAN}(S^{l-1}) + \mathrm{ATT}(S^{l-1}, H^L)\right)\right) \tag{17}$$
The mapping $W_o$ is shared for SAAN and ATT. After combination, MAtt allows for the parallelization of AAN and encoder-decoder attention.
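The simplified AAN and its combination with encoder-decoder attention can be sketched as follows. This is a single-head NumPy illustration under assumptions, not the released code: the average mask is taken to be the lower-triangular cumulative-average matrix (row j averages positions 1..j), the parameter names `W_v_self`, `W_q`, `W_k` and `W_v_enc` are hypothetical, and the residual connection and layer normalization around the sublayer are omitted.

```python
import numpy as np

def average_mask(m):
    """Lower-triangular matrix whose j-th row averages positions 1..j."""
    mask = np.tril(np.ones((m, m)))
    return mask / mask.sum(axis=-1, keepdims=True)

def saan(S, W_v, W_o):
    """Simplified AAN for a whole target sequence: [M_a (S W_v)] W_o."""
    return average_mask(S.shape[0]) @ (S @ W_v) @ W_o

def saan_step(running_sum, step, s_t, W_v, W_o):
    """Incremental decoding: constant work per step via a running sum of values."""
    running_sum = running_sum + s_t @ W_v
    return running_sum, (running_sum / step) @ W_o

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def merged_attention(S, H, W_v_self, W_q, W_k, W_v_enc, W_o):
    """SAAN(S) + ATT(S, H) with a shared output projection W_o (single head)."""
    d = S.shape[-1]
    averaged = average_mask(S.shape[0]) @ (S @ W_v_self)
    context = softmax((S @ W_q) @ (H @ W_k).T / np.sqrt(d)) @ (H @ W_v_enc)
    return (averaged + context) @ W_o   # the two branches are independent of each other

rng = np.random.default_rng(0)
m, n, d = 5, 7, 8                      # target length, source length, model size
S, H = rng.standard_normal((m, d)), rng.standard_normal((n, d))
W_v_self, W_q, W_k, W_v_enc, W_o = (rng.standard_normal((d, d)) * 0.1 for _ in range(5))
print(merged_attention(S, H, W_v_self, W_q, W_k, W_v_enc, W_o).shape)
```

Because the averaged branch and the encoder-decoder branch do not depend on each other, they can be evaluated in parallel before the shared projection; `saan_step` shows why the averaging branch needs only a running sum at decoding time.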

Datasets and Evaluation
We take WMT14 English-German translation (En-De) (Bojar et al., 2014) as our benchmark for model analysis, and examine the generalization of our approach on four other tasks: WMT14 English-French (En-Fr), IWSLT14 German-English (De-En) (Cettolo et al., 2014), WMT18 English-Finnish (En-Fi) and WMT18 Chinese-English (Zh-En) (Bojar et al., 2018). The byte pair encoding algorithm (BPE) (Sennrich et al., 2016) is used in preprocessing to handle low frequency words. Statistics of different datasets are listed in Table 2. Except for the IWSLT14 De-En task, we collect subword units independently on the source and target side of training data. We directly use the preprocessed training data from the WMT18 website for En-Fi and Zh-En tasks, and use newstest2017 as our development set and newstest2018 as our test set. Our training data for WMT14 En-De and WMT14 En-Fr is identical to previous setups (Vaswani et al., 2017; Wu et al., 2019). We use newstest2013 as development set for WMT14 En-De and newstest2012+2013 for WMT14 En-Fr. Apart from the newstest2014 test set, we also evaluate our model on all WMT14-18 test sets for WMT14 En-De translation. The settings for IWSLT14 De-En are as in Ranzato et al. (2016), with 7584 sentence pairs for development, and the concatenated dev sets for IWSLT 2014 as test set (tst2010, tst2011, tst2012, dev2010, dev2012).
We report tokenized case-sensitive BLEU (Papineni et al., 2002) for WMT14 En-De and WMT14 En-Fr, and provide detokenized case-sensitive BLEU for WMT14 En-De, WMT18 En-Fi and Zh-En with sacreBLEU (Post, 2018). We also report the chrF score for En-Fi translation, which was found to correlate better with human evaluation (Bojar et al., 2018). Following previous work (Wu et al., 2019), we evaluate IWSLT14 De-En with tokenized case-insensitive BLEU.

Model Settings
We experiment with both base (layer size 512/2048, 8 heads) and big (layer size 1024/4096, 16 heads) settings as in Vaswani et al. (2017). Besides the vanilla Transformer, we also compare with the structure that is currently default in tensor2tensor (T2T), which puts layer normalization before residual blocks. We use an in-house toolkit for all experiments.
Dropout is applied to the residual connection (dp_r) and attention weights (dp_a). We share the target embedding matrix with the softmax projection matrix but not with the source embedding matrix. We train all models using the Adam optimizer (0.9/0.98 for base, 0.9/0.998 for big) with an adaptive learning rate schedule (warm-up step 4K for base, 16K for big) as in Vaswani et al. (2017) and label smoothing of 0.1. We set α in DS-Init to 1.0. Sentence pairs containing around 25K∼50K (bs) target tokens are grouped into one batch. We use a relatively larger batch size and dropout rate for deeper and bigger models for better convergence. We perform evaluation by averaging the last 5 checkpoints. Besides, we apply mixed-precision training to all big models. Unless otherwise stated, we train base and big models with 300K maximum steps, and decode sentences using beam search with a beam size of 4 and length penalty of 0.6. Decoding is implemented with a cache to save redundant computations. Other settings for specific translation tasks are explained in the individual subsections.
Table 3 (partial caption): ... 5K steps with a batch size of 1K target tokens. Time is averaged over 3 runs using TensorFlow on a single TITAN X (Pascal). "-": optimization failed and no result. " ": the same as model 1. † and ‡: comparison against 11 and 14 respectively rather than 1. Base: the baseline Transformer with base setting. Bold indicates best BLEU score. dp_a and dp_r: dropout rate on attention weights and residual connection. bs: batch size in tokens.
We also compare our simplified AAN in MAtt (4) with two variants: a self-attention network (6), and the original AAN (7). Results show minor differences in translation quality, but improvements in training and decoding speed, and a reduction in the number of model parameters. Compared to the baseline, MAtt improves decoding speed by 50%, and training speed by 10%, while having 9% fewer parameters.

WMT14 En-De Translation Task
Result 9 indicates that the gradient vanishing issue prevents training of deep vanilla Transformers, which cannot be solved by only simplifying the decoder via MAtt (10). By contrast, both T2T and DS-Init can help. Our DS-Init improves norm preservation through specific parameter initialization, while T2T reschedules the LN position. Results in Table 3 show that T2T underperforms DS-Init by 0.2 BLEU on average, and slightly increases training and decoding time (by 2%) compared to the original Transformer due to additional LN layers. This suggests that our solution is more effective and efficient. Surprisingly, training deep Transformers with both DS-Init and MAtt improves not only running efficiency but also translation quality (by 0.2 BLEU), compared with DS-Init alone. To get an improved understanding, we analyze model performance on both training and development set. Results in Table 4 show that models with DS-Init yield the best perplexity on both training and development set, and those with T2T achieve the best BLEU on the training set. However, DS-Init+MAtt performs best in terms of BLEU on the development set. This indicates that the success of DS-Init+MAtt comes from its better generalization rather than better fitting training data.
We also attempt to apply DS-Init on the encoder alone or the decoder alone for 12-layer models. Unfortunately, both variants lead to unstable optimization where gradients tend to explode at the [...]
Figure 3 shows that deeper Transformers yield better performance. However, improvements are steepest going from 6 to 12 layers, and further improvements are small.
Table 6 lists the results in the big setting and compares with current SOTA. Big models are trained with dp_a = 0.1 and dp_r = 0.3. The 6-layer baseline and the deeper ones are trained with batch size of 48K and 54K respectively. Deep Transformer with our method outperforms its 6-layer counterpart by over 0.4 points on newstest2014 and around 0.1 point on newstest2014∼newstest2018. Our model outperforms the transparent model (+1.58 BLEU), an approach for the deep encoder. Our model performs on par with current SOTA, the dynamic convolution model (DCNN) (Wu et al., 2019). In particular, though DCNN achieves encouraging performance on newstest2014, it falls behind the baseline on other test sets. By contrast, our model obtains more consistent performance improvements.

Comparison with Existing Work
In work concurrent to ours, Wang et al. (2019) discuss how the placement of layer normalization affects deep Transformers, and compare the original post-norm (which we consider our baseline) and a pre-norm layout (which we call T2T). Their results also show that pre-norm allows training of deeper Transformers. Our results show that deep post-norm Transformers are also trainable with appropriate initialization, and tend to give slightly better results.
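The two layouts compared here differ only in where layer normalization sits relative to the residual connection. The sketch below contrasts them on a toy stack; it is an illustration under assumptions (gain and bias omitted, each sublayer replaced by a random linear map, function names hypothetical), not a reproduction of either system.

```python
import numpy as np

def layer_norm(z, eps=1e-6):
    """Layer normalization over the last axis, without gain and bias."""
    mu = z.mean(axis=-1, keepdims=True)
    sigma = z.std(axis=-1, keepdims=True)
    return (z - mu) / (sigma + eps)

def post_norm_block(z, f):
    """Original layout: residual connection first, then LN."""
    return layer_norm(z + f(z))

def pre_norm_block(z, f):
    """Pre-norm layout: LN inside the residual branch, identity path untouched."""
    return z + f(layer_norm(z))

rng = np.random.default_rng(0)
d, depth = 256, 12
gamma = np.sqrt(6.0 / (2 * d))
weights = [rng.uniform(-gamma, gamma, size=(d, d)) for _ in range(depth)]

x_post = x_pre = rng.standard_normal((8, d))
for W in weights:
    f = lambda x, W=W: x @ W          # toy stand-in for an attention or FFN sublayer
    x_post = post_norm_block(x_post, f)
    x_pre = pre_norm_block(x_pre, f)

# Post-norm re-normalizes after every block; pre-norm lets the variance on the
# identity path accumulate with depth.
print(float(x_post.var(axis=-1).mean()), float(x_pre.var(axis=-1).mean()))
```

In the pre-norm layout the identity path is never normalized, so gradients reach lower layers without passing through LN; in the post-norm layout every block output passes through LN, which is where the output variance of the residual connection, and hence the initialization scale, matters.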

Results on Other Translation Tasks
We use 12 layers for our model in these tasks. We enlarge the dropout rate to dp_a = 0.3, dp_r = 0.5 for the IWSLT14 De-En task and train models on WMT14 En-Fr and WMT18 Zh-En with 500K steps. Other models are trained with the same settings as in WMT14 En-De.
We report translation results on other tasks in Table 5. Results show that our model beats the baseline on all tasks with gains of over 1 BLEU, except for WMT18 En-Fi, where our model yields marginal BLEU improvements (+0.3 BLEU). We argue that this is due to the rich morphology of Finnish, and BLEU's inability to measure improvements below the word level. We also provide the chrF score, in which our model gains 0.6 points.
In addition, speed measures show that though our model consumes 50+% more training time, there is only a small difference with respect to decoding time thanks to MAtt.

Analysis of Training Dynamics
Our analysis in Figure 1 and Table 1 is based on gradients estimated exactly after parameter initialization without considering training dynamics. Optimizers with adaptive step rules, such as Adam, could have an adverse effect that enables gradient scale correction through the accumulated first and second moments. However, results in Figure 4 show that without DS-Init, the encoder gradients are less stable and the decoder gradients still suffer from the vanishing issue, particularly at the first layer. DS-Init makes the training more stable and robust.

Conclusion and Future Work
This paper discusses training of very deep Transformers. We show that the training of deep Transformers suffers from gradient vanishing, which we mitigate with depth-scaled initialization. To improve training and decoding efficiency, we propose a merged attention sublayer that integrates a simplified average-based self-attention sublayer into the encoder-decoder attention sublayer. Experimental results show that deep models trained with these techniques clearly outperform a vanilla Transformer with 6 layers in terms of BLEU, and outperform other solutions to train deep Transformers. Thanks to the more efficient merged attention sublayer, we achieve these quality improvements while matching the decoding speed of the baseline model.
In the future, we would like to extend our model to other sequence-to-sequence tasks, such as summarization and dialogue generation, as well as adapt the idea to other generative architectures (Zhang et al., 2016). We have trained models with up to 30 layers each for the encoder and decoder, and while training was successful and improved over shallower counterparts, gains are relatively small beyond 12 layers. An open question is whether there are other structural issues that limit the benefits of increasing the depth of the Transformer architecture, or whether the benefit of very deep models is greater for other tasks and datasets.