Understanding the Difficulty of Training Transformers

Transformers have proved effective in many NLP tasks. However, training them is non-trivial and requires carefully designed cutting-edge optimizers and learning rate schedulers (e.g., conventional SGD fails to train Transformers effectively). Our objective here is to understand $\textit{what complicates Transformer training}$ from both empirical and theoretical perspectives. Our analysis reveals that unbalanced gradients are not the root cause of training instability. Instead, we identify an amplification effect that influences training substantially: for each layer in a multi-layer Transformer model, heavy dependency on its residual branch makes training unstable, since it amplifies small parameter perturbations (e.g., parameter updates) into significant disturbances in the model output. Yet we observe that a light dependency limits the model's potential and leads to inferior trained models. Inspired by our analysis, we propose Admin ($\textbf{Ad}$aptive $\textbf{m}$odel $\textbf{in}$itialization) to stabilize the early stage of training and unleash the model's full potential in the late stage. Extensive experiments show that Admin is more stable, converges faster, and leads to better performance. Implementations are released at: https://github.com/LiyuanLucasLiu/Transformer-Clinic.


Introduction
Transformers (Vaswani et al., 2017) have led to a series of breakthroughs in various deep learning tasks (Devlin et al., 2019; Velickovic et al., 2018). They do not contain recurrent connections and can parallelize all computations in the same layer, thus improving effectiveness, efficiency, and scalability. Training Transformers, however, requires extra effort. For example, although stochastic gradient descent (SGD) is the standard algorithm for conventional RNNs and CNNs, it converges to bad/suspicious local optima for Transformers (Zhang et al., 2019b). Moreover, compared to other neural architectures, removing the warmup stage in Transformer training results in more severe consequences such as model divergence (Popel and Bojar, 2018; Liu et al., 2020a).
Here, we conduct comprehensive analyses in empirical and theoretical manners to answer the question: what complicates Transformer training.
Our analysis starts from the observation: the original Transformer (referred to as Post-LN) is less robust than its Pre-LN variant 2 (Baevski and Auli, 2019; Xiong et al., 2019; Nguyen and Salazar, 2019). We recognize that the gradient vanishing issue is not the direct reason causing this difference, since fixing this issue alone cannot stabilize Post-LN training. This implies that, besides unbalanced gradients, there exist other factors that influence model training greatly.
With further analysis, we recognize that for each Transformer residual block, the dependency on its residual branch 3 plays an essential role in training stability. First, we find that a Post-LN layer has a heavier dependency on its residual branch than a Pre-LN layer. As in Figure 7, at initialization, a Pre-LN layer has roughly the same dependency on its residual branch and any previous layer, whereas a Post-LN layer has a stronger dependency on its residual branch (more discussions are elaborated in Section 4.1). We find that the strong dependencies of Post-LN amplify fluctuations brought by parameter changes and destabilize training (as in Theorem 2 and Figure 4). Besides, the loose reliance on residual branches in Pre-LN generally limits the model's potential and often produces inferior models.
In light of our analysis, we propose Admin, an adaptive initialization method which retains the merits of Pre-LN stability without hurting the performance. It restricts the layer dependency on its residual branches in the early stage and unleashes the model potential in the late stage. We conduct experiments on IWSLT'14 De-En, WMT'14 En-De, and WMT'14 En-Fr; Admin is more stable, converges faster, and achieves better performance. For example, without introducing any additional hyper-parameters, Admin successfully stabilizes 72-layer Transformer training on WMT'14 En-Fr and achieves a 43.80 BLEU score.
3 For a residual block x + f(x), its shortcut output refers to x, its residual branch output refers to f(x), and the dependency on its residual branch refers to Var[f(x)] / Var[x + f(x)].
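The dependency ratio defined in footnote 3 is easy to estimate numerically. Below is a minimal sketch (not the paper's code) that measures Var[f(x)] / Var[x + f(x)] for synthetic residual branches of different strengths; the two branch functions are arbitrary stand-ins:

```python
import numpy as np

def residual_dependency(fx, x):
    """Var[f(x)] / Var[x + f(x)]: the dependency of a residual block
    x + f(x) on its residual branch (footnote 3)."""
    return fx.var() / (x + fx).var()

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)

# A residual branch with small output variance -> light dependency.
weak = residual_dependency(0.1 * rng.standard_normal(100_000), x)
# A residual branch with large output variance -> heavy dependency.
strong = residual_dependency(2.0 * rng.standard_normal(100_000), x)

assert weak < 0.05 < 0.5 < strong
```

With unit-variance shortcut input, the weak branch yields a ratio near 0.01 while the strong branch yields a ratio near 0.8, matching the weighted-average intuition developed in Section 4.1.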

Preliminaries
Transformer Architectures and Notations. The Transformer architecture contains two types of sub-layers, i.e., Attention sub-layers and Feedforward (FFN) sub-layers. They are composed of mainly three basic modules (Vaswani et al., 2017), i.e., Layer Norm (f_LN), Multi-head Attention (f_ATT), and Feedforward Network (f_FFN). As illustrated in Figure 2, the Pre-LN Transformer and the Post-LN Transformer organize these modules differently. For example, a Pre-LN encoder organizes the Self-Attention sub-layer as x^(pe)_{2i-1} = x^(pe)_{2i-2} + f_S-ATT(f_LN(x^(pe)_{2i-2})), where x^(pe)_{2i-1} is the output of the i-th Self-Attention sub-layer. Here, we refer to f_S-ATT(f_LN(x^(pe)_{2i-2})) as the residual branch and its output as the residual output, in contrast to layer/sub-layer outputs, which integrate residual outputs and shortcut outputs.
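The two organizations differ only in where f_LN is applied. A minimal sketch of one sub-layer in both variants, using a Linear layer as a hypothetical stand-in for f_ATT / f_FFN (illustrative code, not the released implementation):

```python
import torch
import torch.nn as nn

class SubLayer(nn.Module):
    """One Transformer sub-layer, organized Pre-LN or Post-LN.
    `f` stands in for f_ATT or f_FFN (a Linear layer here)."""
    def __init__(self, dim, pre_ln=True):
        super().__init__()
        self.f = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)
        self.pre_ln = pre_ln

    def forward(self, x):
        if self.pre_ln:                   # Pre-LN:  x + f(LN(x))
            return x + self.f(self.norm(x))
        return self.norm(x + self.f(x))   # Post-LN: LN(x + f(x))

x = torch.randn(2, 5, 16)
pre, post = SubLayer(16, pre_ln=True), SubLayer(16, pre_ln=False)
assert pre(x).shape == post(x).shape == x.shape
```

Note that in the Pre-LN branch the shortcut output x bypasses normalization entirely, while in the Post-LN branch it is normalized together with the residual output; this difference drives the analysis in Section 4.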
Notation elaborations are shown in Figure 2. In particular, we use superscripts to indicate network architectures (i.e., the Pre-LN Encoder), use subscripts to indicate layer indexes (top layers have larger indexes), all inputs and outputs are formulated as Sequence-Len × Hidden-Dim.
Layer Norm. Layer norm (Ba et al., 2016) is defined as f_LN(x) = γ (x − µ)/σ + ν, where µ and σ are the mean and standard deviation of x.
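A direct NumPy transcription of this formula, normalizing each row independently (the eps term is a standard numerical-stability addition, not part of the formula above):

```python
import numpy as np

def layer_norm(x, gamma=1.0, nu=0.0, eps=1e-5):
    """f_LN(x) = gamma * (x - mu) / sigma + nu, applied per row."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + nu

x = np.random.default_rng(0).normal(3.0, 2.0, size=(4, 512))
y = layer_norm(x)

# Each output row has (approximately) zero mean and unit variance.
assert np.allclose(y.mean(axis=-1), 0.0, atol=1e-6)
assert np.allclose(y.std(axis=-1), 1.0, atol=1e-3)
```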
Multi-head Attention. Multi-head Attention allows the network to have multiple focuses in a single layer and plays a crucial role in many tasks (Chen et al., 2018).
It is defined as (with H heads): f_ATT(q, k, v) = Σ_{h=1}^{H} f_s(q W_h^(Q) W_h^(K)T k^T) v W_h^(V1) W_h^(V2), where f_s is the row-wise softmax function, W_h^(Q), W_h^(K), and W_h^(V1) are D × (D/H) matrices, W_h^(V2) is a (D/H) × D matrix, and D is the hidden state dimension. Parameters without a head subscript refer to the concatenation of all H head parameters, e.g., W^(Q). Transformers use Encoder-Attention (f_E-ATT(x) = f_ATT(x, x^(•e), x^(•e)), where x^(•e) is the encoder output) and Self-Attention (f_S-ATT(x) = f_ATT(x, x, x)).
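As a concrete reference, here is a minimal NumPy sketch of this H-head formulation. The per-head loop and the value factorization into W^(V1) W^(V2) mirror the notation above; the common 1/√d attention temperature is omitted to stay close to that notation, and all matrix names are illustrative:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax f_s."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(q, k, v, Wq, Wk, Wv1, Wv2):
    """f_ATT(q, k, v) with H heads: per head h, attention weights are
    f_s(q Wq_h (k Wk_h)^T); values are projected through Wv1_h and the
    output matrix Wv2_h, and head outputs are summed."""
    out = 0.0
    for Wq_h, Wk_h, Wv1_h, Wv2_h in zip(Wq, Wk, Wv1, Wv2):
        scores = softmax((q @ Wq_h) @ (k @ Wk_h).T)
        out = out + scores @ (v @ Wv1_h) @ Wv2_h
    return out

rng = np.random.default_rng(0)
D, H, L = 16, 4, 7                       # hidden dim, heads, seq len
Wq, Wk, Wv1 = (rng.standard_normal((H, D, D // H)) for _ in range(3))
Wv2 = rng.standard_normal((H, D // H, D))
x = rng.standard_normal((L, D))
y = multi_head_attention(x, x, x, Wq, Wk, Wv1, Wv2)  # f_S-ATT(x)
assert y.shape == (L, D)
```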

Unbalanced Gradients
In this study, we strive to answer the question: what complicates Transformer training. Our analysis starts from the observation: Pre-LN training is more robust than Post-LN, while Post-LN is more likely to reach a better performance than Pre-LN. In a parameter grid search (as in Figure 10), Pre-LN converges in all 15 settings, and Post-LN diverges in 7 out of 15 settings; when Post-LN converges, it outperforms Pre-LN in 7 out of 8 settings. We seek to reveal the underlying factor that destabilizes Post-LN training and restricts the performance of Pre-LN.
In this section, we focus on the unbalanced gradients (e.g., gradient vanishing). We find that, although Post-LN suffers from gradient vanishing and Pre-LN does not, gradient vanishing is not the direct reason causing the instability of Post-LN. Specifically, we first theoretically and empirically establish that only Post-LN decoders suffer from gradient vanishing and Post-LN encoders do not. We then observe that fixing the gradient vanishing issue alone cannot stabilize training.

Gradients at Initialization
As gradient vanishing can hamper convergence from the beginning, it has been regarded as the major issue causing unstable training. Also, recent studies show that this issue exists in the Post-LN Transformer, even after using residual connections (Xiong et al., 2019). Below, we establish that only Post-LN decoders suffer from gradient vanishing; Post-LN encoders, Pre-LN encoders, and Pre-LN decoders do not.
We use ∆x to denote gradients, i.e., ∆x = ∂L/∂x, where L is the training objective. Following previous studies (Glorot and Bengio, 2010), we analyze the gradient distribution at the very beginning of training and find that only Encoder-Attention sub-layers in Post-LN suffer from gradient vanishing. First, we conduct the analysis from a theoretical perspective.
[Figure 4: output changes under random perturbations and gradient updates as the number of FFN or Self-Attention sub-layers in the encoder grows; R^2 = 0.99 in both cases. The update magnitude is consistent, even with unbalanced gradients.]
[Figure 5: gradient distributions in an 18-layer Pre-LN encoder's Self-Attention sub-layers; lighter colors indicate higher layers.]
To make sure that the assumptions of our theoretical analysis match the real-world situation, we further conduct empirical verification. At initialization, we calculate ||∆x_i||_2 for each sub-layer and visualize the results in Figure 3. This verifies that only Post-LN decoders suffer from gradient vanishing. Besides, we observe that the drop in gradient norms mostly happens in the backpropagation from encoder-attention outputs (encoder-attention bars) to their inputs (self-attention bars, since the output of self-attention is the input of encoder-attention). This pattern is further explained in Appendix A.3.

Impact of the Gradient Vanishing
Now, we explore whether gradient vanishing is the direct cause of training instability.
First, we design a controlled experiment to show the relationship between gradient vanishing and training stability. We construct a hybrid Transformer by combining a Post-LN encoder and a Pre-LN decoder. As in Section 3.1, this hybrid model is free of gradient vanishing, yet its training is not more stable than Post-LN, which shows that fixing gradient vanishing alone cannot stabilize training. We also find that the unbalanced gradient distribution is mostly addressed by adaptive optimizers (which explains why vanilla SGD fails on Transformers, i.e., it lacks the ability to handle unbalanced gradients) and necessitates using adaptive optimizers. More discussions are included in Appendix A.4.

Instability from Amplification Effect
We find that unbalanced gradients are not the root cause of the instability of Post-LN, which implies the existence of other factors influencing model training. Now, we go beyond gradient vanishing and introduce the amplification effect. Specifically, we first examine the difference between Pre-LN and Post-LN, including their early-stage and late-stage training. Then, we show that Post-LN's training instability is attributed to the amplification effect of layer dependency, which intensifies gradient updates and destabilizes training.

Impact of Layer Norms Positions
As described in Section 2, both Pre-LN and Post-LN employ layer norm to regularize inputs and outputs. Different residual outputs are aggregated and normalized in residual networks before serving as inputs of other layers (i.e., residual outputs are scaled to ensure the integrated input has a consistent variance). To some extent, layer norm treats the variance of residual outputs as weights to average them. For example, for Post-LN Self-Attention, we have x_{2i-1} = f_LN(x_{2i-2} + f_S-ATT(x_{2i-2})): increasing the variance of f_S-ATT(x_{2i-2}) increases its proportion in x_{2i-1} but decreases the proportion of other residual outputs. Intuitively, this is similar to the weight mechanism of the weighted average.
The position of layer norms is the major difference between Pre-LN and Post-LN and makes them aggregate residual outputs differently (i.e., using different weights). As in Figure 6, all residual outputs in Pre-LN are normalized only once before feeding into other layers (thus only treating residual output variances as weights); in Post-LN, most residual outputs are normalized more than once, and different residual outputs are normalized different numbers of times. For example, if all layers are initialized in the same way, output variances of different Pre-LN residual branches would be similar, and the aggregation would be similar to a simple average. For Post-LN, nearby residual outputs are normalized fewer times than others, thus having relatively larger weights. We proceed to calculate and analyze these weights to understand the impact of layer norm positions. First, we use â_i to refer to a_i / √Var[a_i] (i.e., the normalized output of the i-th residual branch) and x̂_i to refer to x_i / √Var[x_i] (i.e., the normalized output of the i-th layer, or the normalized input of the (i+1)-th residual branch). Then, we describe their relationship as x̂_i = Σ_{j≤i} β_{i,j} â_j, where β_{i,j} integrates the scaling operations of all layer norms (including √Var[a_j]). Intuitively, β_{i,j} describes the proportion of the j-th residual branch output in the i-th layer output, and thus reflects the dependency among layers.
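Under the simplifying assumption that all residual branches produce independent, unit-variance outputs at initialization, the β²_{i,i} weights admit closed forms, and their sums already preview the O(log N) vs. O(N) behavior derived in Section 4.2. A small numeric check (the Post-LN constant 1/2 reflects this unit-variance assumption, not a property of trained models):

```python
import numpy as np

# Simplified setting: every residual branch outputs an independent,
# unit-variance signal a_j.
N = 12

# Pre-LN: x_i = sum_{j<=i} a_j is normalized only once, so
# beta_{i,i}^2 = Var[a_i] / Var[sum_{j<=i} a_j] = 1 / i.
beta_sq_pre = np.array([1.0 / i for i in range(1, N + 1)])

# Post-LN: x_i = LN(x_{i-1} + a_i) renormalizes at every layer; with
# Var[x_{i-1}] = Var[a_i] = 1, beta_{i,i}^2 = 1/2 for every i.
beta_sq_post = np.full(N, 0.5)

# Their sums grow like log N (Pre-LN) vs. linearly in N (Post-LN).
assert np.isclose(beta_sq_pre.sum(), np.log(N), rtol=0.3)
assert np.isclose(beta_sq_post.sum(), N / 2)
```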
We visualize β_{i,j} in Figure 7. For a Post-LN layer, its outputs rely more on its own residual branch from initialization to the end of training. At initialization, Pre-LN layer outputs have roughly the same reliance on all previous residual branches. As training advances, each layer starts to rely more on its own residual output. However, compared to Post-LN, Pre-LN layer outputs in the final model still have less reliance on their residual branches.
Intuitively, it is harder for Pre-LN layers to depend too much on their own residual branches. In Pre-LN, layer outputs (i.e., x^(p•)_i) are not normalized, and their variances are likely to be larger for higher layers 6 . Since β_{i,i} = √Var[a_i] / √Var[x_i], β_{i,i} is likely to be smaller for higher layers, which restricts the i-th layer output from depending too much on its residual branch and inhibits the network from reaching its full potential. In other words, Pre-LN restricts the network from being too deep (i.e., if it is hard to distinguish x̂_{i+1} from x̂_i, appending one layer would be similar to doubling the width of the last layer), while Post-LN gives the network the choice of being wider or deeper.

Amplification Effect at Initialization
Although depending more on residual branches allows the model to have a larger potential, it amplifies the fluctuation brought by parameter changes. For a network x = F(x_0, W), where x_0 is the model input and W is the parameter, the output change caused by parameter perturbations is Var[F(x_0, W) − F(x_0, W*)]. Its relationship with the layer number N is described in Theorem 2, and the derivation is elaborated in Appendix B.
THEOREM 2. Consider an N-layer Transformer x = F(x_0, W) at initialization, where x_0 is the input and W is the parameter. If the layer dependency stays the same after a parameter change (i.e., β_{i,j} has the same value after changing W to W*, where W is randomly initialized and δ = W* − W is independent of W), the output change Var[F(x_0, W) − F(x_0, W*)] can be estimated as Σ_{i=1}^{N} β²_{i,i} C, where C is a constant.
If Var[δ] is the same for all layers, Pre-LN sets β²_{i,i} as approximately 1/i, and Post-LN sets β²_{i,i} as a constant. Thus, we have Corollaries 1 and 2: the output change of an N-layer Pre-LN network is O(log N), while that of an N-layer Post-LN network is O(N). They show that, since Post-LN relies more on residual branches than Pre-LN (i.e., has a larger β²_{i,i}), the perturbation is amplified to a larger magnitude. To empirically verify these relationships, we calculate |F(x_0, W) − F(x_0, W*)|²_2 for both Post-LN and Pre-LN. These relationships match the observation in our experiments (as in Figure 4). For further verification, we measure the correlation magnitudes by R² and find R² = 0.99 in both cases.
Moreover, we replace the random noise δ with optimization updates (i.e., setting W* = W + Adam(∆W), where Adam(·) is the update calculated by the Adam optimizer) and visualize the output shifts. This replacement weakens the correlation between |F − F*|²_2 and N (for Post-LN) or log N (for Pre-LN), i.e., R² = 0.75. Still, as in Figure 4, the output shift |F − F*|²_2 for Post-LN is larger than that of Pre-LN by multiple orders of magnitude.
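This perturbation experiment can be reproduced in miniature: stack N residual sub-layers with Linear stand-in branches, perturb all parameters by a small random δ, and compare |F − F*|²_2 between the two organizations. A sketch under toy assumptions (Linear branches instead of real attention/FFN modules, a final layer norm appended to the Pre-LN stack as in the paper's architecture):

```python
import torch
import torch.nn as nn

class Stack(nn.Module):
    """N residual sub-layers with Linear stand-in branches,
    organized either Pre-LN or Post-LN."""
    def __init__(self, dim, n, pre_ln):
        super().__init__()
        self.pre_ln = pre_ln
        self.fs = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n))
        self.lns = nn.ModuleList(nn.LayerNorm(dim) for _ in range(n))
        self.final = nn.LayerNorm(dim)  # final LN used for Pre-LN

    def forward(self, x):
        for f, ln in zip(self.fs, self.lns):
            x = x + f(ln(x)) if self.pre_ln else ln(x + f(x))
        return self.final(x) if self.pre_ln else x

@torch.no_grad()
def output_shift(model, x, scale=1e-3):
    """|F(x0, W) - F(x0, W + delta)|^2 for a small random delta."""
    y = model(x)
    for p in model.parameters():
        p.add_(scale * torch.randn_like(p))
    return ((model(x) - y) ** 2).sum().item()

torch.manual_seed(0)
x = torch.randn(8, 64)
shift_post = output_shift(Stack(64, 24, pre_ln=False), x)
shift_pre = output_shift(Stack(64, 24, pre_ln=True), x)
assert shift_post > shift_pre  # Post-LN amplifies the perturbation more
```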
Intuitively, large output shifts would destabilize the training (Li et al., 2018). Also, as elaborated in Appendix B, the constant C in Theorem 2 is related to network derivatives and becomes smaller as training advances, which explains why warmup is also helpful for standard SGD. Therefore, we conjecture that it is the large output shift of Post-LN that results in unstable training. We proceed to stabilize Post-LN by controlling the dependency on residual branches in the early stage of training.

Admin -Adaptive Model Initialization
In light of our analysis, we add additional parameters (i.e., ω) to control residual dependencies of Post-LN and stabilize training by adaptively initializing ω to ensure an O(log N ) output change.
Due to different training configurations and model specificities (e.g., different models may use different activation functions and dropout ratios), it is hard to derive a universal initialization method. Instead, we decompose model initialization into two phases: Profiling and Initialization. Specifically, Admin adds new parameters ω and constructs its i-th sub-layer as x_i = f_LN(x_{i-1} · ω_i + f_i(x_{i-1})), where ω_i is a D-dimensional vector and · is the element-wise product. The two phases are then:
Profiling. After initializing the network with a standard method (initializing ω_i as 1), conduct forward propagation without parameter updating and record the output variance of residual branches (i.e., calculate Var[f_i(x_{i-1})]). Since all elements in the same parameter/output matrix are independent of each other and subject to the same distribution, it is sufficient to use a small number of instances in this phase. In our experiments, the first batch (no more than 8192 tokens) is used.
Initialization. Set ω_i = √(Σ_{j<i} Var[f_j(x_{j-1})]) and initialize all other parameters with the same method used in the Profiling phase.
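The two phases can be sketched as follows. This is a toy rendition with Linear stand-in branches; counting the unit-variance input as the first variance term (so that ω_1 = 1) is a choice of this sketch, not necessarily the released implementation:

```python
import torch
import torch.nn as nn

class AdminSubLayer(nn.Module):
    """Post-LN sub-layer with Admin's rescaling: LN(x * omega + f(x)).
    A Linear layer is a stand-in for the residual branch f_i."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Linear(dim, dim)
        self.ln = nn.LayerNorm(dim)
        self.omega = nn.Parameter(torch.ones(dim))
        self.branch_var = None  # Var[f_i(x_{i-1})], set while profiling

    def forward(self, x, profiling=False):
        fx = self.f(x)
        if profiling:
            self.branch_var = fx.var().item()
        return self.ln(x * self.omega + fx)

@torch.no_grad()
def admin_init(layers, batch):
    # Phase 1 (profiling): one forward pass with omega = 1, recording
    # the residual-branch output variances.
    x = batch
    for layer in layers:
        x = layer(x, profiling=True)
    # Phase 2 (initialization): omega_i^2 accumulates the variances of
    # earlier branches (the unit-variance input is the first term here,
    # an assumption of this sketch, so that omega_1 = 1).
    cum = 1.0
    for layer in layers:
        layer.omega.fill_(cum ** 0.5)
        cum += layer.branch_var

torch.manual_seed(0)
layers = nn.ModuleList(AdminSubLayer(64) for _ in range(6))
admin_init(layers, torch.randn(32, 64))
omegas = [float(l.omega[0]) for l in layers]
assert omegas[0] == 1.0
assert all(a < b for a, b in zip(omegas, omegas[1:]))  # shortcut grows
```

The growing ω_i values down-weight each new residual branch relative to the accumulated shortcut, which is exactly how β²_{i,i} is pushed toward 1/i in the early stage.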
In the early stage, Admin sets β²_{i,i} to approximately 1/i and ensures an O(log N) output change, thus stabilizing training. Model training becomes more stable in the late stage (the constant C in Theorem 2 is related to parameter gradients), and each layer has the flexibility to adjust ω and depend more on its residual branch to calculate the layer outputs. After training finishes, Admin can be reparameterized as the conventional Post-LN structure (i.e., removing ω). More implementation details are elaborated in Appendix C.
To verify our intuition, we calculate the layer dependency of 18-layer models and visualize the result in Figure 8. Figures 7 and 8 show that Admin avoids over-large dependencies at initialization and unleashes the potential to make the layer outputs depend more on their residual outputs in the final model. Moreover, we visualize the output change of Admin in Figure 4. Benefiting from the adaptive initialization, the output change of Admin grows at roughly the same rate as Pre-LN, even though it is constructed in the Post-LN manner. Also, although Admin is formulated in a Post-LN manner and suffers from gradient vanishing, 18-layer Admin successfully converges and outperforms 18-layer Pre-LN (as in Table 2). This evidence supports our intuition that the large dependency on residual branches amplifies the output fluctuation and destabilizes training.

Experiments
We conduct experiments on IWSLT'14 De-En, WMT'14 En-De, and WMT'14 En-Fr.More details are elaborated in Appendix D.

Performance Comparison
We use BLEU as the evaluation metric and summarize the model performance in Table 2. On the WMT'14 dataset, we use Transformer-base models with 6, 12, or 18 layers. Admin achieves better performance than Post-LN and Pre-LN in all three settings. Specifically, 12-layer and 18-layer Post-LN diverge without the adaptive initialization. Pre-LN converges in all settings, but it results in sub-optimal performance. Admin not only stabilizes the training of deeper models but also benefits more from the increased model capacity than Pre-LN, which verifies our intuition that the Pre-LN structure limits the model potential. As in Figure 1 and Figure 9, although the 6-layer Pre-LN converges faster than Post-LN, its final performance is worse than Post-LN. In contrast, Admin not only achieves the same convergence speed as Pre-LN in the early stage but also reaches a good performance in the late stage. We use the 6-layer Transformer-small (its hidden dimension is smaller than the base model's) on the IWSLT'14 dataset, and all methods perform similarly. Still, as in Figure 10, Admin outperforms the other two by a small margin. Together with the WMT'14 results, this implies that training stability is related to the number of layers. For shallow networks, the stability difference between Post-LN and Pre-LN is not significant (as in Figure 4), and all methods reach reasonable performance. It is worth mentioning that attention and activation dropouts have an enormous impact on IWSLT'14, which is a smaller dataset than the WMT'14 datasets. To further explore the potential of Admin, we train Transformers of a larger size. Specifically, we expand the Transformer-base configuration to a 60-layer encoder and a 12-layer decoder. As in Table 2, our method achieves a BLEU score of 43.8 on the WMT'14 En-Fr dataset, the new state of the art without using additional annotations (e.g., back-translation). More discussions are conducted in Appendix F to compare this model with the current state of the art. Furthermore, in-depth analyses are
summarized in Liu et al. (2020b), including systematic evaluations of model performance (with TER, METEOR, and BLEU), comprehensive discussions of model dimensions (i.e., depth, head number, and hidden dimension), and fine-grained error analysis. It is worth mentioning that the 60L-12L Admin model achieves a 30.1 BLEU score on WMT'14 En-De (Liu et al., 2020b).

Connection to Warmup
Our previous work (Liu et al., 2020a) establishes that the need for warmup comes from the unstable adaptive learning rates in the early stage. Still, removing the warmup phase results in more severe consequences for Transformers than for other architectures. Also, warmup has been found to be useful for vanilla SGD (Xiong et al., 2019).
Theorem 2 establishes that the output change Var[F(x_0, W) − F(x_0, W*)] can be estimated as Σ_i β²_{i,i} C, where C is related to parameter gradients. In the early stage of training, the network has larger parameter gradients and thus a larger C. Therefore, using a small learning rate at initialization helps to alleviate the massive output shift of Post-LN. We further conduct experiments to explore whether longer warmups can make up the stability difference between Post-LN and Pre-LN. We observe that 18-layer Post-LN training still fails after extending the warmup phase from 8 thousand updates to 16, 24, and 32 thousand. This shows that learning rate warmup alone cannot neutralize the instability of Post-LN. Intuitively, massive output shifts not only require a small learning rate but also unsmooth the loss surface (Li et al., 2018) and make training ill-conditioned. Admin regularizes the model behavior at initialization and stabilizes training. To explore whether Admin is able to stabilize training alone, we remove the warmup phase and conduct a grid search over optimizer hyper-parameters. The results are visualized in Figure 10. They show that, while Post-LN is more sensitive to the choice of hyper-parameters, Admin successfully stabilizes the training without hurting its potential.

Comparing to Other Initializations
We compare our method with three initialization methods, i.e., ReZero (Bachlechner et al., 2020), FixUp (Zhang et al., 2019a), and LookLinear (Balduzzi et al., 2017a). Specifically, we first conduct experiments with 18-layer Transformers on the WMT'14 De-En dataset. In our experiments, we observe that ReZero (which does not contain layer normalization), FixUp (which also does not contain layer normalization), and LookLinear (which is incorporated with Post-LN) all lead to divergent training. With further analysis, we find that half-precision training and dropout can destabilize FixUp and ReZero, due to the lack of layer normalization. At the same time, we find that even for shallow networks, an overly small reliance on residual branches hurts model performance, which also supports our intuition. For example, as elaborated in Appendix E, applying ReZero to Transformer-small leads to a 1-2 BLEU score drop on the IWSLT'14 De-En dataset.

Related Work
Transformer. Transformer (Vaswani et al., 2017) has led to a series of breakthroughs in various domains (Devlin et al., 2019; Velickovic et al., 2018; Huang et al., 2019; Parmar et al., 2018; Ramachandran et al., 2019). Liu et al. (2020a) show that, compared to other architectures, removing the warmup phase is more damaging for Transformers, especially Post-LN. Similarly, it has been found that the original Transformer (referred to as Post-LN) is less robust than its Pre-LN variant (Baevski and Auli, 2019; Nguyen and Salazar, 2019; Wang et al., 2019). Our studies go beyond the existing literature on gradient vanishing (Xiong et al., 2019) and identify an essential factor influencing Transformer training greatly.
Deep Network Initialization. It has been observed that deeper networks can lead to better performance. For example, Dong et al. (2020) find that the network depth plays a similar role to the sample number in numerical ODE solvers, which hinders the system from getting more precise results. Many attempts have been made to clear obstacles for training deep networks, including various initialization methods. Based on the independence among initialized parameters, one method is derived and found useful for handling gradient vanishing (Glorot and Bengio, 2010). Similar methods are further developed for ReLU networks (He et al., 2015). He et al. (2016) find that deep network training is still hard even after addressing the gradient vanishing issue, and propose residual networks. Balduzzi et al. (2017b) identify the shattered gradient issue and propose LookLinear initialization.
On the other hand, although it is observed that scaling residual outputs to smaller values helps to stabilize training (Hanin and Rolnick, 2018;Mishkin and Matas, 2015;Zhang et al., 2019a;Bachlechner et al., 2020;Goyal et al., 2017), there is no systematic analysis on what complicates Transformer training or its underlying connection to the dependency on residual branches.Here, we identify that unbalanced gradients are not the direct cause of the Post-LN instability, recognize the amplification effect, and propose a novel adaptive initialization method.

Conclusion
In this paper, we study the difficulties of training Transformers in theoretical and empirical manners. Our study in Section 3 suggests that the gradient vanishing problem is not the root cause of unstable Transformer training. Also, the unbalanced gradient distribution issue is mostly addressed by adaptive optimizers. In Section 4, we reveal the root cause of the instability to be the strong dependency on residual branches, which amplifies the fluctuation caused by parameter changes and destabilizes model training. In light of our analysis, we propose Admin, an adaptive initialization method to stabilize Transformer training. It controls the dependency at the beginning of training and maintains the flexibility to capture those dependencies once training stabilizes. Extensive experiments verify our intuitions and show that, without introducing additional hyper-parameters, Admin achieves more stable training, faster convergence, and better performance.
Our work opens up new possibilities to not only further push the state of the art but also better understand deep network training. It leads to many interesting future directions, including generalizing Theorem 2 to other models, designing new algorithms to automatically adapt deep networks to different training configurations, upgrading the Transformer architecture, and applying our proposed Admin to conduct training at a larger scale.

Appendices A Gradients at Initialization
Here, we first reveal that Pre-LN does not suffer from gradient vanishing. Then we establish that only the Post-LN decoder suffers from gradient vanishing, but not the Post-LN encoder. For simplicity, we use ∆x to denote gradients, i.e., ∆x = ∂L/∂x, where L is the training objective. Following previous studies (Bengio et al., 1994; Glorot and Bengio, 2010; He et al., 2015; Saxe et al., 2013), we analyze the gradient distribution at the very beginning of training and assume that the randomly initialized parameters and the partial derivatives with regard to module inputs are independent.

A.1 Pre-LN Analysis
For Pre-LN encoders, we have x_i = x_{i-1} + f_i(f_LN(x_{i-1})), and thus ∆x_{i-1} = ∆x_i + ∆x_i ∂f_i(f_LN(x_{i-1}))/∂x_{i-1}. At initialization, the two terms on the right are approximately independent and the expectation of the second term is zero, thus Var[∆x_{i-1}] ≥ Var[∆x_i]. Applying the same analysis to Pre-LN decoders, we get ∀i ≤ j, Var[∆x_i] ≥ Var[∆x_j]. Thus, lower layers have larger gradients than higher layers, and gradients do not vanish in the backpropagation.
A.2 Post-LN Encoder Analysis
Different from Pre-LN, the sub-layer outputs of Post-LN are associated with not only the residual connection but also the layer normalization, which makes it harder to establish the connection between their gradients. After making assumptions on the model initialization, we find that lower layers in Post-LN encoders also have larger gradients than higher layers, and gradients do not vanish in the backpropagation through the encoder.
THEOREM 1. For Post-LN encoders, if γ and ν in the Layer Norm are initialized as 1 and 0 respectively; all other parameters are initialized by symmetric distributions with zero mean; x^(oe)_i and ∆x^(oe)_i are subject to symmetric distributions with zero mean; the variance of x^(oe)_i is 1 (i.e., normalized by Layer Norm); and ∆x^(oe)_i and the derivatives of modules in the i-th sub-layer are independent, then ∀i ≤ j, Var[∆x^(oe)_i] ≥ Var[∆x^(oe)_j].
Proof. We first prove Var[∆x^(oe)_{2i-1}] ≥ Var[∆x^(oe)_{2i}], i.e., the backpropagation through FFN sub-layers does not suffer from gradient vanishing. In Post-LN encoders, the output of FFN sub-layers is calculated as x^(oe)_{2i} = f_LN(b_{2i}), where b_{2i} = x^(oe)_{2i-1} + max(0, x^(oe)_{2i-1} W^(1)) W^(2). Since at initialization W^(1) and W^(2) are independently randomized by symmetric distributions, we have E[b_{2i}] = 0. Following He et al. (2015), and since x^(oe)_{2i-1} is the output of a layer norm with Var[x^(oe)_{2i-1}] = 1, we can bound Var[b_{2i}]. Assuming different terms are also independent in the backpropagation and applying the derivative analysis of He et al. (2015) at initialization, and combining Equation 1 with Equation 2, we obtain Var[∆x^(oe)_{2i-1}] ≥ Var[∆x^(oe)_{2i}], which shows the backpropagation through FFN sub-layers does not suffer from gradient vanishing. Now we proceed to prove that Var[∆x^(oe)_{2i-2}] ≥ Var[∆x^(oe)_{2i-1}], i.e., the backpropagation through Self-Attention sub-layers does not suffer from gradient vanishing either. In Post-LN encoders, the output of Self-Attention sub-layers is calculated as x^(oe)_{2i-1} = f_LN(b_{2i-1}); since W^(V1) and W^(V2) are independently randomized by symmetric distributions, we have E[b_{2i-1}] = 0.
Similar to He et al. (2015), we can derive the variance of b_{2i-1} in the forward pass and the variance of its gradient in the backpropagation. At initialization, we assume ∆x^(oe)_{2i-1} and the model parameters are independent (He et al., 2015). Although the gradient distribution is unbalanced (e.g., W^(V1) and W^(V2) have larger gradients than W^(K) and W^(Q)), adaptive optimizers lead to consistent update magnitudes for different parameters.
Therefore, integrating Equation 4 with Equation 5, and combining Equation 3 with Equation 6, we obtain the desired result: ∀i ≤ j, Var[∆x^(oe)_i] ≥ Var[∆x^(oe)_j].

A.3 Post-LN Decoder Analysis
In Post-LN, the Encoder-Attention sub-layer suffers from gradient vanishing. The Encoder-Attention sub-layer calculates its output as x^(od)_{3i-1} = f_LN(b_{3i-1}), where b_{3i-1} = x^(od)_{3i-2} + f_ATT(x^(od)_{3i-2}, x^(oe), x^(oe)). Here x^(oe) is the encoder output and f_s is the row-wise softmax function. In the backpropagation, the gradient variance shrinks through this sub-layer, and thus those backpropagations suffer from gradient vanishing. This observation is further verified in Figure 3: the encoder-attention bars (gradients of encoder-attention outputs) are always shorter than the self-attention bars (gradients of encoder-attention inputs), while adjacent self-attention and fully connected bars usually have the same length.

A.4 Distributions of Unbalanced Gradients
As in Figure 5 and Figure 11, the gradient distribution of Attention modules is unbalanced even for Pre-LN. Specifically, parameters within the softmax function (i.e., $W^{(K)}$ and $W^{(Q)}$) suffer from gradient vanishing (i.e., $\frac{\partial f_s(x_0, \cdots, x_i, \cdots)}{\partial x_i} \le 1$) and have smaller gradients than other parameters. With further analysis, we find it is hard to neutralize the gradient vanishing of softmax. Unlike conventional non-linear functions like ReLU or sigmoid, softmax has a dynamic input length (i.e., for sentences with different lengths, the inputs of softmax have different dimensions). Although this setting allows Attention modules to handle sequential inputs, it prevents them from having stable and consistent backpropagation. Specifically, let us compare softmax with sigmoid. Although the derivative of the sigmoid function is smaller than 1, this damping effect is consistent for all inputs; thus, it can be neutralized by a larger initialization (Glorot and Bengio, 2010). For softmax, the damping effect differs across inputs and cannot be neutralized by a static initialization. Also, we observe that adaptive optimizers largely address this issue. Specifically, we calculate the norm of the parameter change in consecutive epochs (e.g., $|W^{(t+1)} - W^{(t)}|$, where $W^{(t)}$ is the checkpoint saved after $t$ epochs) and visualize the relative norm (scaled by the largest value in the same network) in Figure 11. Comparing the relative norms of parameter gradients and parameter updates, we notice that, although the gradient distribution is unbalanced, adaptive optimizers successfully assign different learning rates to different parameters and lead to consistent update magnitudes. This result explains why vanilla SGD fails to train Transformers (i.e., it lacks the ability to handle unbalanced gradient distributions). Besides, it implies that the unbalanced gradient distribution (e.g., gradient vanishing) has been mostly addressed by adaptive optimizers and may not significantly impact the training instability.
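The update-equalizing behavior of adaptive optimizers can be illustrated with a toy numpy sketch (our own illustration, not the paper's experiment): two scalar parameters receive constant gradients that differ by four orders of magnitude, yet Adam's per-parameter normalization by $\sqrt{v}$ yields nearly identical update magnitudes.

```python
import numpy as np

def adam_update(grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step for a scalar parameter; returns (update, m, v)."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)  # bias-corrected second moment
    return lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Two parameters with wildly unbalanced gradient scales, mimicking
# the softmax parameters vs. other attention parameters.
grads = {"softmax_param": 1e-4, "other_param": 1.0}
updates = {}
for name, g in grads.items():
    m = v = 0.0
    for t in range(1, 101):  # constant gradient, for simplicity
        upd, m, v = adam_update(g, m, v, t)
    updates[name] = upd

# Despite a 10^4 gap in gradient magnitude, the update magnitudes
# are nearly identical: Adam rescales each parameter by sqrt(v_hat).
print(updates)
```

Under a constant gradient $g$, the bias-corrected moments converge to $g$ and $g^2$, so the update approaches $\mathrm{lr} \cdot g / |g|$ regardless of the gradient's scale, which is exactly why SGD (whose update is proportional to $g$ itself) struggles here.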

B Proof of Theorem 2
Here, we elaborate on the derivation of Theorem 2, which establishes the relationship between the layer number and the output fluctuation brought by a parameter change.
THEOREM 2. Consider an $N$-layer Transformer $x = \mathcal{F}(x_0, W)$, where $x_0$ is the input and $W$ is the parameter. If the layer dependency stays the same after a parameter change (i.e., $\beta_{i,j}$ has the same value after changing $W$ to $W^*$, where $W$ is randomly initialized and $\delta = W^* - W$ is independent of $W$), the output change (i.e., $\mathrm{Var}[\mathcal{F}(x_0, W) - \mathcal{F}(x_0, W^*)]$) can be estimated as $\sum_{i=1}^{N} \beta_{i,i}^2 C$, where $C$ is a constant.

Proof. We refer to the module in the $i$-th sub-layer as $a_i = \mathcal{G}_i(x_{i-1}, W_i)$, where $x_i = \sum_{j \le i} \beta_{i,j} \hat{a}_j$ is the normalized residual output and $\hat{a}_i = \frac{a_i}{\sqrt{\mathrm{Var}[a_i]}}$ is the normalized module output. The final output is marked as $x = \mathcal{F}(x_0, W) = \sum_{j \le N} \beta_{N,j} \hat{a}_j$. To simplify the notation, we use the superscript $*$ to indicate variables related to $W^*$, e.g., $x^* = \mathcal{F}(x_0, W^*)$. At initialization, all parameters are initialized independently. Thus $\forall i \ne j$, $\hat{a}_i$ and $\hat{a}_j$ are independent and $1 = \mathrm{Var}[\sum_{j \le i} \beta_{i,j} \hat{a}_j] = \sum_{j \le i} \beta_{i,j}^2$. Also, since the $k$-th and $(k+1)$-th layers share the residual connections to previous layers, $\forall i, j \le k$ the ratio $\beta_{k,i} / \beta_{k,j}$ is preserved in $\beta_{k+1,i} / \beta_{k+1,j}$. Now, we proceed to analyze $\mathrm{Var}[\hat{a}_i - \hat{a}_i^*]$. Specifically, $\mathrm{Var}[\hat{a}_i - \hat{a}_i^*]$ should have the same value for all layers, thus we use a constant $C$ to refer to its value (since the sub-layers of Transformers mostly use linear weights with ReLU nonlinearity). Thus, we can rewrite Equation 8 and get $\mathrm{Var}[\mathcal{F}(x_0, W) - \mathcal{F}(x_0, W^*)] = \sum_{i=1}^{N} \beta_{i,i}^2 C$.

C Admin Implementation Details
As introduced in Section 4.3, we introduce a new set of parameters to rescale the module outputs. Specifically, we refer to these new parameters as $\omega$ and construct the Post-LN sub-layer as $x_i = f_{\mathrm{LN}}(b_i)$ with $b_i = \omega_i \odot x_{i-1} + \mathcal{G}_i(x_{i-1}, W_i)$, where $\odot$ is the element-wise product.
After training, Admin can be reparameterized as the conventional Post-LN structure (i.e., removing $\omega_i$). Specifically, we consider $x_i = \frac{b_i}{\sigma_b} \odot \gamma + \nu$. For feedforward sub-layers, self-attention sub-layers, and encoder-attention sub-layers alike, the rescaling can be absorbed by changing $\gamma$, $\nu$, and the corresponding weight matrices, and it is easy to verify that $b'_i = b_i$ in all three situations. From this analysis, it is easy to see that introducing the additional parameter $\omega_i$ is equivalent to rescaling some model parameters. In our experiments on IWSLT'14 De-En, we find that directly rescaling the initialization parameters achieves roughly the same performance as introducing $\omega_i$. However, the former is not very stable when training in a half-precision manner. Accordingly, we choose to add the new parameters $\omega_i$ instead of rescaling parameters.
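As an illustration only, here is a minimal numpy sketch of an Admin-style sub-layer under our reading of the construction above (this is not the released implementation; the toy `module` stands in for an attention or feedforward block, and all names are our own):

```python
import numpy as np

def layer_norm(x, gamma, nu, eps=1e-5):
    """Standard layer normalization over the last dimension."""
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps) * gamma + nu

def admin_sublayer(x, module, omega, gamma, nu):
    """One Admin-style Post-LN sub-layer:
    b_i = omega ⊙ x_{i-1} + G_i(x_{i-1});  x_i = LN(b_i)."""
    return layer_norm(omega * x + module(x), gamma, nu)

# Toy module: a linear map standing in for attention/feedforward.
rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d)) / np.sqrt(d)
module = lambda x: x @ W

x = rng.standard_normal(d)
omega = np.ones(d)            # omega = 1 recovers the vanilla Post-LN sub-layer
gamma, nu = np.ones(d), np.zeros(d)
out = admin_sublayer(x, module, omega, gamma, nu)
print(out.shape)  # (8,)
```

With $\gamma = 1$ and $\nu = 0$, the output is zero-mean by construction; initializing $\omega_i$ to larger values increases the sub-layer's dependency on the residual branch, which is the knob Admin tunes at initialization.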

D Experimental Setup
Our experiments are based on the implementation from the fairseq package (Ott et al., 2019). As to pre-processing, we follow the publicly released scripts from previous work (Ott et al., 2019; Lu et al., 2020). For the WMT'14 datasets, evaluations are conducted on the provided 'newstest14' file, and more details can be found in Bojar et al. (2014). For the IWSLT'14 De-En dataset, more analysis and details can be found in Cettolo et al. (2014). As to model specifics, we directly adopt the Transformer-small configuration on the IWSLT'14 De-En dataset and stack more layers over the Transformer-base model on the WMT'14 En-De and WMT'14 En-Fr datasets. Specifically, on the IWSLT'14 De-En dataset, we use word embeddings with 512 dimensions and a 6-layer encoder/decoder with 4 heads and 1024 feedforward dimensions; on the WMT'14 En-De and WMT'14 En-Fr datasets, we use word embeddings with 512 dimensions and an 8-head encoder/decoder with 2048 hidden dimensions. Label-smoothed cross entropy is used as the objective function with an uncertainty of 0.1 (Szegedy et al., 2016).
For model training, we use RAdam as the optimizer (Liu et al., 2020a) and adopt almost all hyperparameter settings from Lu et al. (2020). Specifically, for the WMT'14 En-De and WMT'14 En-Fr datasets, all dropout ratios (including activation dropout and attention dropout) are set to 0.1. For the IWSLT'14 De-En dataset, the after-layer dropout is set to 0.3, and a weight decay of 0.0001 is used. As to the optimizer, we set $(\beta_1, \beta_2) = (0.9, 0.98)$ and use the inverse-sqrt learning rate scheduler with a warmup phase (8000 steps on the WMT'14 En-De/Fr datasets, and 6000 steps on the IWSLT'14 De-En dataset). The maximum learning rate is set to 1e-3 on the WMT'14 En-De dataset and 7e-4 on the IWSLT'14 De-En and WMT'14 En-Fr datasets. We conduct training for 100 epochs on the WMT'14 En-De dataset, 90 epochs on the IWSLT'14 De-En dataset, and 50 epochs on the WMT'14 En-Fr dataset; the last 10 checkpoints are averaged before inference.
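The inverse-sqrt schedule with linear warmup can be sketched as follows (a minimal illustration; the exact fairseq implementation may differ in details). The defaults shown are the IWSLT'14 De-En settings from the text (7e-4 peak rate, 6000 warmup steps):

```python
import math

def inverse_sqrt_lr(step, max_lr=7e-4, warmup=6000):
    """Inverse-sqrt schedule with linear warmup: the learning rate
    rises linearly to max_lr over `warmup` steps, then decays
    proportionally to 1/sqrt(step)."""
    if step < warmup:
        return max_lr * step / warmup
    return max_lr * math.sqrt(warmup / step)

# The rate peaks exactly at the end of warmup, then decays:
print(inverse_sqrt_lr(6000))   # 0.0007
print(inverse_sqrt_lr(24000))  # 0.00035 (4x the steps -> half the rate)
```

Note the decay branch: quadrupling the step count halves the learning rate, so the rate falls slowly enough for long training runs.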
On the IWSLT'14 De-En dataset, we conduct training on one NVIDIA GeForce GTX 1080 Ti GPU and set the maximum batch size to 4096. On the WMT'14 En-De dataset, we conduct training on four NVIDIA Quadro RTX 8000 GPUs and set the maximum batch size (per GPU) to 8196. On the WMT'14 En-Fr dataset, we conduct training on an NVIDIA DGX-2 server (6L-6L uses 4 NVIDIA Tesla V100 GPUs and 60L-12L uses 16 NVIDIA Tesla V100 GPUs) and set the maximum batch size (per GPU) to 8000 for 6L-6L and 5000 for 60L-12L. On the IWSLT'14 De-En dataset, Transformer-small models (with 37M parameters) take a few hours to train. On the WMT'14 En-De dataset, 6L-6L models (with 63M parameters) take ∼1 day to train, 12L-12L models (with 107M parameters) take ∼2 days, and 18L-18L models (with 151M parameters) take ∼3 days. On the WMT'14 En-Fr dataset, 6L-6L models (with 67M parameters) take ∼2 days to train, and 60L-12L models (with 262M parameters) take ∼2.5 days. All training is conducted in half precision with dynamic scaling (with a 256-update scaling window and a 0.03125 minimal scale). All our implementations and pre-trained models will be released publicly.
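Dynamic loss scaling with a 256-update window and a 0.03125 minimal scale can be sketched as follows (assumed behavior for illustration; the exact fairseq scaler may differ in its growth/backoff factors):

```python
class DynamicLossScaler:
    """Minimal dynamic loss-scaling sketch: halve the scale on an
    overflow (clamped at min_scale), and double it after a full
    window of consecutive overflow-free updates."""
    def __init__(self, init_scale=128.0, window=256, min_scale=0.03125):
        self.scale = init_scale
        self.window = window
        self.min_scale = min_scale
        self._good_steps = 0

    def update(self, overflow):
        if overflow:
            self.scale = max(self.scale / 2.0, self.min_scale)
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps == self.window:
                self.scale *= 2.0
                self._good_steps = 0

scaler = DynamicLossScaler()
scaler.update(overflow=True)       # scale drops to 64
for _ in range(256):
    scaler.update(overflow=False)  # a clean window doubles it back
print(scaler.scale)  # 128.0
```

The minimal scale prevents the scaler from collapsing to zero during an unstable stretch of training, while the window keeps the scale near the largest value that fits in half-precision range.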

E Comparison to ReZero
Here, we first conduct comparisons with ReZero (Bachlechner et al., 2020) under two configurations: the first employs the original ReZero model, and the second adds layer normalization in a Post-LN manner. As summarized in Table 3, the ReZero initialization leads to a performance drop, regardless of whether layer normalization is used. This verifies our intuition that an overly small dependency restricts the model potential. At the same time, we find that adding layer normalization to ReZero helps to improve performance. Intuitively, as dropout plays a vital role in regularizing Transformers, layer normalization helps not only to stabilize training but also to alleviate the impact of turning off dropout during inference.

F Performance on the WMT'14 En-Fr
To explore the potential of Admin, we conduct experiments with 72-layer Transformers on the WMT'14 En-Fr dataset (with a 60-layer encoder and a 12-layer decoder; we add fewer layers to the decoder to encourage the model to rely more on the source context).
As in Table 4, Admin (60L-12L) achieves a BLEU score of 43.80, a new state of the art on this long-standing benchmark. This model has a 60-layer encoder and a 12-layer decoder, significantly deeper than the other baselines. Still, since the number of parameters grows quadratically with the hidden dimension but only linearly with the number of layers, our model has roughly the same number of parameters as the other baselines. It is worth mentioning that Admin even achieves better performance than all variants of the pre-trained T5 models, which demonstrates the great potential of our proposed method. Also, Admin outperforms Pre-LN (60L-12L), which further verifies that the Pre-LN architecture restricts deep models' potential.
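The quadratic-in-width, linear-in-depth scaling can be checked with a back-of-the-envelope parameter count (our own simplification, not the paper's accounting: it ignores embeddings, biases, and layer norms, and the constants follow the standard Transformer layer layout):

```python
def transformer_params(d_model, n_enc, n_dec, d_ff):
    """Rough parameter count for a Transformer encoder-decoder.
    Per layer: self-attention 4*d^2 (Q, K, V, output projections),
    an extra 4*d^2 of cross-attention in decoder layers, and a
    feedforward block of 2 * d * d_ff."""
    enc_layer = 4 * d_model**2 + 2 * d_model * d_ff
    dec_layer = 8 * d_model**2 + 2 * d_model * d_ff
    return n_enc * enc_layer + n_dec * dec_layer

base = transformer_params(512, 6, 6, 2048)     # Transformer-base-like
deep = transformer_params(512, 60, 12, 2048)   # our 60L-12L configuration
wide = transformer_params(1024, 6, 6, 4096)    # doubled width instead

# Depth grows the count roughly linearly; width grows it quadratically:
print(deep / base)   # ~5.4x for 10x encoder / 2x decoder depth
print(wide / base)   # 4.0x for 2x width
```

This is why a 60L-12L model at width 512 stays in the same parameter range as much shallower but wider baselines.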

Figure 1: Lacking enough robustness and stability, the 18-layer Post-LN Transformer training (i.e., the original architecture) diverges and is omitted in the left graph. Admin not only stabilizes model training but also unleashes the model's potential for better performance.
Figure 2: The architecture and notations of Pre-LN Transformers (left) and Post-LN Transformers (right).

Figure 6: The major difference between Pre-LN and Post-LN is the position of layer norms.
We compute $|\mathcal{F} - \mathcal{F}^*|_2^2$ for Pre-LN and Post-LN and visualize the results in Figure 4. (If $a_0$ and $a_1$ are independent, $\mathrm{Var}[a_0 + a_1] = \mathrm{Var}[a_0] + \mathrm{Var}[a_1]$; also, in our experiments $\mathrm{Var}[x_i]$ increases as $i$ becomes larger.) In Corollary 2, $N$ is linearly associated with $|\mathcal{F} - \mathcal{F}^*|_2^2$ for Post-LN; in Corollary 1, $\log N$ is linearly associated with $|\mathcal{F} - \mathcal{F}^*|_2^2$ for Pre-LN.

Figure 9: Development PPL on the WMT'14 En-De dataset and the IWSLT'14 De-En dataset.

Figure 10: BLEU scores of Post-LN, Pre-LN, and Admin on the IWSLT'14 De-En dataset (the x-axis is $\beta_2$ for adaptive optimizers and the y-axis is the learning rate). Pre-LN converges in all settings, while Post-LN diverges in 7 out of 15 settings. When Post-LN converges, it outperforms Pre-LN in 7 out of 8 settings. Admin stabilizes Post-LN training and outperforms Pre-LN (its best performance is comparable with Post-LN).

Figure 11: Relative norm of the gradient ($\Delta W_i$, where $W_i$ is the checkpoint of the $i$-th epoch) and the update ($|W_{i+1} - W_i|$) of self-attention parameters in 12-layer Pre-LN.
Figure 3: Relative gradient norm histogram (on a log scale) of 18-layer Transformers on the WMT'14 En-De dataset, i.e., the gradient norm of sub-layer outputs, scaled by the largest gradient norm in the same network.

Table 4: Performance and model size on WMT'14 En-Fr (AL-BL refers to an A-layer encoder & B-layer decoder).