Rethinking the Value of Transformer Components

Transformer has become the state-of-the-art translation model, yet how each intermediate component contributes to model performance is not well studied, which poses significant challenges for designing optimal architectures. In this work, we bridge this gap by evaluating the impact of individual components (sub-layers) in trained Transformer models from different perspectives. Experimental results across language pairs, training strategies, and model capacities show that certain components are consistently more important than others. We also report a number of interesting findings that might help practitioners better analyze, understand and improve Transformer models. Based on these observations, we further propose a new training strategy that improves translation performance by distinguishing the unimportant components during training.


Introduction
Transformer (Vaswani et al., 2017) has achieved state-of-the-art performance on a variety of translation tasks. It consists of stacked components of different types, including self-attention, encoder-attention, and feed-forward layers. However, so far not much is known about the internal properties and functionalities it learns to achieve this performance, which poses significant challenges for designing optimal architectures.
In this work, we bridge the gap by conducting a granular analysis of components in trained Transformer models. We attempt to understand how each component contributes to the model outputs. Specifically, we explore two metrics to evaluate the impact of a particular component on model performance: 1) contribution in information flow, which masks one component at a time and evaluates the performance without it; and 2) criticality in representation generalization, which measures how far a component's weights can be moved back toward their initial values while still maintaining performance. These two metrics evaluate the component importance of a trained Transformer model from different perspectives. Empirical results on two benchmarking datasets reveal the following observations (§3.1):
• The decoder self-attention layers are least important, and the decoder feed-forward layers are most important.
• The components that are closer to the model input and output (e.g., lower layers of encoder and higher layers of decoder) are more important than components on other layers.
• Upper encoder-attention layers in the decoder are more important than lower encoder-attention layers.
The findings are consistent across different evaluation metrics, translation datasets, initialization seeds and model capacities, demonstrating their robustness.
We further analyze the underlying reasons (§3.2), and find that a lower dropout ratio and more training data lead to fewer unimportant components. Unimportant components can be identified at an early stage of training, and their existence is not due to deficient training. Finally, we show that unimportant components can be rewound (Frankle and Carbin, 2019) to further improve the translation performance of Transformer models (§3.3).

Methodology
In this section, we evaluate the importance of individual Transformer components from two different perspectives: contribution in information flow and criticality in representation generalization. The ultimate goal of machine translation is to fully transform the information from the source side to the target side. It is therefore essential to understand how information flows from the input, across the encoder and the decoder, to the output. Figure 1 shows an example to illustrate how information flows across a basic Transformer component (i.e., a residual sub-layer).

Contribution in Information Flow
We first try to understand how each sub-layer contributes to the information flow from input to output. To understand the contribution of a particular component, we investigate the effect of masking it. Following Michel et al. (2019), we manually ablate each component (i.e., replace its output with zeros) from a trained Transformer and evaluate the performance of the resulting masked model. A component is important if the performance without it is significantly worse than the full model's; otherwise it is redundant given the rest of the model.
Formally, we define the contribution score of the n-th component as
$$\mathrm{Cont}_n = \frac{\hat{M}_n}{M}, \qquad \hat{M}_n = \min\big(\max(M_n, 0),\, C\big), \qquad M = \max_m \hat{M}_m,$$
where $M_n$ is the BLEU drop from ablating the n-th component. The drop is first clipped to the range $[0, C]$ to avoid negative importance values and exploding drops, and then normalized to $[0, 1]$ by dividing by the maximum clipped drop $M$. In this study, we set the constant $C$ to 10% of the BLEU score of the baseline model.
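The clipping and normalization above can be sketched in a few lines (a minimal sketch; in practice the BLEU drops would come from decoding the test set with each masked model):

```python
def contribution_scores(bleu_drops, baseline_bleu):
    """Contribution score: clip each component's BLEU drop to [0, C]
    (C = 10% of the baseline BLEU), then divide by the maximum clipped drop."""
    C = 0.1 * baseline_bleu
    clipped = [min(max(d, 0.0), C) for d in bleu_drops]
    M = max(clipped)
    # If no ablation hurts performance at all, every score is 0.
    return [m / M for m in clipped] if M > 0 else [0.0] * len(clipped)
```

For example, with a 27.5-BLEU baseline, a component whose ablation costs 3 BLEU saturates at C = 2.75 and receives a score of 1.0, while a component whose ablation improves BLEU is clipped to 0.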

Criticality in Representation Generalization
Zhang et al. (2019) reported the module criticality phenomenon, in which modules of a network exhibit different robustness to parameter perturbation (e.g., rewinding back to the initialization values). Note that rewinding to the initial values is a relaxation of setting them to zero, since the Transformer is initialized with zero-mean Xavier initialization. A module is critical if rewinding its weights to the initialization harms the network performance; otherwise it is non-critical in the full network. Chatterji et al. (2020) theoretically formulated this phenomenon and revealed that the criticality metric is reflective of network generalization. Specifically, they used a convex combination of the initial and final weights of a module to define an optimization path to traverse, and quantitatively defined module criticality by how much closer the weights can get to the initial weights along this path while still maintaining performance. Figure 2 shows an example. The metric measures how much the performance of a model relies on a specific module.
Formally, for the n-th component, let $\theta_n^{\alpha_n} = (1-\alpha_n)\theta_n^0 + \alpha_n \theta_n^f$ with $\alpha_n \in [0, 1]$ be the convex combination of the initial weights $\theta_n^0$ and the final weights $\theta_n^f$. We define the criticality score of the n-th component as
$$\mathrm{Crit}_n = \min\big\{\alpha_n \in [0,1] : \mathrm{BLEU}(\theta;\, \theta_n \leftarrow \theta_n^{\alpha_n}) \geq \mathrm{BLEU}_{\mathrm{full}} - \epsilon\big\}.$$
In other words, the criticality score is the minimum $\alpha$ that keeps the performance drop within a threshold $\epsilon$. A small criticality score means the weights of the n-th component can be moved a long way back toward initialization without hurting model performance. In this study, we set $\epsilon$ to 0.5 BLEU points, which generally indicates a significant drop in translation performance on the benchmarking datasets.
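The search for the minimum α can be sketched as a simple grid over the convex path (a sketch; the hypothetical `eval_bleu(alpha)` callback stands in for decoding the test set with the interpolated weights):

```python
def criticality_score(eval_bleu, full_bleu, epsilon=0.5, n_steps=20):
    """Smallest alpha in [0, 1] such that the model with this component's
    weights set to (1 - alpha) * theta_init + alpha * theta_final stays
    within epsilon BLEU of the full model."""
    for i in range(n_steps + 1):
        alpha = i / n_steps
        if eval_bleu(alpha) >= full_bleu - epsilon:
            return alpha
    return 1.0  # the component cannot be rewound at all
```

A finer grid (larger `n_steps`) trades decoding cost for a tighter estimate of the minimum α.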
Although both metrics evaluate component importance in terms of its effect on model performance, there are considerable differences: the contribution score measures the effect of fully ablating a component (i.e., a hard metric), while the criticality score measures how far a component can be rewound while maintaining performance (i.e., a soft metric).

Experiments

Data and Setup
We conducted experiments on the benchmarking WMT2014 English-German (En-De) and English-French (En-Fr) translation datasets, which consist of 4.6M and 35.5M sentence pairs respectively. We employed BPE (Sennrich et al., 2016) with 32K merge operations for both language pairs, and used case-sensitive 4-gram NIST BLEU score (Papineni et al., 2002) as our evaluation metric.
Unless otherwise stated, the Transformer model consists of a 6-layer encoder and a 6-layer decoder. The layer size is 512, the size of the feed-forward sub-layer is 2048, and the number of attention heads is 8. We followed the settings in Vaswani et al. (2017) to train the Transformer models on the En-De and En-Fr datasets. We set the dropout ratio to 0.1 and the initialization seed to 1 for all Transformer models.

Observing Component Importance
In this section, we first measure the component importance of trained Transformer models. We then vary several settings that are potential threats to validity, to verify the consistency of our findings.
Several observations on component importance. Figure 3 shows the importance of Transformer components measured by the two metrics. The two importance metrics agree well with each other and reveal several observations in common:
• In general, the decoder self-attention layers ("D:SA") are least important, and the decoder feed-forward layers ("D:FF") are most important.
• Lower components in the encoder (e.g., "E:SA" and "E:FF") and higher components in the decoder (e.g., "D:EA" and "D:FF") are more important. This is intuitive, since these components are closer to the input and output sequences, and are thus more important for input understanding and output generation.
Figure 3: Importance of individual components measured by (a, b) contribution in information flow and (c, d) criticality in representation generalization. The Y-axis is the layer id and the X-axis is the type of component. "E", "D", "SA", "EA" and "FF" denote encoder, decoder, self-attention, encoder-attention and feed-forward layer, respectively. Darker cells denote more important components.
• Higher encoder-attention ("D:EA") layers in the decoder are more important than lower encoder-attention layers. This is consistent with Voita et al. (2019), who claim that the lower part of the decoder behaves more like a language model. For the other component types, the bottom and top layers are more important than the intermediate layers.
We notice that the main difference between the results of the two metrics lies in the bottom feed-forward layers of the decoder, whose contribution score is high but whose criticality score is low. This is because performance is poor at small α but recovers dramatically once α grows slightly, so the contribution is high while the criticality is relatively low, according to the definitions in Section 2.
In the following experiments, we discuss the threats to validity that could affect our findings. Unless otherwise stated, we use the contribution score as the default importance metric and report results on the En-De dataset.
Consistency across different initialization seeds and model capacities. As aforementioned, we evaluate component importance based on a trained NMT model, which can be influenced by various hyper-parameters. We identify two hyper-parameters that have been reported to significantly influence model performance:
• Initialization Seed: Recent work has shown that neural models are very sensitive to initialization seeds: even with the same hyper-parameter values, distinct random seeds can lead to substantially different results (Dodge et al., 2020).
• Model Capacity: Depth and width are two key aspects in the design of a neural network architecture. Prior work has claimed that the depth of a network may determine the abstraction level, and that the width may influence the loss of information in the forward pass. Recent studies have also demonstrated the significant effect of varying depth and width (Vaswani et al., 2017) on NMT models.
Figure 4 shows the results of Transformer models with different initialization seeds and model capacities on the En-De dataset. Specifically, we used two other initialization seeds (i.e., "66" and "99"). For the model capacity setting, we used a deeper Transformer (i.e., 12 layers) and a wider Transformer (i.e., a layer size of 1024). Clearly, the above conclusions hold in all cases, demonstrating the robustness of our findings. In the following experiments, we use Transformer-base with initialization seed 1 as the default model.
Results on Transformer trained with structured dropout. The Transformer model is trained without being aware of the subsequent layer-wise ablation, which potentially affects the validity of our conclusions. In response to this concern, we followed Fan et al. (2020) and explored LayerDrop, a form of structured dropout, which has a regularization effect during training and allows for efficient pruning at inference time.
LayerDrop randomly drops entire components during training, which has the advantage of making the network robust to subsequent pruning. Figure 5 depicts the component importance of Transformer trained with LayerDrop, which reconfirms our claim that different components at different layers make distinct contributions to the model performance.
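The mechanism can be sketched as skipping whole residual sub-layers at random during training (a toy sketch with scalar activations; the `sublayers` list of residual functions is illustrative, not the paper's implementation):

```python
import random

def layerdrop_forward(x, sublayers, p_drop=0.2, training=True):
    """Skip each residual sub-layer with probability p_drop while training;
    the residual connection lets x pass through a dropped sub-layer unchanged."""
    for f in sublayers:
        if training and random.random() < p_drop:
            continue  # drop the entire sub-layer for this forward pass
        x = x + f(x)  # residual connection
    return x
```

At inference time (`training=False`) all sub-layers run; because the network was trained to tolerate missing sub-layers, entire components can later be pruned with little damage.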

Analyzing Unimportant Components
In this section, we delve into further analysis of the unimportant components. We first identify several factors that can affect the number of unimportant components. We then attempt to explain the existence of unimportant components via representation similarity analysis, learning dynamics analysis, and a layerwise isometry check.
Lower dropout ratio and more training data lead to fewer unimportant components. The training procedures of neural networks have evolved rapidly in recent years. In our experiments, we identify two factors that would affect the number of unimportant components:

• Dropout (Hinton et al., 2012): Dropout is a commonly used technique to avoid over-fitting by randomly dropping model weights with a specific probability. To maintain its functionality, a model trained with dropout tends to have a certain redundancy, which may explain our observation that some components can be pruned without degrading performance (i.e., unimportant components).
• Training Data Size: Larger-scale training data generally contains more patterns, which may require more components of the Transformer model to learn (i.e., important components).
Figure 6 shows the effect of the dropout ratio on component importance. We varied the dropout ratio in [0.0, 0.1, 0.3, 0.5] and trained Transformer models with each ratio from scratch on the En-De dataset. The BLEU scores are 25.58, 27.56, 27.43 and 25.72, respectively. Generally, the lower the dropout ratio, the fewer unimportant components the model has. One possible reason is that a higher dropout ratio makes the trained model have more redundant components that accomplish the same functionality, so more components can be pruned without degrading performance (i.e., unimportant components).
Figure 7 shows the effect of the training data size on component importance. We randomly sampled 5M, 10M, 15M and 20M examples from the En-Fr dataset and trained Transformer models on each subset. The BLEU scores are 39.67, 39.94, 40.44 and 40.71, respectively. As seen, the more training data, the more important components are required, which confirms our hypothesis.
In all cases, the lowermost E:SA and D:FF components, as well as the uppermost D:EA component, are identified as important, which is consistent with the findings in Section 3.1.
Representations of unimportant components are less similar to the output. We conducted a representation similarity analysis (Morcos et al., 2018) to measure the similarity between each component's output and the final output. Table 1 shows the similarity results. We averaged the similarity scores of the 7 most important layers (listed in Figure 9(b, c)) and the 7 most unimportant layers. The representations of unimportant components are less similar to the output layer representation than those of important components.
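As an illustration of this kind of representation comparison, the sketch below computes linear CKA between two representation matrices. Note this is a simplified stand-in: the paper's analysis follows the CCA-based method of Morcos et al. (2018), and the shapes here are illustrative.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices of shape
    (n_examples, dim): 1.0 means identical up to an orthogonal transform."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, ord='fro') ** 2
    norm_x = np.linalg.norm(X.T @ X, ord='fro')
    norm_y = np.linalg.norm(X := X, ord=None) if False else np.linalg.norm(Y.T @ Y, ord='fro')
    return hsic / (norm_x * norm_y)
```

A score near 1 would indicate a component whose output already resembles the final-layer representation; unimportant components would score lower against the output layer.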
Unimportant components can be identified at an early stage of training. Recent studies have revealed that unimportant weights in a dense model can be identified at an early stage of training (You et al., 2020). Lee et al. (2020) further claimed that the initialization values determine which weights are unimportant. Inspired by these findings, we try to answer the question: are unimportant components created to be unimportant? Figure 8 illustrates the learning dynamics of component importance on the En-De dataset. Although most of the important components can be identified at an early stage of training (e.g., epoch 3 or 4), they cannot be identified at initialization. This finding is also consistent with the similar sets of important components across different initialization seeds (Figure 4 (a, b)).
Unimportant components are not due to deficient training. One may suspect that a component fails to contribute to the model performance (i.e., is "unimportant") because it is not fully trained. Lee et al. (2020) claimed that when the initial weights are not chosen appropriately, the propagation of input signals through the network can be hindered. Saxe et al. (2014) introduced dynamical isometry to measure faithful signal propagation, in which signals propagate through a network isometrically with minimal amplification or attenuation. Lee et al. (2020) showed that a sufficient condition for faithful propagation is layerwise dynamical isometry, defined as the singular values of the layerwise Jacobians being concentrated around 1. This guarantees that the signal from layer n is propagated to layer n−1 (or vice versa) without amplification or attenuation in any of its dimensions, which leads to efficient updates of the parameters of the corresponding component. Table 2 lists the results of the layerwise dynamical isometry check. Each type of component has similar layerwise isometry values across layers, which cannot explain their different importance for model performance. This indicates that the existence of unimportant components is not due to deficient training (i.e., unfaithful signal propagation). The results on decoder self-attention components differ because of the attention masking.
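The check can be sketched as measuring how far the singular values of a layerwise Jacobian deviate from 1 (a sketch under the simplifying assumption of a linear layer, whose Jacobian is just its weight matrix):

```python
import numpy as np

def isometry_gap(jacobian):
    """Mean squared deviation of the Jacobian's singular values from 1;
    values near 0 indicate near-isometric (faithful) signal propagation."""
    s = np.linalg.svd(np.asarray(jacobian, dtype=float), compute_uv=False)
    return float(np.mean((s - 1.0) ** 2))
```

An orthogonal weight matrix yields a gap of 0 (perfect isometry), while scaling all weights by 2 amplifies every direction and yields a gap of 1.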

Distinguishing and Utilizing Group of Unimportant components
In our previous experiments, we observed the effect of ablating one sub-layer at a time, without considering what would happen if we ablated several components simultaneously. In this section, we first identify a group of unimportant components in a trained Transformer model, and then investigate how to exploit them to improve translation performance.
Identify a group of unimportant components in a trained model. We first followed Michel et al. (2019) to iteratively ablate multiple components from a trained Transformer model, and report the BLEU score of the ablated model (without retraining) in Figure 9(a). Although a few unimportant components (e.g., 3 or 4) can be ablated together without a performance drop, ablating more components significantly harms translation performance. These results reconfirm our analysis of component redundancy in Section 3.2: if two components A and B are redundant with each other, individually ablating either of them does not harm model performance, but ablating both of them may. Figures 9(b, c) list the identified groups of unimportant components in the Transformer models trained on the En-De and En-Fr datasets. Specifically, we ablated the 7 most unimportant components (i.e., 20% of the components), whose joint ablation harms the model performance least. In the following experiments, we utilize the unimportant components to improve translation performance with two strategies, namely component pruning and component rewinding.
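The iterative procedure can be sketched greedily: at each step, mask the remaining component whose additional removal hurts BLEU least (a sketch; the hypothetical `eval_bleu` callback would decode the test set with the given set of components masked):

```python
def greedy_group_ablation(components, eval_bleu, k):
    """Greedily grow a set of k masked components, each step picking the
    component whose additional ablation keeps BLEU highest."""
    masked = set()
    for _ in range(k):
        best = max((c for c in components if c not in masked),
                   key=lambda c: eval_bleu(masked | {c}))
        masked.add(best)
    return masked
```

This accounts for the redundancy interaction described above: a component is only added to the group if it is unimportant given everything already masked, not merely unimportant in isolation.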

Table 3: Translation performance of pruning unimportant components. "Shallow" denotes a 4-layer decoder model, which has a similar number of parameters to the pruned model. All models are trained from scratch with the same hyper-parameters.
Prune unimportant components and retrain the model. Since some of the components are consistently unimportant, we built a model without them and trained it from scratch (denoted as the pruned model). Table 3 lists the translation performance of the pruned model. Since the pruned unimportant components are all located on the decoder side, we also implemented a Transformer model with a shallower decoder, which has the same number of parameters as the pruned model. As seen, the pruned model achieves performance competitive with the standard Transformer and consistently outperforms the shallow model, demonstrating the reasonableness of the identified unimportant components.

Rewind unimportant components and fine-tune the model. We rewound the unimportant components to their initial values following Renda et al. (2020) and fine-tuned them together with the other trained components for a few more steps. For a fair comparison, we also fine-tuned the trained Transformer model for the same number of steps. As listed in Table 4, directly fine-tuning the Transformer model ("Continue") does not outperform the standard Transformer, while the rewind technique further improves translation performance.
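The rewinding step can be sketched as resetting only the unimportant components' parameters to their saved initial values before the short fine-tuning run (a sketch over plain per-component weight dictionaries; real checkpoints hold tensors, but the selection logic is the same):

```python
def rewind_components(final_weights, init_weights, unimportant):
    """Return a weight dict where unimportant components are reset to their
    initialization and all other components keep their trained values."""
    return {name: (init_weights[name] if name in unimportant else w)
            for name, w in final_weights.items()}
```

The rewound model is then fine-tuned for a few more steps, so the reset components are re-learned in the context of the already-trained rest of the network.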

Related Work
Our work is inspired by two lines of research: interpreting Transformer and network pruning.
Interpreting Transformer. Transformer (Vaswani et al., 2017) has advanced the state of the art in various NLP tasks. Recently, there has been an increasing amount of work on interpreting specific components of Transformer, such as encoder representations (Raganato and Tiedemann, 2018; Tang et al., 2019a; Yang et al., 2019), multi-head self-attention (Li et al., 2018; Voita et al., 2019; Michel et al., 2019; Geng et al., 2020), and encoder attention (Jain and Wallace, 2019; Li et al., 2019; Tang et al., 2019b). Closely related to our work, Domhan (2018) investigated how much each component of Transformer matters. They revealed that self-attention is more important on the encoder side than on the decoder side, and that encoder attention and residual feed-forward components are key. The key difference between their work and ours is that they evaluated the impact of an individual component by retraining a model with the other components, while we investigate component contributions in a trained model. In addition, we conduct more fine-grained analyses of components at different layers and report further findings, e.g., that the lowermost and uppermost layers are generally more important than intermediate layers.

Network Pruning
State-of-the-art deep neural networks are usually over-parameterized: they have many more parameters than training samples (Denton et al., 2014). Recent studies have shown that more than 90% of the parameters can be pruned without harming the performance of neural networks (Frankle and Carbin, 2019). In response, several researchers have proposed pruning to extract sub-networks from the over-parameterized network without degrading model performance.
Based on the granularity level of pruning, network pruning methods can be divided into weight pruning and structured pruning. Weight pruning approaches prune the sparse weights distributed in different components (Han et al., 2015;Han et al., 2016), while structured pruning removes coherent groups of weights to preserve the original structure of the network (Lin et al., 2017;Huang et al., 2018).
In the NLP community, recent studies have shown that Transformer is over-parameterized. For example, Voita et al. (2019) and Michel et al. (2019) showed that most self-attention heads can be dropped. Fan et al. (2020) reduced Transformer depth on demand with structured dropout. Along this direction, we analyze the redundancy of Transformer at the component level and reveal several interesting findings.

Conclusion
In this work, we investigate the impact of individual Transformer components on model performance. Experimental results across a variety of settings show that different components are not equally important. The decoder self-attention layers are least important, and the decoder feed-forward layers are most important. The components that are closer to the model input and output are consistently more important than the others, and upper encoder-attention layers in the decoder are more important than lower encoder-attention layers. Further in-depth analyses reveal that the dropout ratio and training data size affect the number of unimportant components. We also find that unimportant components can be identified at an early stage of training and that their existence is not due to deficient training. Finally, we show that rewinding the unimportant components and then fine-tuning the Transformer model for a few more steps can further improve translation performance.
Future directions include designing better approaches to evaluate the impact of components (e.g., from the perspective of information flow), and validating our findings on other NMT architectures such as RNMT (Chen et al., 2018) and ConvS2S (Gehring et al., 2017).