On the Sub-Layer Functionalities of Transformer Decoder

There have been significant efforts to interpret the encoder of Transformer-based encoder-decoder architectures for neural machine translation (NMT); meanwhile, the decoder remains largely unexamined despite its critical role. During translation, the decoder must predict output tokens by considering both the source-language text from the encoder and the target-language prefix produced in previous steps. In this work, we study how Transformer-based decoders leverage information from the source and target languages – developing a universal probe task to assess how information is propagated through each module of each decoder layer. We perform extensive experiments on three major translation datasets (WMT En-De, En-Fr, and En-Zh). Our analysis provides insight into when and where decoders leverage different sources. Based on these insights, we demonstrate that the residual feed-forward module in each Transformer decoder layer can be dropped with minimal loss of performance – a significant reduction in computation and parameter count, and consequently a significant boost to both training and inference speed.


Introduction
Transformer models have advanced the state-of-the-art on a variety of natural language processing (NLP) tasks, including machine translation (Vaswani et al., 2017), natural language inference (Shen et al., 2018), semantic role labeling (Strubell et al., 2018), and language representation (Devlin et al., 2019). However, not much is yet known about the internal properties and functionalities the model learns to achieve its superior performance, which poses significant challenges for human understanding of the model and for designing better architectures.

* Work done while interning at Tencent AI Lab.
Recent efforts on interpreting Transformer models mainly focus on assessing the encoder representations (Raganato et al., 2018; Tang et al., 2019a) or interpreting the multi-head self-attentions (Li et al., 2018; Voita et al., 2019; Michel et al., 2019). At the same time, there have been few attempts to interpret the decoder side, which we believe is also of great interest, and should be taken into account while explaining the encoder-decoder networks. The reasons are threefold: (a) the decoder takes both source and target as input, and implicitly performs the functionalities of both alignment and language modeling, which are at the core of machine translation; (b) the encoder and decoder are tightly coupled in that the output of the encoder is fed to the decoder and the training signals for the encoder are back-propagated from the decoder; and (c) recent studies have shown that the boundary between the encoder and decoder is blurry, since some of the encoder functionalities can be substituted by the decoder cross-attention modules (Tang et al., 2019b).
In this study, we interpret the Transformer decoder by investigating when and where the decoder utilizes source or target information across its stacking modules and layers. Without loss of generality, we focus on the representation evolution within a Transformer decoder. To this end, we introduce a novel sub-layer split with respect to their functionalities: Target Exploitation Module (TEM) for exploiting the representation from translation history, Source Exploitation Module (SEM) for exploiting the source-side representation, and Information Fusion Module (IFM) to combine representations from the other two ( §2.2).
Further, we design a universal probing scheme to quantify the amount of specific information embedded in network representations. By probing both source and target information from decoder sub-layers, and by analyzing the alignment error rate (AER) and source coverage rate, we arrive at the following findings: • SEM guides the representation evolution within NMT decoder ( §3.1).
• Higher-layer SEMs accomplish the functionality of word alignment, while lower-layer ones construct the necessary contexts ( §3.2).
• TEMs are critical to helping SEM build word alignments, while their stacking order is not essential ( §3.2).
Last but not least, we conduct a fine-grained analysis on the information fusion process within IFM.
Our key contributions in this work are: 1. We introduce a novel split of the Transformer decoder sub-layers with respect to their functionalities.
2. We introduce a universal probing scheme from which we derive the aforementioned conclusions about the Transformer decoder.
3. Surprisingly, we find that the de-facto usage of the residual feed-forward module is not efficient: it can be removed entirely with minimal loss of performance, while significantly boosting training and inference speeds.

Transformer Decoder
NMT models employ an encoder-decoder architecture to accomplish the translation process in an end-to-end manner. The encoder transforms the source sentence into a sequence of representations, and the decoder generates target words by dynamically attending to the source representations. Typically, this framework can be implemented with a recurrent neural network (RNN) (Bahdanau et al., 2015), a convolutional neural network (CNN) (Gehring et al., 2017), or a Transformer (Vaswani et al., 2017). We focus on the Transformer architecture, since it has become the state-of-the-art model on machine translation tasks, as well as various text understanding (Devlin et al., 2019) and generation (Radford et al., 2019) tasks.

Figure 1: A sub-layer splitting of the Transformer decoder with respect to sub-layer functionalities.
Specifically, the decoder is composed of a stack of N identical layers, each of which has three sub-layers, as illustrated in Figure 1. A residual connection (He et al., 2016) is employed around each of the three sub-layers, followed by layer normalization (Ba et al., 2016) ("Add & Norm"). The first sub-layer is a self-attention module that performs self-attention over the previous decoder layer:

    $C_d^n = \mathrm{LN}\big(\mathrm{ATT}(Q_d^n, K_d^n, V_d^n) + L_d^{n-1}\big),$

where $\mathrm{ATT}(\cdot)$ and $\mathrm{LN}(\cdot)$ denote the self-attention mechanism and layer normalization, and $Q_d^n$, $K_d^n$, and $V_d^n$ are the query, key, and value vectors transformed from the (n-1)-th layer representation $L_d^{n-1}$. The second sub-layer performs attention over the output of the encoder representation:

    $D_d^n = \mathrm{LN}\big(\mathrm{ATT}(\widehat{Q}_d^n, K_e^N, V_e^N) + C_d^n\big),$

where the query $\widehat{Q}_d^n$ is transformed from $C_d^n$, and $K_e^N$ and $V_e^N$ are transformed from the top encoder representation $L_e^N$. The final sub-layer is a position-wise fully connected feed-forward network with ReLU activations:

    $L_d^n = \mathrm{LN}\big(\mathrm{FFN}(D_d^n) + D_d^n\big).$

The top decoder representation $L_d^N$ is then used to generate the final prediction.

Sub-Layer Partition
In this work, we aim to reveal how a Transformer decoder accomplishes the translation process utilizing both source and target inputs. To this end, we split each decoder layer into three modules with respect to their different functionalities over the source or target inputs, as illustrated in Figure 1: • Target Exploitation Module (TEM) consists of the self-attention operation and a residual connection, which exploits the target-side translation history from previous layer representations.
• Source Exploitation Module (SEM) consists only of the encoder attention, which dynamically selects relevant source-side information for generation.
• Information Fusion Module (IFM) consists of the rest of the operations, which fuse source and target information into the final layer representation.
Compared with the standard split (Vaswani et al., 2017), we associate the "Add&Norm" operation after the encoder attention with the IFM, since it starts the process of information fusion with a simple additive operation. Consequently, the functionalities of the three modules are well separated.
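Under this partition, a decoder layer's forward pass can be sketched as follows (a simplified PyTorch sketch with illustrative names, not the FairSeq implementation; dropout and other details are omitted):

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One Transformer decoder layer, regrouped into the three
    functional modules: TEM, SEM, and IFM."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads)
        self.enc_attn = nn.MultiheadAttention(d_model, n_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm_tem = nn.LayerNorm(d_model)
        self.norm_ifm1 = nn.LayerNorm(d_model)
        self.norm_ifm2 = nn.LayerNorm(d_model)

    def forward(self, x, enc_out, tgt_mask=None):
        # TEM: self-attention over the target prefix, plus Add&Norm.
        tem, _ = self.self_attn(x, x, x, attn_mask=tgt_mask)
        tem = self.norm_tem(x + tem)
        # SEM: encoder attention only (its Add&Norm belongs to IFM).
        sem, _ = self.enc_attn(tem, enc_out, enc_out)
        # IFM: Add-Norm I, residual feed-forward, Add-Norm II.
        fused = self.norm_ifm1(tem + sem)
        return self.norm_ifm2(fused + self.ffn(fused))
```

Note how the split only regroups the standard operations: moving the first Add&Norm after encoder attention into the IFM leaves SEM as pure source selection.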

Research Questions
The modern Transformer decoder is implemented as a stack of identical layers, in which the source and target information are exploited and evolve layer-by-layer. One research question arises naturally:

RQ1. How do source and target information evolve within the decoder, layer-by-layer and module-by-module?
In Section 3.1, we introduce a universal probing scheme to quantify the amount of information embedded in decoder modules and explore their evolutionary trends. The general trend we find is that higher layers contain more source and target information, while the sub-layers behave differently. Specifically, the amount of information contained by SEMs first increases and then decreases across layers. In addition, we establish that SEM guides both source and target information evolution within the decoder.
Since SEMs are critical to the decoder representation evolution, we conduct a more detailed study into the internal behaviors of the SEMs. The exploitation of source information is also closely related to the inadequate translation problem – a key weakness of NMT models (Tu et al., 2016).
We try to answer the following research question:

RQ2. How does SEM exploit the source information in different layers?
In Section 3.2, we investigate how the SEMs transform the source information to the target side in terms of alignment accuracy and coverage ratio (Tu et al., 2016). Experimental results show that higher-layer SEM modules accomplish word alignment, while lower-layer ones exploit the necessary contexts. This also explains the representation evolution of source information: lower layers collect more source information to obtain a global view of the source input, and higher layers extract only the aligned source input needed for accurate translation.
Of the three sub-layers, IFM modules conceptually appear to play a key role in merging source and target information – raising our final question:

RQ3. How does IFM fuse source and target information at the operation level?
In Section 3.3, we first conduct a fine-grained analysis of the IFM module on the operation level, and find that a simple "Add&Norm" operation performs just as well at fusing information. Thus, we simplify the IFM module to be only one Add&Norm operation. Surprisingly, this performs similarly to the full model while significantly reducing the number of parameters and consequently boosting both training and inference speed.

Experiments
Data To make our conclusions compelling, all experiments and analyses are conducted on three representative language pairs. For English⇒German (En⇒De), we use the WMT14 dataset, which consists of 4.5M sentence pairs. The English⇒Chinese (En⇒Zh) task is conducted on the WMT17 corpus, consisting of 20.6M sentence pairs. For the English⇒French (En⇒Fr) task, we use the WMT14 dataset, which comprises 35.5M sentence pairs. English and French have many aspects in common, while English and German differ in word order, requiring a significant amount of reordering during translation. Moreover, Chinese belongs to a different language family than the other two.

Models
We conducted the experiments on the state-of-the-art Transformer (Vaswani et al., 2017), and implemented our approach with the open-source toolkit FairSeq (Ott et al., 2019). We follow the Transformer-Base setting of Vaswani et al. (2017), which consists of 6 stacked encoder/decoder layers with a model size of 512. We train our models on 8 NVIDIA P40 GPUs, each allocated a batch size of 4,096 tokens. We use the Adam optimizer (Kingma and Ba, 2015) with 4,000 warm-up steps.

Figure 2: Illustration of the information probing model, which reads the representation of a decoder module ("Input 1") and the word sequence to recover ("Input 2"), and outputs the generation probability ("Output").

Representation Evolution Across Layers
In order to quantify and visualize the representation evolution, we design a universal probing scheme to quantify the source (or target) information stored in network representations.
Task Description Intuitively, the more source (or target) information is stored in a network representation, the more probably a trained reconstructor can recover the source (or target) sequence. Since the lengths of the source sequence and the decoder representations are not necessarily the same, widely-used classification-based probing approaches (Tenney et al., 2019b) cannot be applied to this task. Accordingly, we cast the task as a generation problem – evaluating the likelihood of generating the word sequence conditioned on the input representation. Figure 2 illustrates the architecture of our probing scheme. Given a representation sequence from the decoder $\mathbf{H} = \{h_1, \ldots, h_M\}$ and the source (or target) word sequence to be recovered $\mathbf{x} = \{x_1, \ldots, x_N\}$, the recovery likelihood is calculated as the perplexity (i.e., negative log-likelihood) of force-decoding the word sequence:

    $\mathrm{PPL}(\mathbf{x} \mid \mathbf{H}) = \exp\Big(-\frac{1}{N}\sum_{i=1}^{N}\log P(x_i \mid x_{<i}, \mathbf{H})\Big)$    (1)

The lower the recovery perplexity, the more source (or target) information is stored in the representation. The probing model can be implemented as any architecture; for simplicity, we use a one-layer Transformer decoder. We train the probing model to recover both the source and target sequences from all decoder sub-layer representations. During training, we fix the NMT model parameters and train the probing model on the MT training set to minimize the recovery perplexity in Equation 1.
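The probing scheme can be sketched as follows, assuming a one-layer PyTorch Transformer decoder as the reconstructor (class and variable names are illustrative; the paper's exact implementation may differ):

```python
import math
import torch
import torch.nn as nn

class Prober(nn.Module):
    """One-layer Transformer-decoder probe: force-decodes a word
    sequence conditioned on frozen NMT representations and reports
    the recovery perplexity (lower = more information)."""

    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8)
        self.decoder = nn.TransformerDecoder(layer, num_layers=1)
        self.proj = nn.Linear(d_model, vocab_size)

    def recovery_ppl(self, H, x):
        """H: (M, B, d) frozen decoder-module representations;
        x: (T, B) ids of the sequence to recover."""
        T = x.size(0)
        # Shift the input right (0 acts as BOS) and apply a causal
        # mask, so the probe cannot copy the sequence it must recover.
        dec_in = torch.cat([torch.zeros_like(x[:1]), x[:-1]], dim=0)
        mask = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)
        out = self.decoder(self.embed(dec_in), H, tgt_mask=mask)
        nll = nn.functional.cross_entropy(
            self.proj(out).reshape(-1, self.proj.out_features),
            x.reshape(-1))
        return math.exp(nll.item())
```

In the actual setup, `H` would come from a frozen NMT model and only the probe's parameters would be updated; here, the probe simply exponentiates the mean forced-decoding negative log-likelihood, matching Equation 1.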

Task Discussion
The above probing scheme is a general framework applicable to probing any given sequence from a network representation. When we probe for the source sequence, the probing model is analogous to an auto-encoder (Bourlard and Kamp, 1988; Vincent et al., 2010), which reconstructs the original input from the network representations. When we probe for the target sequence, we apply an attention mask to the probing decoder to avoid direct copying from the translation-history input. Unlike in source probing, the target sequence itself is never seen by the probing model as input.
In addition, our proposed scheme can also be applied to probe linguistic properties that can be represented in a sequential format. For instance, we could probe source constituency parsing information by training a probing model to recover the linearized parse sequence (Vinyals et al., 2015). Due to space limitations, we leave linguistic probing to future work.

Probing Results
Figure 3 shows the results of our information probing conducted on the held-out set. We have a few observations:
• The evolution trends of TEM and IFM are largely the same. Specifically, the curve of TEM closely matches that of IFM shifted up by one layer. Since TEM representations are only two operations (self-attention and Add&Norm) away from the previous layer's IFM, this observation indicates that TEMs do not significantly affect the amount of source/target information.
• SEM guides both source and target information evolution. Observing the curves closely, the trend of the layer representations (i.e., IFM) is always led by that of SEM. For example, when the PPL of SEM turns from decreasing to increasing, the PPL of IFM slows its decrease and subsequently starts to increase. This is intuitive: in machine translation, source and target sequences should contain equivalent information, so target generation should largely follow the lead of the source information (from SEM representations) to guarantee its adequacy.
• For IFM, the amount of target information consistently increases in higher layers – a consistent decrease of PPL in Figures 3(d-f). While source information rises in the lower layers, it drops in the highest layer (Figures 3(a-c)).
Since SEM representations are critical to decoder evolution, we next investigate how SEMs exploit source information, in the hope of explaining the decoder's information evolution.

Exploitation of Source Information
Ideally, SEM should accurately and fully incorporate the source information for the decoder. Accordingly, we evaluate how well SEMs accomplish the expected functionality from two perspectives.
Word Alignment. Previous studies generally interpret the attention weights of SEM as word alignments between source and target words, which can measure whether SEMs select the most relevant part of source information for each target token (Tu et al., 2016;Tang et al., 2019b). We follow previous practice to merge attention weights from the SEM attention heads, and to extract word alignments by selecting the source word with the highest attention weight for each target word. We calculate the alignment error rate (AER) scores (Och and Ney, 2003) for word alignments extracted from SEM of each decoder layer.
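The alignment extraction and AER computation described above can be sketched as follows (a minimal sketch; function names are our own, and gold alignments are assumed to be given as sets of (source, target) index pairs):

```python
import torch

def extract_alignments(attn):
    """attn: (n_heads, tgt_len, src_len) SEM attention weights for one
    sentence. Merge the heads by averaging, then align each target word
    to its highest-weighted source word."""
    merged = attn.mean(dim=0)            # (tgt_len, src_len)
    src_idx = merged.argmax(dim=-1)      # best source word per target word
    return {(int(s), int(t)) for t, s in enumerate(src_idx)}

def aer(hyp, sure, possible):
    """Alignment Error Rate (Och and Ney, 2003). hyp: predicted
    (src, tgt) pairs; sure/possible: gold sets with sure a subset of
    possible. Lower is better."""
    a_s = len(hyp & sure)
    a_p = len(hyp & possible)
    return 1.0 - (a_s + a_p) / (len(hyp) + len(sure))
```

Running `extract_alignments` on each decoder layer's SEM weights and scoring the result with `aer` against the manually-labeled gold sets yields the per-layer AER curves.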
Cumulative Coverage. Coverage is commonly used to evaluate whether the source words are fully translated (Tu et al., 2016; Kong et al., 2019). We use the above extracted word alignments to identify the set of source words $A_i$ that are covered (i.e., aligned to at least one target word) at layer i. We then propose a new metric, the cumulative coverage ratio $C_{\le i}$, to indicate how many source words are covered by the layers $\le i$:

    $C_{\le i} = \frac{\big|\bigcup_{j \le i} A_j\big|}{N},$

where N is the total number of source words. This metric indicates the completeness of source information coverage up to layer i.
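The cumulative coverage ratio then amounts to a running union over the per-layer aligned source sets (a sketch with illustrative names):

```python
def cumulative_coverage(aligned_per_layer, num_src_words):
    """aligned_per_layer: list over layers i of the set A_i of source
    positions aligned at layer i. Returns C_{<=i} for each layer: the
    fraction of source words covered by any layer up to and including i."""
    covered, ratios = set(), []
    for A_i in aligned_per_layer:
        covered |= A_i            # union of A_j for all j <= i
        ratios.append(len(covered) / num_src_words)
    return ratios
```

By construction the ratios are non-decreasing across layers, which is what makes the metric cumulative.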

Dataset
We conducted experiments on two manually-labeled alignment datasets: RWTH En-De and En-Zh (Liu and Sun, 2015). The alignments are extracted from NMT models trained on the WMT En-De and En-Zh datasets.
Results Figure 4 demonstrates our results on word alignment and cumulative coverage. We find that the lower-layer SEMs focus on gathering source contexts (rapid increase of cumulative coverage with poor word alignment), while higher-layer ones play the role of word alignment, with the lowest AER score of less than 0.4 at the 5th layer. The 4th layer and the 3rd layer separate the two roles for En-De and En-Zh, respectively. Correspondingly, these are also the turning points (PPL turning from decreasing to increasing) of source information evolution in Figures 3(a,b). Together with the conclusions from Sec. 3.1, we demonstrate the general pattern of SEM: it tends to cover more source content and gain an increasing amount of source information up to a turning point at the 3rd or 4th layer, after which it attends only to the most relevant source tokens and contains a decreasing amount of total source information.
TEM Modules Since TEM representations serve as the query vectors for the encoder attention operations (shown in Figure 1), we naturally hypothesize that TEM helps SEM build alignments. To verify this, we remove TEM from the decoder ("SEM⇒IFM"), which significantly increases the alignment error from 0.37 to 0.54 (Figure 5) and leads to a serious decrease in translation performance (BLEU: 27.45 ⇒ 22.76, Table 1) on En-De; results on En-Zh confirm this as well (Figure 6). This indicates that TEM is essential for building word alignments. However, reordering the stacking of TEM and SEM ("SEM⇒TEM⇒IFM") affects neither alignment nor translation quality (BLEU: 27.45 vs. 27.61). These results provide empirical support for recent work on merging the TEM and SEM modules.

Robustness to Decoder Depth
To verify the robustness of our conclusions, we vary the depth of NMT decoder and train it from scratch. Table 2 demonstrates the results on translation quality, which generally show that more decoder layers bring better performance. Figure 7 shows that SEM behaves similarly regardless of depth. These results demonstrate the robustness of our conclusions.

Information Fusion in Decoder
We now turn to the analysis of IFM. Within the Transformer decoder, IFM plays the critical role of fusing the source and target information by merging representations from SEM and TEM. To study the information fusion process, we conduct a more fine-grained analysis of IFM at the operation level.
Fine-Grained Analysis on IFM As shown in Figure 8(a), IFM contains three operations:
• Add-Norm I linearly sums and normalizes the representations from SEM and TEM;
• Feed-Forward non-linearly transforms the fused source and target representations;
• Add-Norm II again linearly sums and normalizes the representations from the above two.
Figures 8(b) and (c) respectively illustrate the source and target information evolution within IFM. Surprisingly, Add-Norm I contains a similar amount of, if not more, source (and target) information than Add-Norm II, while the Feed-Forward curve deviates significantly from both. This indicates that the residual Feed-Forward operation may not affect the source (and target) information evolution, and one Add&Norm operation may be sufficient for information fusion.

IFM Analysis Results
Simplified Decoder To empirically test whether one Add&Norm operation is already sufficient, we remove all other operations, leaving just one Add&Norm operation in the IFM. The architectural change is illustrated in Figure 9(b); we dub the result the "simplified decoder". Table 4 reports the translation performance of both architectures on all three major datasets, while Figure 10 illustrates the information evolution of both on WMT En-De. We find the simplified model reaches comparable performance, with only a minimal drop of 0.1-0.3 BLEU on En-De and En-Fr, and a gain of 0.9 BLEU on En-Zh. To further assess translation performance, we manually evaluate 100 translations sampled from the En-Zh test set. On a scale of 1 to 5, the simplified decoder obtains a fluency score of 4.01 and an adequacy score of 3.87, approximately equivalent to the standard decoder's 4.00 fluency and 3.86 adequacy (Table 5).
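The simplified layer amounts to deleting the feed-forward block and its second Add&Norm from a standard decoder layer (again an illustrative PyTorch sketch, not the FairSeq implementation):

```python
import torch
import torch.nn as nn

class SimplifiedDecoderLayer(nn.Module):
    """Decoder layer with the IFM reduced to a single Add&Norm: the
    residual feed-forward block and its Add&Norm are removed."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads)
        self.enc_attn = nn.MultiheadAttention(d_model, n_heads)
        self.norm_tem = nn.LayerNorm(d_model)
        self.norm_ifm = nn.LayerNorm(d_model)

    def forward(self, x, enc_out, tgt_mask=None):
        tem, _ = self.self_attn(x, x, x, attn_mask=tgt_mask)
        tem = self.norm_tem(x + tem)                    # TEM
        sem, _ = self.enc_attn(tem, enc_out, enc_out)   # SEM
        return self.norm_ifm(tem + sem)                 # IFM: one Add&Norm
```

In the Base setting, the removed feed-forward block accounts for roughly 2.1M parameters per layer (two 512x2048 projections plus biases), which is where the parameter and speed savings come from.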

Simplified Decoder Results
On the other hand, since the simplified decoder drops the operations (Feed-Forward) with the most parameters (shown in Table 3), we also expect a significant increase in training and inference speeds. From Table 4, we confirm a consistent boost of both training and inference speeds by approximately 11-14%. To demonstrate robustness, we also confirm our findings under the Transformer Big setting (Vaswani et al., 2017), whose results are shown in Section A.2. The lower PPL in Figure 10 suggests that the simplified model also contains consistently more source and target information across its stacked layers.
Our results demonstrate that a single Add&Norm is indeed sufficient for IFM: the simplified model reaches comparable performance with a significant parameter reduction and a noticeable 11-14% boost in training and inference speed.

Related Work
Interpreting Encoder Representations Previous studies generally focus on interpreting the encoder representations by evaluating how informative they are for various linguistic tasks (Conneau et al., 2018; Tenney et al., 2019b), for both RNN models (Shi et al., 2016; Bisazza and Tump, 2018; Blevins et al., 2018) and Transformer models (Raganato et al., 2018; Tang et al., 2019a; Tenney et al., 2019a). Although they found that a certain amount of linguistic information is captured by encoder representations, it is still unclear how much of the encoded information is used by the decoder. Our work bridges this gap by interpreting how the Transformer decoder exploits the encoded information.

Table 4: Performance of the simplified Base decoder. "#Train" denotes the training speed (words per second) and "#Infer." denotes the inference speed (sentences per second). Results are averages of three runs.

Interpreting Encoder Self-Attention In recent years, there has been a growing interest in interpreting the behaviors of attention modules. Previous studies generally focus on the self-attention in the encoder, which is implemented as multi-head attention. For example, Li et al. (2018) showed that different attention heads in the encoder-side self-attention generally attend to the same position. Voita et al. (2019) and Michel et al. (2019) found that only a few attention heads play consistent and often linguistically-interpretable roles, and the others can be pruned. Geng et al. (2020) empirically validated that a selective mechanism can mitigate the problems of word order encoding and structure modeling in encoder-side self-attention. In this work, we investigated the functionalities of decoder-side attention modules for exploiting both source and target information.
Interpreting Encoder Attention The encoder-attention weights are generally employed to interpret the output predictions of NMT models. Recently, Jain and Wallace (2019) showed that attention weights are weakly correlated with the contribution of source words to the prediction. He et al. (2019) used integrated gradients to better estimate the contribution of source words. Related to our work, Tang et al. (2019b) also conducted word alignment analysis on the same De-En and Zh-En datasets with Transformer models. We use similar techniques to examine word alignment in our context; however, we also introduce a forced-decoding-based probing task to closely examine the information flow.

Table 5: Fluency and adequacy scores from the En-Zh manual evaluation.

Model              Fluency  Adequacy
Standard (Base)    4.00     3.86
Simplified (Base)  4.01     3.87
Understanding and Improving NMT Recent work has started to improve NMT based on the findings of interpretation studies. For instance, Belinkov et al. (2018) pointed out that different layers prioritize different linguistic types, based on which Dou et al. (2018), among others, simultaneously exposed all of these signals to the subsequent process. Other work explained why the decoder learns considerably less morphology than the encoder, and then explored explicitly injecting morphology into the decoder. Emelin et al. (2019) argued that the need to represent and propagate lexical features in each layer limits the model's capacity, and introduced gated shortcut connections between the embedding layer and each subsequent layer. Further studies revealed that miscalibration remains a severe challenge for NMT during inference, and proposed a graduated label smoothing that can improve inference calibration. In this work, based on our information probing analysis, we simplified the decoder by removing the residual feed-forward module entirely, with minimal loss of translation quality and a significant boost to both training and inference speeds.

Conclusions
In this paper, we interpreted the Transformer decoder for NMT by assessing the evolution of both source and target information across layers and modules.
To this end, we investigated the information functionalities of decoder components in the translation process. Experimental results on three major datasets revealed several findings that help understand the behaviors of the Transformer decoder from different perspectives. We hope that our analysis and findings can inspire architectural changes for further improvements, such as 1) improving the word alignment of higher-layer SEMs by incorporating external alignment signals; 2) exploring the stacking order of the SEM, TEM, and IFM sub-layers, which may provide a more effective way to transform information; 3) further pruning redundant sub-layers for efficiency. Since our analysis approaches are not limited to the Transformer model, it is also interesting to explore other architectures such as RNMT (Chen et al., 2018) and ConvS2S (Gehring et al., 2017), or document-level NMT (Wang et al., 2017). In addition, our analysis methods can be applied to other sequence-to-sequence tasks such as summarization and grammatical error correction, whose source and target sides are in the same language. We leave those tasks for future work.

A.1 Implementation Details
All Transformer models are selected based on their loss on the validation set, and evaluated and reported on the test set. For the En-De and En-Fr models, we used newstest2013 as the validation set and newstest2014 as the test set. For the En-Zh models, we used newsdev2016 as the validation set and newstest2017 as the test set.
All three datasets follow the preprocessing steps from FairSeq, which use the Moses tokenizer and a joint BPE with 40,000 merge operations, without lower-casing or true-casing.
All models are evaluated with a beam size of 10. Before computing the BLEU score, we apply a post-processing step: En-De and En-Fr outputs undergo compound word splitting, and En-Zh outputs are split into Chinese characters. All outputs are then evaluated with the Moses multi-bleu.perl script against the gold references.

A.2 Transformer Big Results
We also compare the performance of the standard and simplified decoders under the Transformer Big setting. Big models are trained on 4 NVIDIA V100 GPUs, each allocated a batch size of 8,192 tokens. Other training schedules and hyper-parameters follow the standard setting (Vaswani et al., 2017). Our Transformer Base models are all trained with full precision (FP32), while Big models are trained with half precision (FP16) for faster training.
Transformer Big results are shown in Table 6. We observe a more severe BLEU drop along with a more significant speed boost under the Big setting. This is intuitive: compared to the Base setting, the simplified decoder drops more parameters while still being trained under the same schedule as the standard model, which escalates the training discrepancy. Unfortunately, due to resource limitations, we could not afford hyper-parameter tuning for Transformer Big.

Table 6: Performance of the simplified Big decoder. "#Train" denotes the training speed (words per second) and "#Infer." denotes the inference speed (sentences per second).

A.3 Additional En-Zh and En-Fr Plots
All experiments are conducted on three datasets (En-De, En-Zh, and En-Fr), with similar findings on each. Due to space limits, we mainly present results on the En-De task in the paper. In this section, we provide additional results on En-Zh and En-Fr where applicable.