How Much Attention Do You Need? A Granular Analysis of Neural Machine Translation Architectures

With recent advances in network architectures for Neural Machine Translation (NMT), recurrent models have effectively been replaced by either convolutional or self-attentional approaches, such as the Transformer. While the main innovation of the Transformer architecture is its use of self-attentional layers, there are several other aspects, such as attention with multiple heads and the use of many attention layers, that distinguish the model from previous baselines. In this work we take a fine-grained look at the different architectures for NMT. We introduce an Architecture Definition Language (ADL) allowing for a flexible combination of common building blocks. Making use of this language, we show in experiments that one can bring recurrent and convolutional models very close to Transformer performance by borrowing concepts from the Transformer architecture, but without using self-attention. Additionally, we find that self-attention is much more important on the encoder side than on the decoder side, where it can be replaced by an RNN or CNN without a loss in performance in most settings. Surprisingly, even a model without any target-side self-attention performs well.

Recently, other approaches relying on convolutional networks (Kalchbrenner et al., 2016; Gehring et al., 2017) and self-attention (Vaswani et al., 2017) have been introduced. These approaches remove the dependency between source language time steps, leading to considerable speed-ups in training time and improvements in quality. The Transformer, however, contains other differences besides self-attention, including layer normalization across the entire model, multiple source attention mechanisms, a multi-head dot attention mechanism, and the use of residual feedforward layers. This raises the question of how much each of these components matters.
To answer this question we first introduce a flexible Architecture Definition Language (ADL) (§2). In this language we standardize existing components in a consistent way, making it easier to compare structural differences between architectures. Additionally, it allows us to efficiently perform a granular analysis of architectures, where we can evaluate the impact of individual components rather than comparing entire architectures as a whole. This ability leads us to the following observations:
• Source attention on lower encoder layers brings no additional benefit ( §4.2).
• Multiple source attention layers and residual feed-forward layers are key ( §4.3).
• Self-attention is more important for the source than for the target side ( §4.4).

Flexible Neural Machine Translation Architecture Combination
In order to experiment easily with different architecture variations we define a domain specific NMT Architecture Definition Language (ADL), consisting of combinable and nestable building blocks.

Neural Machine Translation
NMT is formulated as a sequence-to-sequence prediction task in which a source sentence X = x_1, ..., x_n is translated auto-regressively into a target sentence Y = y_1, ..., y_m one token at a time as

p(y_t | Y_{1:t−1}, X; θ) = softmax(W_o z^L_t + b_o),    (1)

where b_o is a bias vector, W_o projects a model-dependent hidden vector z^L_t of the L-th decoder layer to the dimension of the target vocabulary V_trg, and θ denotes the model parameters. Typically, during training Y_{1:t−1} consists of the reference sequence tokens, rather than the predictions produced by the model, which is known as teacher forcing. Training is done by minimizing the cross-entropy loss between the predicted and the reference sequence.
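As a concrete illustration, the per-step prediction and teacher-forced cross-entropy can be sketched in plain Python. The dimensions and weights below are toy values, not the actual model:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def step_loss(z, W_o, b_o, ref_id):
    """Cross-entropy for one decoder step: -log softmax(W_o z + b_o)[ref_id].
    Under teacher forcing, ref_id is the reference token at this step,
    not the model's own previous prediction."""
    logits = [sum(w * x for w, x in zip(row, z)) + b for row, b in zip(W_o, b_o)]
    return -math.log(softmax(logits)[ref_id])

# Toy setup: hidden size 2, target vocabulary of 3 tokens.
W_o = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_o = [0.0, 0.0, 0.0]
z_t = [2.0, -1.0]                      # decoder hidden state z^L_t
loss = step_loss(z_t, W_o, b_o, ref_id=0)
```

The sequence-level loss is simply the sum (or mean) of these per-step losses over the reference sequence.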

Architecture Definition Language
In the following we specify the ADL which can be used to define any standard NMT architecture and combinations thereof.
Layers The basic building block of the ADL is a layer l. Layers can be nested, meaning that a layer can consist of several sublayers. Layers optionally take a set of named arguments l(k_1=v_1, k_2=v_2, ...) with names k_1, k_2, ... and values v_1, v_2, ..., or positional arguments l(v_1, v_2, ...).
Layer definitions For each layer we have a corresponding layer definition based on the hidden states of the previous layer and any additional arguments. Specifically, each layer takes T hidden states h^i_1, ..., h^i_T, which in matrix form are H^i ∈ R^{T×d_i}, and produces a new set of hidden states h^{i+1}_1, ..., h^{i+1}_T, or H^{i+1}. While each layer can have a different number of hidden units d_i, in the following we assume it stays constant across layers and refer to the model dimensionality as d_model. We distinguish the hidden states of the source side U^0, ..., U^{L_s} from the hidden states of the target side Z^0, ..., Z^L. These are produced by the source and target embeddings, L_s source layers and L target layers.
Source attention layers play a special role in that their definition additionally makes use of any of the source hidden states U^0, ..., U^{L_s}.
Layer chaining Layers can be chained, feeding the output of one layer as the input to the next. We denote this as l_1 → l_2 → ... → l_L. This is equivalent to writing l_L(... l_2(l_1(H^0))) if none of the layers is a source attention layer.
Layers in a chain may themselves take arguments. As an example, l_1(k=v) → l_2 → ... → l_L is equivalent to l_L(... l_2(l_1(H^0, k=v))). Note that, unlike in the layer definition, hidden states are not explicitly stated in the layer chain, but rather implicitly defined through the preceding layers.
Encoder/Decoder structure An NMT model is fully defined through two layer chains, namely one describing the encoder and another describing the decoder. The first-layer hidden states on the source side U^0 are defined through the source embedding as

u^0_t = E_src x_t,

where x_t ∈ {0, 1}^{|V_src|} is the one-hot representation of x_t and E_src ∈ R^{e×|V_src|} an embedding matrix with embedding dimensionality e. Similarly, Z^0 is defined through the target embedding matrix E_tgt. Given the final decoder hidden state Z^L, the next-word predictions are made according to Equation 1.
Layer repetition Networks often consist of substructures that are repeated several times. In order to support this we define a repetition layer as repeat(n, l) = l^1 → l^2 → ... → l^n, where l represents a layer chain and each of l^1, ..., l^n is an instantiation of that layer chain with a separate set of weights.
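The two structural operations above, chaining and repetition, can be sketched as higher-order functions. This is a hypothetical illustration of the semantics, with a layer modeled as a function from a list of hidden states to a new list of hidden states; it is not the paper's implementation:

```python
def chain(*layers):
    """Layer chaining l1 -> l2 -> ... -> lL: feed each output into the next layer."""
    def chained(h):
        for layer in layers:
            h = layer(h)
        return h
    return chained

def repeat(n, make_layer):
    """repeat(n, l): n instantiations of a layer chain; separate weights per copy
    are modeled by calling the factory once per instantiation."""
    return chain(*[make_layer() for _ in range(n)])

# Toy layer: doubling stands in for a real parameterized layer.
def make_double():
    return lambda h: [2.0 * x for x in h]

encoder = chain(make_double(), repeat(3, make_double))
out = encoder([1.0, 1.0])   # four doublings per position
```

Real layers additionally carry parameters and, for source attention, a reference to the encoder states, but the composition logic is the same.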

Layer Definitions
In this section we will introduce the concrete layers and their definitions, which are available for composing NMT architectures. They are based on building blocks common to many current NMT models.
Dropout A dropout (Srivastava et al., 2014) layer, denoted as dropout(h t ), can be applied to hidden states as a form of regularization.
Fixed positional embeddings Fixed positional embeddings (Vaswani et al., 2017) add information about the position in the sequence to the hidden states. With h_t ∈ R^d the positional embedding layer is defined as

pos(h_t) = h_t + p_t,

where the entries of p_t are given by sinusoids, p_{t,2i} = sin(t / 10000^{2i/d}) and p_{t,2i+1} = cos(t / 10000^{2i/d}).

Linear We define a linear projection layer as

linear(h_t) = W h_t + b.

Feed-forward Making use of the linear projection layer, a feed-forward layer with ReLU activation and dropout is defined as

ff(h_t) = dropout(max(0, linear(h_t))),

and a version which temporarily upscales the number of hidden units, as done by Vaswani et al. (2017), is defined as

ffl(h_t) = linear(ff(h_t)),

where h_t ∈ R^{d_in}, the inner layer projects to 4·d_in hidden units and the outer linear projection maps back to d_in.
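The sinusoidal positional embedding can be written out in a few lines. This is a sketch of the standard formulation from Vaswani et al. (2017), with even dimensions using sine and odd dimensions cosine:

```python
import math

def positional_embedding(t, d):
    """Fixed sinusoidal positional embedding p_t for position t and
    dimensionality d; wavelengths form a geometric progression."""
    p = []
    for i in range(d):
        angle = t / (10000.0 ** (2 * (i // 2) / d))
        p.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return p

def pos(h, t):
    # pos(h_t) = h_t + p_t
    return [x + p for x, p in zip(h, positional_embedding(t, len(h)))]
```

At position t = 0 the embedding alternates between 0 (sine entries) and 1 (cosine entries), which makes the scheme easy to sanity-check.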

Convolution
Convolutions run a small feed-forward network on a sliding window over the input. Formally, on the encoder side this is defined as

cnn(h_t) = v(W [h_{t−⌊k/2⌋}; ...; h_{t+⌊k/2⌋}] + b),

where k is the kernel size and v is a non-linearity. The input is padded so that the number of hidden states does not change.
To preserve the auto-regressive property of the decoder we need to make sure never to take future decoder time steps into account, which can be achieved by adding k − 1 padding vectors h_{−k+1} = 0, ..., h_{−1} = 0 such that the decoder convolution is given as

cnn(h_t) = v(W [h_{t−k+1}; ...; h_t] + b).

The non-linearity v can either be a ReLU or a Gated Linear Unit (GLU) (Dauphin et al., 2016). With the GLU we set d_i = 2d such that we can split h = [h_A; h_B] ∈ R^{2d} and compute the non-linearity as

v([h_A; h_B]) = h_A ⊙ σ(h_B).

Identity We define an identity layer as id(h_t) = h_t.

Concatenation To concatenate the output of p layer chains we define concat(h_t, l_1, ..., l_p) = [l_1(h_t); ...; l_p(h_t)].

Recurrent Neural Network An RNN layer is defined as

rnn(h_t) = f^rnn_o(s_t), with s_t = f^rnn_h(s_{t−1}, h_t),

where f^rnn_o and f^rnn_h could be defined through either a GRU or an LSTM (Hochreiter and Schmidhuber, 1997) cell. In addition, a bidirectional RNN layer birnn is available, which runs one rnn in the forward and another in the reverse direction and concatenates both results.
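The causal padding trick for the decoder convolution and the GLU non-linearity can be sketched as follows. The summation function stands in for the small feed-forward network of a real convolution; the weights and dimensions are toy assumptions:

```python
import math

def decoder_conv(h, k, f):
    """Causal convolution: prepend k-1 zero states so that step t only sees
    h_{t-k+1}, ..., h_t; f plays the role of the windowed feed-forward network."""
    padded = [0.0] * (k - 1) + h
    return [f(padded[t:t + k]) for t in range(len(h))]

def glu(h):
    """Gated Linear Unit: split h = [h_A; h_B] and return h_A * sigmoid(h_B)."""
    half = len(h) // 2
    return [a / (1.0 + math.exp(-b)) for a, b in zip(h[:half], h[half:])]

# With k=2 and summation as f, step t combines h_{t-1} and h_t only.
out = decoder_conv([1.0, 2.0, 3.0], k=2, f=sum)   # [1.0, 3.0, 5.0]
```

Because the window only ever extends into the past, the output at step t cannot depend on future decoder states, which is exactly the auto-regressive property required at training time.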
Attention All attention mechanisms take a set of query vectors q_1, ..., q_M, key vectors k_1, ..., k_N and value vectors v_1, ..., v_N in order to produce one context vector per query, which is a linear combination of the value vectors. We define Q ∈ R^{M×d}, K ∈ R^{N×d} and V ∈ R^{N×d} as the concatenations of these vectors. What is used as the query, key and value vectors depends on the attention type and is defined below.
Dot product attention The scaled dot product attention (Vaswani et al., 2017) is defined as

dot_att(Q, K, V) = softmax(Q K^T / √s) V,

where the scaling factor s is implicitly set to d unless noted otherwise. Adding a projection to the queries, keys and values we get the projected dot attention

proj_dot_att(Q, K, V) = dot_att(Q W_Q, K W_K, V W_V).

Vaswani et al. (2017) further introduce multi-head attention, which applies multiple attentions at a reduced dimensionality. With h heads, each head computes a projected dot attention at dimensionality d/h, and the results are concatenated and projected back to d:

mh_dot_att(Q, K, V) = [C_1; ...; C_h] W_O, with C_i = proj_dot_att_i(Q, K, V).

Note that with h = 1 we recover the projected dot attention.
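The scaled dot and multi-head variants can be sketched in a few lines of NumPy. The random projection matrices below stand in for learned parameters and the dimensions are toy values, so this illustrates the shapes and data flow rather than a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_att(Q, K, V, s):
    """Scaled dot attention: softmax(Q K^T / sqrt(s)) V."""
    scores = Q @ K.T / np.sqrt(s)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def mh_dot_att(Q, K, V, h):
    """h heads of projected dot attention at dimensionality d/h,
    concatenated and projected back to d."""
    d = Q.shape[-1]
    dk = d // h
    heads = [
        dot_att(Q @ rng.normal(size=(d, dk)),
                K @ rng.normal(size=(d, dk)),
                V @ rng.normal(size=(d, dk)), s=dk)
        for _ in range(h)
    ]
    return np.concatenate(heads, axis=-1) @ rng.normal(size=(d, d))

Q = rng.normal(size=(5, 8))   # M=5 queries of dimensionality d=8
K = rng.normal(size=(7, 8))   # N=7 keys and values
V = rng.normal(size=(7, 8))
context = mh_dot_att(Q, K, V, h=2)   # one context vector per query: (5, 8)
```

Since each attention output row is a convex combination of the value rows, every entry of dot_att's output lies between the minimum and maximum entries of V, which is a convenient invariant to test.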

MLP attention
The MLP attention computes the scores with a one-layer neural network as

score(q_t, k_s) = w^T tanh(W_Q q_t + W_K k_s).

Source attention Using the source hidden vectors U, a source attention is computed with queries taken from the decoder hidden states and keys and values taken from the source hidden states, e.g. att(Q = Z^i, K = U^{L_s}, V = U^{L_s}).

Self-attention Self-attention (Vaswani et al., 2017) uses the hidden states as queries, keys and values, i.e. att(Q = H^i, K = H^i, V = H^i). Note that on the target side one needs to preserve the auto-regressive property by only attending to hidden states at the current or past time steps, which is achieved by masking the attention mechanism.
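The target-side masking for self-attention can be sketched with NumPy: scores for future positions are set to −∞ before the softmax, so they receive exactly zero weight. This is a toy sketch of the standard masking construction, not the paper's implementation:

```python
import numpy as np

def masked_self_att(H):
    """Target-side self-attention: queries, keys and values are all H, and an
    auto-regressive mask removes attention to future positions, so position t
    only attends to positions <= t."""
    T, d = H.shape
    scores = H @ H.T / np.sqrt(d)
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ H, w

H = np.arange(6.0).reshape(3, 2)   # T=3 target positions, d=2
context, weights = masked_self_att(H)
```

The first position can only attend to itself, so its context vector equals its own hidden state, which the test below checks.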
Layer normalization Layer normalization (Ba et al., 2016) normalizes the hidden states using their mean and standard deviation. It is computed as

norm(h_t) = g ⊙ (h_t − μ_t) / σ_t + b,

where μ_t and σ_t are the mean and standard deviation of the entries of h_t, and g and b are learned scale and shift parameters with the same dimensionality as h.
Residual layer A residual layer adds the output of an arbitrary layer chain l to the current hidden states. We define this as

res(h_t, l) = h_t + l(h_t).

For convenience we also define res_d(h_t, l) = res(h_t, l → dropout) and res_nd(h_t, l) = res(h_t, norm → l → dropout).
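Layer normalization and the residual wrappers compose as follows. This is a plain-Python sketch with identity defaults for the learned scale and shift, and with dropout omitted (i.e. treated as identity, as at inference time):

```python
import math

def norm(h, g=None, b=None):
    """Layer normalization: subtract the mean, divide by the standard deviation,
    then apply scale g and shift b (identity defaults in this sketch)."""
    d = len(h)
    mu = sum(h) / d
    sigma = math.sqrt(sum((x - mu) ** 2 for x in h) / d)
    g = g or [1.0] * d
    b = b or [0.0] * d
    return [gi * (x - mu) / (sigma + 1e-6) + bi for gi, x, bi in zip(g, h, b)]

def res(h, sublayer):
    # res(h_t, l) = h_t + l(h_t)
    return [x + y for x, y in zip(h, sublayer(h))]

def res_nd(h, sublayer):
    # res_nd: normalize the *input* of the residual block, then apply the
    # sublayer; dropout is omitted in this sketch.
    return res(h, lambda x: sublayer(norm(x)))

out = res([1.0, 2.0], lambda h: h)   # identity sublayer -> [2.0, 4.0]
```

Normalizing the input of the block (res_nd) rather than its output is exactly the structural difference between the tensor2tensor-style Transformer used in this paper and the originally published one.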

Standard Architectures
Having defined the common building blocks we now show how standard NMT architectures can be constructed.
RNMT As RNNs have been used in NMT the longest, several smaller architecture variations exist. Similar to Wu et al. (2016), in the following we use a bi-directional RNN followed by a stack of uni-directional RNNs with residual connections on the encoder side. Using the ADL, an n-layer encoder can be expressed as U^{L_s} = dropout → birnn → repeat(n − 1, res_d(rnn)).
For the decoder we use the architecture by Luong et al. (2015), which first runs a stacked RNN and then combines the context provided by a single attention mechanism with the hidden state provided by the RNN. This can be expressed as concat(id, mlp_att) → ff.
If input feeding (Luong et al., 2015) is used, the first-layer hidden states are redefined so that the target embedding is concatenated with the final decoder hidden state of the previous time step, z^0_t = [E_tgt y_{t−1}; z^L_{t−1}]. Note that this inhibits any parallelism across decoder time steps. This is only an issue for models other than RNNs, as RNNs already do not allow parallelizing over decoder time steps.
ConvS2S Gehring et al. (2017) introduced an NMT model that fully relies on convolutions, both on the encoder and on the decoder side. The encoder is defined as U^{L_s} = pos → repeat(n, res(cnn(glu) → dropout)) and the decoder, which uses an unscaled single-head dot attention, is defined as Z^L = pos → repeat(n, res(dropout → cnn(glu) → dropout → res(dot_src_att(s=1)))).
Note that unlike Gehring et al. (2017) we do not project the query vectors before the attention and do not add the embeddings to the attention values.
Transformer The Transformer (Vaswani et al., 2017) makes use of self-attention, instead of RNNs or Convolutional Neural Networks (CNNs), as the basic computational block. Note that we use a slightly updated residual structure, as implemented in tensor2tensor,1 compared to the originally proposed one. Specifically, layer normalization is applied to the input of the residual block instead of between blocks. The Transformer uses a combination of self-attention and feed-forward layers on the encoder side, and additionally source attention layers on the decoder side. We define the Transformer encoder block as t_enc = res_nd(mh_dot_self_att) → res_nd(ffl), and the decoder block as t_dec = res_nd(mh_dot_self_att) → res_nd(mh_dot_src_att) → res_nd(ffl).
The Transformer encoder is then given as U^{L_s} = pos → repeat(n, t_enc) → norm and the decoder as Z^L = pos → repeat(n, t_dec) → norm.
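The block structure above can be made concrete with a small wiring sketch: each named layer is replaced by a stub that records its name, so only the chaining and repetition are illustrated, with no real computation. This is a hypothetical illustration, not the paper's code:

```python
def layer(name):
    """Stub layer: records its name in a trace and passes hidden states through."""
    def fn(h, trace):
        trace.append(name)
        return h
    return fn

def chain(*layers):
    def fn(h, trace):
        for l in layers:
            h = l(h, trace)
        return h
    return fn

def repeat(n, make_block):
    # n block instantiations, modeling separate weights per copy.
    return chain(*[make_block() for _ in range(n)])

def t_enc():
    return chain(layer("res_nd(mh_dot_self_att)"), layer("res_nd(ffl)"))

def t_dec():
    return chain(layer("res_nd(mh_dot_self_att)"),
                 layer("res_nd(mh_dot_src_att)"),
                 layer("res_nd(ffl)"))

# Encoder with n=2 blocks: pos -> repeat(2, t_enc) -> norm
encoder = chain(layer("pos"), repeat(2, t_enc), layer("norm"))
trace = []
encoder([0.0], trace)
```

Running the encoder stub yields the expected layer ordering: positional embeddings first, then self-attention and feed-forward sublayers per block, and a final normalization.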

Related Work
The dot attention mechanism, now heavily used in Transformer models, was introduced by Luong et al. (2015) as part of an exploration of different attention mechanisms for RNN-based NMT models. Britz et al. (2017) performed an extensive exploration of hyperparameters of RNN-based NMT models. The explored variations include different attention mechanisms, RNN cell types and model depth.
Similar to our work, Schrimpf et al. (2017) define a language for exploring architectures. In this case the architectures are defined for RNN cells and not for the higher level model architecture. Using the language they perform an automatic search of RNN cell architectures.
For the application of image classification there have been several recent successful efforts to automatically search for architectures (Zoph and Le, 2016; Negrinho and Gordon, 2017; Liu et al., 2017).

1 https://github.com/tensorflow/tensor2tensor

Experiments
What follows is an extensive empirical analysis of current NMT architectures and how certain sublayers as defined through our ADL affect performance.

Setup
All experiments were run with an adapted version of SOCKEYE (Hieber et al., 2017), which can parse arbitrary model definitions that are expressed in the language described in Section 2.3. The code and configuration are available at https://github.com/awslabs/sockeye/tree/acl18 allowing researchers to easily replicate the experiments and to quickly try new NMT architectures by either making use of existing building blocks in novel ways or adding new ones.
In order to get data points on corpora of different sizes we ran experiments on both WMT and IWSLT data sets. For WMT we ran the majority of our experiments on the most recent WMT'17 data consisting of roughly 5.9 million training sentences for English-German (EN→DE) and 4.5 million sentences for Latvian-English (LV→EN). We used newstest2016 as validation data and report metrics calculated on newstest2017. For the smaller IWSLT'16 English-German corpus, which consists of roughly 200 thousand training sentences, we used TED.tst2013 as validation data and report numbers for TED.tst2014.
For both WMT'17 and IWSLT'16 we preprocessed all data using the Moses2 tokenizer and apply Byte Pair Encoding (BPE) (Sennrich et al., 2015) with 32,000 merge operations. Unless noted otherwise we run each experiment three times with different random seeds and report the mean and standard deviation of the BLEU and METEOR (Lavie and Denkowski, 2009) scores. In order to compare to previous work, we also ran an additional experiment on WMT'14 using the same data as Vaswani et al. (2017), as provided in preprocessed form through tensor2tensor.3 This data set consists of WMT'16 training data, which has been tokenized and byte pair encoded with 32,000 merge operations. Evaluation is done on tokenized and compound-split newstest2014 data using multi-bleu.perl in order to get scores comparable to Vaswani et al. (2017). As seen in Table 1, our Transformer implementation achieves a score equivalent to the originally reported numbers.
On the smaller IWSLT data we use d_model = 512 and on WMT d_model = 256 for all models. Models are trained with 6 encoder and 6 decoder blocks, where in the Transformer model a layer refers to a full encoder or decoder block. All convolutional layers use a kernel of size 3 and a ReLU activation, unless noted otherwise. RNNs use LSTM cells. For training we use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.0002. The learning rate is decayed by a factor of 0.7 whenever the validation perplexity does not improve for 8 consecutive checkpoints, where a checkpoint is created every 4,000 updates on WMT and 1,000 updates on IWSLT. All models use label smoothing (Szegedy et al., 2016) with ls = 0.1.
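The plateau-based learning-rate decay described above can be sketched as follows; the exact tie-breaking and reset behavior after a decay are assumptions of this sketch, not details taken from the paper:

```python
def plateau_schedule(val_ppls, lr=0.0002, factor=0.7, patience=8):
    """Multiply the learning rate by `factor` whenever validation perplexity
    has not improved for `patience` consecutive checkpoints. Returns the
    learning rate in effect after each checkpoint."""
    best = float("inf")
    bad = 0
    lrs = []
    for ppl in val_ppls:
        if ppl < best:
            best, bad = ppl, 0
        else:
            bad += 1
            if bad == patience:
                lr *= factor
                bad = 0   # assumed: the patience counter resets after a decay
        lrs.append(lr)
    return lrs

# Nine checkpoints with no improvement after the first trigger one decay step.
lrs = plateau_schedule([10.0] + [10.0] * 8)
```

With the paper's settings (initial rate 0.0002, factor 0.7, patience 8), eight consecutive non-improving checkpoints reduce the rate to 0.00014.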

What to attend to?
Source attention is typically based on the top encoder block. With multiple source attention layers, one could hypothesize that it is beneficial to allow attention on encoder blocks other than the top one. It might, for example, be beneficial for lower decoder blocks to attend to encoder blocks from the same level, as they represent the same level of abstraction. Inversely, assuming that translation is done in a coarse-to-fine manner, it might help to first use the uppermost encoder block and then gradually lower-level representations. The result of modifying the source attention mechanism to use different encoder blocks is shown in Table 2. The variations include using the result of the encoder Transformer block at the same level as the decoder Transformer block (increasing) and using the upper encoder Transformer block in the first decoder block and then gradually using the lower blocks (decreasing).
We can see that attention on the upper encoder block performs best and that no gains can be obtained by attending to different encoder blocks in the source attention mechanism.

Network Structure
The Transformer sets itself apart from both standard RNN models and convolutional models by more than just the multi-head self-attention blocks.

RNN to Transformer
The differences to the RNN include multiple source attention layers, multi-head attention, layer normalization and residual upscaling feed-forward layers. Additionally, RNN models typically use single-head MLP attention instead of dot attention. This raises the question of which aspect contributes most to the performance of the Transformer. Table 3 shows the result of taking an RNN and changing the architecture step by step to resemble the Transformer. We start with a standard RNN architecture with MLP attention, similar to Luong et al. (2015) as described in Section 2.4, with and without input feeding, denoted as RNMT.
Next, we take a model with a residual connection around the encoder bi-RNN such that the encoder is defined as dropout → res_d(birnn) → repeat(5, res_d(rnn)).
The decoder uses a residual single-head dot attention and no input feeding, and is defined as dropout → repeat(6, res_d(rnn)) → res_d(dot_src_att) → res_d(ffl). We denote this model as RNN in Table 3. This model is then changed to use multi-head attention (mh), positional embeddings (pos), layer normalization on the inputs of the residual blocks (norm), an attention mechanism in a residual block after every RNN layer with multiple (multi-att) or a single head (multi-att-1h), and finally a residual upscaling feed-forward layer added after each attention block (ff).

Table 3: Transforming an RNN into a Transformer style architecture. + shows the incrementally added variation. / denotes an alternative variation to which the subsequent + is relative.

The final architecture of the encoder after applying these variations is pos → res_nd(birnn) → res_nd(ffl) → repeat(5, res_nd(rnn) → res_nd(ffl)) → norm and of the decoder pos → repeat(6, res_nd(rnn) → res_nd(mh_dot_src_att) → res_nd(ffl)) → norm.
Comparing this to the Transformer as defined in Section 2.4, we note that the model is identical to the Transformer, except that each self-attention has been replaced by an RNN or bi-RNN. Table 3 shows that not using input feeding has a negative effect on the result, which, however, can be compensated for by the explored model variations.
With just a single attention mechanism the model benefits from multiple attention heads. The gains are even larger when an attention mechanism is added to every layer. With multiple source attention mechanisms the benefit of multiple heads decreases. Layer normalization on the inputs of the residual blocks has a small negative effect in all settings and metrics. As RNNs can learn to encode positional information, positional embeddings are not strictly necessary; indeed, we observe no gains but rather even a small drop in BLEU and METEOR for WMT'17 EN→DE when using them. Adding feed-forward layers leads to a large and consistent performance boost. While the final model, which is a Transformer model where each self-attention has been replaced by an RNN, makes up for a large part of the difference between the baseline and the Transformer, it is still outperformed by the Transformer. The largest gains come from multiple attention mechanisms and residual feed-forward layers.
CNN to Transformer While the convolutional models have much more in common with the Transformer than the RNN based models, there are still some notable differences. Like the Transformer, convolutional models have no dependency between decoder time steps during training, use multiple source attention mechanisms and use a slightly different residual structure, as seen in Section 2.4. The Transformer uses a multi-head scaled dot attention while the ConvS2S model uses an unscaled single head dot attention. Other differences include the use of layer normalization as well as residual feed-forward blocks in the Transformer.
The result of making a CNN based architecture more and more similar to the Transformer can be seen in Table 4. As a baseline we use a simple residual CNN structure with a residual single-head dot attention, denoted as CNN in Table 4. On the encoder side we have pos → repeat(6, res_d(cnn)) and for the decoder pos → repeat(6, res_d(cnn) → res_d(dot_src_att)). This is similar to, but slightly simpler than, the ConvS2S model described in Section 2.4. In the experiments we explore both the GLU and ReLU as non-linearities for the CNN.
Adding layer normalization (norm), multi-head attention (mh) and upscaling residual feed-forward layers (ff), we arrive at a model that is identical to a Transformer where the self-attention layers have been replaced by CNNs. This means that we have the following architecture on the encoder side: pos → repeat(6, res_nd(cnn) → res_nd(ffl)) → norm.
For the decoder we have pos → repeat(6, res_nd(cnn) → res_nd(mh_dot_src_att) → res_nd(ffl)) → norm.
While in the baseline the GLU activation works better than the ReLU activation, when layer normalization, multi-head attention and residual feed-forward layers are added the performance is similar. Except for IWSLT, multi-head attention gives consistent gains over single-head attention. The largest gains can, however, be observed from the addition of residual feed-forward layers. The performance of the final model, which is very similar to a Transformer where each self-attention has been replaced by a CNN, matches the performance of the Transformer on IWSLT EN→DE but is still 0.7 BLEU points worse on WMT'17 EN→DE and two BLEU points worse on WMT'17 LV→EN.

Self-attention variations
At the core of the Transformer are self-attentional layers, which take the role previously occupied by RNNs and CNNs. Self-attention has the advantage that any two positions are directly connected and that, similar to CNNs, there are no dependencies between consecutive time steps so that the computation can be fully parallelized across time. One disadvantage is that relative positional information is not directly represented and one needs to rely on the different heads to make up for this. In a CNN information is constrained to a local window which grows linearly with depth. Relative positions are therefore taken into account. While an RNN keeps an internal state, which can be used in future time steps, it is unclear how well this works for very long range dependencies (Koehn and Knowles, 2017;Bentivogli et al., 2016). Additionally, having a dependency on the previous hidden state inhibits any parallelization across time.
Given the different advantages and disadvantages we selectively replace self-attention on the encoder and decoder side in order to see where the model benefits most from self-attention.
We take the encoder and decoder block defined in Section 2.4 and try out different layers in place of the self-attention. Concretely, we have t_enc = res_nd(x_enc) → res_nd(ffl) on the encoder side and t_dec = res_nd(x_dec) → res_nd(mh_dot_src_att) → res_nd(ffl) on the decoder side. Table 5 shows the result of replacing x_enc and x_dec with either self-attention, a CNN with ReLU activation, or an RNN. Note that with self-attention used for both x_enc and x_dec we recover the Transformer model. Additionally, we remove the residual block on the decoder side entirely (none). This results in a decoder block which only has information about the previous target word y_t through the word embedding that is fed as the input to the first layer; the decoder block is reduced to t_dec = res_nd(mh_dot_src_att) → res_nd(ffl). In addition to that, we try a combination where the first and fourth blocks use self-attention, the second and fifth an RNN, and the third and sixth a CNN (combined).
Replacing the self-attention on both the encoder and the decoder side with an RNN or CNN results in a degradation of performance. In most settings, such as WMT'17 EN→DE for both variations and WMT'17 LV→EN for the RNN, the performance is comparable when replacing only the decoder-side self-attention. For the encoder, however, except for IWSLT, we see a drop in performance of up to 1.5 BLEU points when not using self-attention. Therefore, self-attention seems to be more important on the encoder side than on the decoder side. Despite the disadvantage of having a limited context window, the CNN performs as well as self-attention on the decoder side on IWSLT and WMT'17 EN→DE in terms of BLEU, and only slightly worse in terms of METEOR. The combination of the three mechanisms (combined) on the decoder side performs almost identically to the full Transformer model, except for IWSLT where it is slightly worse.
It is surprising how well the model works without any decoder self-attention, as the decoder essentially loses all information about the history of generated words. Translations are entirely based on the previous word, provided through the target-side word embedding, and the current position, provided through the positional embedding.

Conclusion
We described an ADL for specifying NMT architectures based on composable building blocks. Instead of committing to a single architecture, the language allows for combining architectures on a granular level. Using this language we explored how specific aspects of the Transformer architecture can successfully be applied to RNNs and CNNs. We performed an extensive evaluation on IWSLT EN→DE, WMT'17 EN→DE and LV→EN, reporting both BLEU and METEOR over multiple runs in each setting.
We found that RNN based models benefit from multiple source attention mechanisms and residual feed-forward blocks. CNN based models on the other hand can be improved through layer normalization and also feed-forward blocks. These variations bring the RNN and CNN based models close to the Transformer. Furthermore, we showed that one can successfully combine architectures. We found that self-attention is much more important on the encoder side than it is on the decoder side, where even a model without self-attention performed surprisingly well. For the data sets we evaluated on, models with self-attention on the encoder side and either an RNN or CNN on the decoder side performed competitively to the Transformer model in most cases.
We make our implementation available so that it can be used for exploring novel architecture variations.