Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures

Recently, non-recurrent architectures (convolutional, self-attentional) have outperformed RNNs in neural machine translation. CNNs and self-attentional networks can connect distant words via shorter network paths than RNNs, and it has been speculated that this improves their ability to model long-range dependencies. However, this theoretical argument has not been tested empirically, nor have alternative explanations for their strong performance been explored in depth. We hypothesize that the strong performance of CNNs and self-attentional networks could also be due to their ability to extract semantic features from the source text, and we evaluate RNNs, CNNs and self-attention networks on two tasks: subject-verb agreement (where capturing long-range dependencies is required) and word sense disambiguation (where semantic feature extraction is required). Our experimental results show that: 1) self-attentional networks and CNNs do not outperform RNNs in modeling subject-verb agreement over long distances; 2) self-attentional networks perform distinctly better than RNNs and CNNs on word sense disambiguation.

Recurrent neural networks (RNNs) (Elman, 1990) can easily deal with variable-length input sentences and thus are a natural choice for the encoder and decoder of NMT systems. Modern variants of RNNs, such as GRUs (Cho et al., 2014) and LSTMs (Hochreiter and Schmidhuber, 1997), address the difficulty of training recurrent networks with long-range dependencies. Gehring et al. (2017) introduce a neural architecture where both the encoder and decoder are based on CNNs, and report better BLEU scores than RNN-based NMT models. Moreover, the computation over all tokens can be fully parallelized during training, which increases efficiency. Vaswani et al. (2017) propose Transformer models, which are built entirely with attention layers, without convolution or recurrence. They report new state-of-the-art BLEU scores for EN→DE and EN→FR. Yet, the BLEU metric is quite coarse-grained, and offers no insight as to which aspects of translation are improved by different architectures.
To explain the observed improvements in BLEU, previous work has drawn on theoretical arguments. Both Gehring et al. (2017) and Vaswani et al. (2017) argue that the length of the paths in neural networks between co-dependent elements affects the ability to learn these dependencies: the shorter the path, the easier the model learns such dependencies. The papers argue that Transformers and CNNs are better suited than RNNs to capture long-range dependencies.
However, this claim is based on a theoretical argument and has not been empirically tested. We argue other abilities of non-recurrent networks could be responsible for their strong performance. Specifically, we hypothesize that the improvements in BLEU are due to CNNs and Transformers being strong semantic feature extractors.
In this paper, we evaluate all three popular NMT architectures: models based on RNNs (referred to as RNNS2S in the remainder of the paper), based on CNNs (referred to as ConvS2S) and self-attentional models (referred to as Transformers). Motivated by the aforementioned theoretical claims regarding path length and semantic feature extraction, we evaluate their performance on a subject-verb agreement task (that requires modeling long-range dependencies) and a word sense disambiguation (WSD) task (that requires extracting semantic features). Both tasks build on test sets of contrastive translation pairs, Lingeval97 (Sennrich, 2017) and ContraWSD (Rios et al., 2017).
The main contributions of this paper can be summarized as follows:
• We test the theoretical claims that architectures with shorter paths through networks are better at capturing long-range dependencies. Our experimental results on modeling subject-verb agreement over long distances do not show any evidence that Transformers or CNNs are superior to RNNs in this regard.
• We empirically show that the number of attention heads in Transformers impacts their ability to capture long-distance dependencies. Specifically, many-headed multi-head attention is essential for modeling long-distance phenomena with only self-attention.
• We empirically show that Transformers excel at WSD, indicating that they are strong semantic feature extractors.

Related work
Yin et al. (2017) are the first to compare CNNs, LSTMs and GRUs on several NLP tasks. They find that CNNs are better at tasks related to semantics, while RNNs are better at syntax-related tasks, especially for longer sentences. Based on the work of Linzen et al. (2016), Bernardy and Lappin (2017) find that RNNs perform better than CNNs on a subject-verb agreement task, which is a good proxy for how well long-range dependencies are captured. Tran et al. (2018) find that a Transformer language model performs worse than an RNN language model on a subject-verb agreement task. They, too, note that this is especially true as the distance between subject and verb grows, even if RNNs resulted in a higher perplexity on the validation set. This result of Tran et al. (2018) is clearly in contrast to the general finding that Transformers are better than RNNs for NMT tasks. Bai et al. (2018) evaluate CNNs and LSTMs on several sequence modeling tasks. They conclude that CNNs are better than RNNs for sequence modeling. However, their CNN models perform much worse than the state-of-the-art LSTM models on some sequence modeling tasks, as they themselves state in the appendix. Tang et al. (2018) evaluate different RNN architectures and Transformer models on the task of historical spelling normalization, which translates a historical spelling into its modern form. They find that Transformer models surpass RNN models only in high-resource conditions.
In contrast to previous studies, we focus on the machine translation task, where architecture comparisons so far are mostly based on BLEU.

NMT Architectures
We evaluate three different NMT architectures: RNN-based models, CNN-based models, and Transformer-based models. All of them have a bipartite structure in the sense that they consist of an encoder and a decoder. The encoder and the decoder interact via a soft-attention mechanism (Bahdanau et al., 2015; Luong et al., 2015), with one or multiple attention layers.
In the following sections, $h^l_i$ is the hidden state at step $i$ of layer $l$, $h^l_{i-1}$ represents the hidden state at the previous step of layer $l$, $h^{l-1}_i$ is the hidden state at step $i$ of layer $l-1$, $E_{x_i}$ represents the embedding of $x_i$, and $e_{pos,i}$ denotes the positional embedding at position $i$.

RNN-based NMT
RNNs are stateful networks that change as new inputs are fed to them, and each state has a direct connection only to the previous state. Thus, the path length of any two tokens with a distance of n in RNNs is exactly n. Figure 1 (a) shows an illustration of RNNs.
[Figure 1: Architectures over input tokens $x_1$ to $x_5$: (a) RNN, (b) CNN (with padding), (c) Self-attention.]

In deep architectures, two adjacent layers are commonly connected with residual connections. In the $l$th encoder layer, $h^l_i$ is generated by Equation 1,

$$h^l_i = h^{l-1}_i + f_{rnn}(h^{l-1}_i, h^l_{i-1}) \qquad (1)$$

where $f_{rnn}$ is the RNN (GRU or LSTM) function.
In the first layer, $h^0_i = f_{rnn}(E_{x_i}, h^0_{i-1})$. In addition to the connection between the encoder and decoder via attention, the initial state of the decoder is usually initialized with the average of the hidden states or the last hidden state of the encoder.

CNN-based NMT
CNNs are hierarchical networks, in that convolution layers capture local correlations. The local context size depends on the size of the kernel and the number of layers. In order to keep the output the same length as the input, CNN models add padding symbols to input sequences. Given an $L$-layer CNN with a kernel size $k$, the largest context size is $L(k-1)$. For any two tokens in a local context with a distance of $n$, the path between them is only $\lceil n/(k-1) \rceil$.
As Figure 1 (b) shows, a 2-layer CNN with kernel size 3 "sees" an effective local context of 5 tokens. The path between the first token and the fifth token is only 2 convolutions (see footnote 1). Since CNNs do not have a means to infer the position of elements in a sequence, positional embeddings are introduced.
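The hidden state $h^l_i$ shown in Equation 2 is related to the hidden states in the same convolution and the hidden state $h^{l-1}_i$ from the previous layer.

$$h^l_i = h^{l-1}_i + f_{cnn}\left(W^l \left[h^{l-1}_{i-\lfloor k/2 \rfloor}; \ldots; h^{l-1}_{i+\lfloor k/2 \rfloor}\right] + b^l\right) \qquad (2)$$

Here $k$ denotes the kernel size of CNNs and $f_{cnn}$ is a nonlinearity. ConvS2S chooses Gated Linear Units (GLU), which can be viewed as a gated variation of ReLUs. $W^l$ are called convolutional filters and $b^l$ is a bias term. In the input layer, $h^0_i = E_{x_i} + e_{pos,i}$.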
Footnote 1: Note that the decoder employs masking to avoid conditioning the model on future information, which reduces the effective context size to $L\frac{k-1}{2}$.
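The context-size arithmetic in this subsection can be made concrete with a small sketch. The function names below are illustrative and are not part of ConvS2S or Sockeye. The sketch measures the context as a distance in tokens, so a span of L(k-1) covers L(k-1)+1 token positions, which matches the 5-token example for 2 layers and kernel size 3. It also assumes an odd kernel size, so that the halved decoder span is a whole number.

```python
def encoder_span(num_layers: int, kernel_size: int) -> int:
    """Largest token distance a stack of convolutions can relate: L * (k - 1)."""
    return num_layers * (kernel_size - 1)


def tokens_seen(num_layers: int, kernel_size: int) -> int:
    """Number of token positions inside that span, counting both endpoints."""
    return encoder_span(num_layers, kernel_size) + 1


def decoder_span(num_layers: int, kernel_size: int) -> int:
    """Span of a masked decoder, which only looks left: L * (k - 1) / 2.

    Assumes an odd kernel size so that the division is exact.
    """
    return num_layers * (kernel_size - 1) // 2


if __name__ == "__main__":
    print(encoder_span(2, 3), tokens_seen(2, 3))   # 4 5
    print(encoder_span(8, 3), decoder_span(8, 3))  # 16 8
    print(encoder_span(8, 7))                      # 48
```

Under these assumptions, an 8-layer encoder with kernel size 3 relates tokens up to 16 positions apart, and kernel size 7 extends this to 48.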

Transformer-based NMT
Transformers rely heavily on self-attention networks. Each token is connected to any other token in the same sentence directly via self-attention. Moreover, Transformers feature attention networks with multiple attention heads.
Multi-head attention is more fine-grained, compared to conventional 1-head attention mechanisms. Figure 1 (c) illustrates that any two tokens are connected directly: the path length between the first and the fifth tokens is 1. Similar to CNNs, positional information is also preserved in positional embeddings. The hidden state in the Transformer encoder is calculated from all hidden states of the previous layer. The hidden state $h^l_i$ in a self-attention network is computed as in Equation 3,

$$h^l_i = h^{l-1}_i + f(\text{self-attention}(h^{l-1}_i)) \qquad (3)$$

where $f$ represents a feedforward network with ReLU as the activation function and layer normalization. In the input layer, $h^0_i = E_{x_i} + e_{pos,i}$. The decoder additionally has a multi-head attention over the encoder hidden states.
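The path lengths quoted in the three subsections above (n for RNNs, about n/(k-1) convolutions for CNNs, and 1 for self-attention) can be compared with a short sketch. The function name and the architecture labels are hypothetical. The CNN case rounds up and assumes the network is deep enough for the two tokens to fall inside one local context.

```python
import math


def path_length(architecture: str, distance: int, kernel_size: int = 3) -> int:
    """Number of network steps connecting two tokens `distance` positions apart."""
    if distance < 1:
        raise ValueError("distance must be at least 1")
    if architecture == "rnn":
        # Each state connects only to the previous state.
        return distance
    if architecture == "cnn":
        # Each convolution layer bridges up to (k - 1) positions.
        return math.ceil(distance / (kernel_size - 1))
    if architecture == "self-attention":
        # Every token attends to every other token directly.
        return 1
    raise ValueError(f"unknown architecture: {architecture}")


if __name__ == "__main__":
    # First and fifth token of a sentence: distance 4.
    for arch in ("rnn", "cnn", "self-attention"):
        print(arch, path_length(arch, 4))
    # rnn 4
    # cnn 2
    # self-attention 1
```

The sketch only restates the theoretical argument. Whether shorter paths translate into better modeling of long-range dependencies is the empirical question examined in the experiments.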

Contrastive Evaluation of Machine Translation
Since we evaluate different NMT architectures explicitly on subject-verb agreement and WSD (both happen implicitly during machine translation), BLEU as a measure of overall translation quality is not helpful. In order to conduct these targeted evaluations, we use contrastive test sets. Sets of contrastive translations can be used to analyze specific types of errors. Human reference translations are paired with one or more contrastive variants, where a specific type of error is introduced automatically.
The evaluation procedure then exploits the fact that NMT models are conditional language models. By virtue of this, given any source sentence S and target sentence T , any NMT model can assign to them a probability P (T |S). If a model assigns a higher score to the correct target sentence than to a contrastive variant that contains an error, we consider it a correct decision. The accuracy of a model on such a test set is simply the percentage of cases where the correct target sentence is scored higher than all contrastive variants.
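The decision rule described above can be sketched in a few lines. This is a minimal illustration rather than the scoring interface added to Sockeye: the function name is hypothetical, the scores are assumed to be model scores such as log P(T|S) where higher is better, and the numbers in the usage example are made up.

```python
from typing import Sequence


def contrastive_accuracy(
    reference_scores: Sequence[float],
    contrastive_scores: Sequence[Sequence[float]],
) -> float:
    """Fraction of instances whose reference outscores all contrastive variants."""
    if len(reference_scores) != len(contrastive_scores):
        raise ValueError("need one list of contrastive scores per reference")
    if not reference_scores:
        raise ValueError("empty test set")
    correct = sum(
        1
        for ref, variants in zip(reference_scores, contrastive_scores)
        if all(ref > variant for variant in variants)
    )
    return correct / len(reference_scores)


if __name__ == "__main__":
    # Hypothetical log-probabilities for three test instances.
    refs = [-10.2, -8.0, -12.5]
    variants = [[-11.0], [-7.5, -9.0], [-13.0, -14.1]]
    print(round(contrastive_accuracy(refs, variants), 3))  # 0.667
```

The function returns a fraction; multiplying by 100 gives the percentage reported as accuracy. The second instance counts as an error because one contrastive variant scores higher than the reference.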
Contrastive evaluation tests the sensitivity of NMT models to specific translation errors. The contrastive examples are designed to capture specific translation errors rather than evaluating the global quality of NMT models. Although they do not replace metrics such as BLEU, they give further insights into the performance of models, on specific linguistic phenomena.

Lingeval97
Lingeval97 has over 97,000 English→German contrastive translation pairs featuring different linguistic phenomena, including subject-verb agreement, noun phrase agreement, separable verb-particle constructions, transliterations and polarity. In this paper, we are interested in evaluating the performance on long-range dependencies. Thus, we focus on the subject-verb agreement category which consists of 35,105 instances.
In German, verbs must agree with their subjects in both grammatical number and person. Therefore, in a contrastive translation, the grammatical number of a verb is swapped.

ContraWSD
In ContraWSD, given an ambiguous word in the source sentence, the correct translation is replaced by another meaning of the ambiguous word which is incorrect. For example, in a case where the English word line is the correct translation of the German source word Schlange, ContraWSD replaces line with the other translations of Schlange, such as snake, serpent, to generate contrastive translations.
For German→English, ContraWSD contains 84 different German word senses. It has 7,200 German→English lexical ambiguities, and each lexical ambiguity instance has 3.5 contrastive translations on average. For German→French, it consists of 71 different German word senses. There are 6,700 German→French lexical ambiguities, with an average of 2.2 contrastive translations per lexical ambiguity instance. All the ambiguous words are nouns so that the disambiguation is not possible simply based on syntactic context.

Subject-verb Agreement
The subject-verb agreement task is the most popular choice for evaluating the ability to capture long-range dependencies and has been used in many studies (Linzen et al., 2016; Bernardy and Lappin, 2017; Sennrich, 2017; Tran et al., 2018). Thus, we also use this task to evaluate different NMT architectures on long-range dependencies.

Experimental Settings
Different architectures are hard to compare fairly because many factors affect performance. We aim to create a level playing field for the comparison by training with the same toolkit, Sockeye (Hieber et al., 2017) which is based on MXNet (Chen et al., 2015). In addition, different hyperparameters and training techniques (such as label smoothing or layer normalization) have been found to affect the performance (Chen et al., 2018). We apply the same hyperparameters and techniques for all architectures except the parameters of each specific architecture. Since the best hyperparameters for different architectures may be diverse, we verify our hyperparameter choice by comparing our results to those published previously. Our models achieve similar performance to that reported by Hieber et al. (2017) with the best available settings. In addition, we extend Sockeye with an interface that enables scoring of existing translations, which is required for contrastive evaluation.
All the models are trained with 2 GPUs. During training, each mini-batch contains 4096 tokens. A model checkpoint is saved every 4,000 updates. We use Adam (Kingma and Ba, 2015) as the optimizer. The initial learning rate is set to 0.0002. If the performance on the validation set has not improved for 8 checkpoints, the learning rate is multiplied by 0.7. We set the early stopping patience to 32 checkpoints. All the neural networks have 8 layers. For RNNS2S, the encoder has 1 bi-directional LSTM and 6 stacked uni-directional LSTMs, and the decoder is a stack of 8 uni-directional LSTMs. The size of embeddings and hidden states is 512. We apply layer normalization and label smoothing (0.1) in all models. We tie the source and target embeddings. The dropout rate of embeddings and Transformer blocks is set to 0.1. The dropout rate of RNNs and CNNs is 0.2. The kernel size of CNNs is 3. Transformers have an 8-head attention mechanism.
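The learning-rate and early-stopping rules above can be sketched as a small simulation. This is not Sockeye's scheduler: the function name is hypothetical, the sketch assumes that lower validation perplexity counts as an improvement, and it assumes that the reduction counter restarts after each reduction. The default values are the ones stated in this section.

```python
from typing import Sequence, Tuple


def simulate_schedule(
    val_perplexities: Sequence[float],
    initial_lr: float = 0.0002,
    reduce_patience: int = 8,
    reduce_factor: float = 0.7,
    stop_patience: int = 32,
) -> Tuple[float, int]:
    """Return (final learning rate, number of checkpoints run)."""
    lr = initial_lr
    best = float("inf")
    since_best = 0    # checkpoints since the best validation score
    since_reduce = 0  # non-improving checkpoints since the last LR change
    for checkpoint, ppl in enumerate(val_perplexities, start=1):
        if ppl < best:
            best = ppl
            since_best = 0
            since_reduce = 0
            continue
        since_best += 1
        since_reduce += 1
        if since_reduce >= reduce_patience:
            lr *= reduce_factor  # multiply the learning rate by 0.7
            since_reduce = 0
        if since_best >= stop_patience:
            return lr, checkpoint  # early stopping
    return lr, len(val_perplexities)


if __name__ == "__main__":
    # Hypothetical run: one improvement, then a long plateau.
    history = [10.0] + [10.5] * 40
    final_lr, checkpoints = simulate_schedule(history)
    print(checkpoints)  # 33
```

In this toy run the learning rate is reduced four times before training stops at checkpoint 33, that is, 32 checkpoints after the best validation score.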
To test the robustness of our findings, we also test a different style of RNN architecture, from a different toolkit. We evaluate bi-deep transitional RNNs (Miceli Barone et al., 2017).

We use training data from the WMT17 shared task (see footnote 2). We use newstest2013 as the validation set, and use newstest2014 and newstest2017 as the test sets. All BLEU scores are computed with SacreBLEU (Post, 2018). There are about 5.9 million sentence pairs in the training set after preprocessing with Moses scripts. We learn a joint BPE model with 32,000 subword units (Sennrich et al., 2016). We employ the model that has the best perplexity on the validation set for the evaluation.

Footnote 2: http://www.statmt.org/wmt17/translation-task.html

Overall Results
Table 2 reports the BLEU scores on newstest2014 and newstest2017, the perplexity on the validation set, and the accuracy on long-range dependencies (see footnote 3). Transformer achieves the highest accuracy on this task and the highest BLEU scores on both newstest2014 and newstest2017. Compared to RNNS2S, ConvS2S has slightly better results regarding BLEU scores, but a much lower accuracy on long-range dependencies. The RNN-bideep model achieves distinctly better BLEU scores and a higher accuracy on long-range dependencies.

Footnote 3: We report average accuracy on instances where the distance between subject and verb is longer than 10 words.

Figure 2 shows the performance of different architectures on the subject-verb agreement task. It is evident that Transformer, RNNS2S, and RNN-bideep perform much better than ConvS2S on long-range dependencies. However, Transformer, RNNS2S, and RNN-bideep are all robust over long distances. Transformer outperforms RNN-bideep for distances 11-12, but RNN-bideep performs equally or better for distance 13 or higher. Thus, we cannot conclude that Transformer models are particularly stronger than RNN models for long distances, despite achieving higher average accuracy on distances above 10.
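The two views of the agreement results, accuracy per distance and a single long-range figure for distances above 10, can be computed as in the following sketch. The function names and the toy data are hypothetical, and the long-range figure is assumed to be averaged over instances rather than over distance buckets.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple

# One entry per test instance: (subject-verb distance, model decision correct?)
Result = Tuple[int, bool]


def accuracy_by_distance(results: Sequence[Result]) -> Dict[int, float]:
    """Accuracy for each subject-verb distance, sorted by distance."""
    totals: Dict[int, int] = defaultdict(int)
    correct: Dict[int, int] = defaultdict(int)
    for distance, is_correct in results:
        totals[distance] += 1
        correct[distance] += int(is_correct)
    return {d: correct[d] / totals[d] for d in sorted(totals)}


def long_range_accuracy(results: Sequence[Result], min_distance: int = 10) -> float:
    """Accuracy over instances whose distance is longer than `min_distance`."""
    selected = [is_correct for distance, is_correct in results if distance > min_distance]
    if not selected:
        raise ValueError("no instances beyond the distance threshold")
    return sum(selected) / len(selected)


if __name__ == "__main__":
    toy = [(3, True), (11, True), (11, False), (14, True), (16, True)]
    print(accuracy_by_distance(toy))  # {3: 1.0, 11: 0.5, 14: 1.0, 16: 1.0}
    print(long_range_accuracy(toy))   # 0.75
```

Reporting both views matters for the comparison above: a model can have the higher average beyond distance 10 while being matched or outperformed at individual distances.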

CNNs
Theoretically, the performance of CNNs will drop when the distance between the subject and the verb exceeds the local context size. However, ConvS2S is also clearly worse than RNNS2S for subject-verb agreement within the local context size. In order to explore how the ability of ConvS2S to capture long-range dependencies depends on the local context size, we train additional systems, varying the number of layers and kernel size. Table 3 shows the performance of different ConvS2S models. Figure 3 displays the performance of two 8-layer CNNs with kernel size 3 and 7, a 6-layer CNN with kernel size 3, and RNNS2S. The results indicate that the accuracy increases when the local context size becomes larger, but the BLEU score does not. Moreover, ConvS2S is still not as good as RNNS2S for subject-verb agreement.

Regarding the explanation for the poor performance of ConvS2S, we identify the limited context size as a major problem. One possible explanation for the remaining difference is that the scale invariance of CNNs is relatively poor (Xu et al., 2014). Scale invariance is important in NLP, where the distance between arguments is flexible, and current recurrent or attentional architectures are better suited to handle this variance. Our empirical results do not confirm the theoretical arguments in Gehring et al. (2017) that CNNs can capture long-range dependencies better with a shorter path.

The BLEU score does not correlate well with the targeted evaluation of long-range interactions. This is due to the locality of BLEU, which only measures on the level of n-grams, but it may also indicate that there are other trade-offs between the modeling of different phenomena depending on hyperparameters. If we aim to get better performance on long-range dependencies, we can take this into account when optimizing hyperparameters.

RNNs vs. Transformer
Even though Transformer achieves much better BLEU scores than RNNS2S and RNN-bideep, the accuracies of these architectures on long-range dependencies are close to each other in Figure 2.
Our experimental result contrasts with the result from Tran et al. (2018). They find that Transformers perform worse than LSTMs on the subject-verb agreement task, especially when the distance between the subject and the verb becomes longer. We perform several experiments to analyze this discrepancy with Tran et al. (2018).
A first hypothesis is that this is caused by the amount of training data, since we used much larger datasets than Tran et al. (2018). We retrain all the models with a small amount of training data similar to the amount used by Tran et al. (2018), about 135K sentence pairs. The other training settings are the same as in Section 4.1. We do not see the expected degradation of Transformer-s, compared to RNNS2S-s (see Figure 4). In Table 4, the performance of RNNS2S-s and Transformer-s is similar, including the BLEU scores on newstest2014, newstest2017, the perplexity on the validation set, and the accuracy on the long-range dependencies.
A second hypothesis is that the experimental settings lead to the different results. In order to investigate this, we do not only use a small training set, but also replicate the experimental settings of Tran et al. (2018). The main changes are neural network layers (8→4); embedding size (512→128); multi-head size (8→2); dropout rate (0.1→0.2); checkpoint save frequency (4,000→1,000), and initial learning rate (0.0002→0.001).
In the end, we get a result that is similar to Tran et al. (2018). In Figure 5, Transformer-re-h2 performs clearly worse than RNNS2S-re on long-range dependencies. By increasing the number of heads in multi-head attention, subject-verb accuracy over long distances can be improved substantially, even though it remains below that of RNNS2S-re. Also, the effect on BLEU is small. Our results suggest that the importance of multi-head attention with a large number of heads is larger than BLEU would suggest, especially for the modeling of long-distance phenomena, since multi-head attention provides a way for the model to attend to both local and distant context, whereas distant context may be overshadowed by local context in an attention mechanism with a single or few heads.
Although our study is not a replication of Tran et al. (2018), who work on a different task and a different test set, our results do suggest an alternative interpretation of their findings, namely that the poor performance of the Transformer in their experiments is due to hyperparameter choice. Rather than concluding that RNNs are superior to Transformers for the modeling of long-range dependency phenomena, we find that the number of heads in multi-head attention affects the ability of Transformers to model long-range dependencies in subject-verb agreement.

WSD
Our experimental results on the subject-verb agreement task demonstrate that CNNs and Transformers are not better at capturing long-range dependencies compared to RNNs, even though the paths in CNNs and Transformers are shorter. This finding is not in accord with the theoretical argument in both Gehring et al. (2017) and Vaswani et al. (2017). However, these architectures perform well empirically according to BLEU. Thus, we further evaluate these architectures on WSD, to test our hypothesis that non-recurrent architectures are better at extracting semantic features.

Experimental settings
We evaluate all architectures on ContraWSD on both DE→EN and DE→FR. We reuse the parameter settings in Section 4.1, except that: the initial learning rate of ConvS2S is reduced from 0.0003 to 0.0002 in DE→EN; the checkpoint saving frequency is changed from 4,000 to 1,000 in DE→FR because of the training data size.
For DE→EN, the training set, validation set, and test set are the same as for the other direction, EN→DE. For DE→FR, we use around 2.1 million sentence pairs from Europarl (v7) (Tiedemann, 2012) and News Commentary (v11) cleaned by Rios et al. (2017) as our training set. We use newstest2013 as the evaluation set, and use newstest2012 as the test set. All the data is preprocessed with Moses scripts.

In addition, we also compare to the best result reported for DE→EN, achieved by uedin-wmt17, which is an ensemble of 4 different models and reranked with right-to-left models. uedin-wmt17 is based on the bi-deep RNNs (Miceli Barone et al., 2017) that we mentioned before. To the original 5.9 million sentence pairs in the training set, they add 10 million synthetic pairs with back-translation.

Overall Results
Table 5 gives the performance of all the architectures, including the perplexity on validation sets, the BLEU scores on newstest, and the accuracy on ContraWSD. Transformers distinctly outperform RNNS2S and ConvS2S models on DE→EN and DE→FR. Moreover, the Transformer model on DE→EN also achieves higher accuracy than uedin-wmt17, although the BLEU score on newstest2017 is 1.4 lower than uedin-wmt17. We attribute this discrepancy between BLEU and WSD performance to the use of synthetic news training data in uedin-wmt17, which causes a large boost in BLEU due to better domain adaptation to newstest, but which is less helpful for ContraWSD, whose test set is drawn from a variety of domains.
For DE→EN, RNNS2S and ConvS2S have the same BLEU score on newstest2014, while ConvS2S has a higher score on newstest2017. However, the WSD accuracy of ConvS2S is 1.7% lower than that of RNNS2S. For DE→FR, ConvS2S achieves slightly better results on both BLEU scores and accuracy than RNNS2S.
The Transformer model strongly outperforms the other architectures on this WSD task, with a gap of 4-8 percentage points. This affirms our hypothesis that Transformers are strong semantic feature extractors.

Hybrid Encoder-Decoder Model
In recent work, Chen et al. (2018) find that hybrid architectures with a Transformer encoder and an RNN decoder can outperform a pure Transformer model. They speculate that the Transformer encoder is better at encoding or extracting features than the RNN encoder, whereas the RNN is better at conditional language modeling.
For WSD, it is unclear whether the most important component is the encoder, the decoder, or both. Following the hypothesis that Transformer encoders excel as semantic feature extractors, we train a hybrid encoder-decoder model (TransRNN) with a Transformer encoder and an RNN decoder.
The results (in Table 5) show that TransRNN performs better than RNNS2S, but worse than the pure Transformer, both in terms of BLEU and WSD accuracy. This indicates that WSD is not only done in the encoder, but that the decoder also affects WSD performance. We note that Chen et al. (2018) and Domhan (2018) introduce the techniques in Transformers into RNN-based models, with reportedly higher BLEU. Thus, it would be interesting to see if the same result holds true with their architectures.

Conclusion
In this paper, we evaluate three popular NMT architectures, RNNS2S, ConvS2S, and Transformers, on subject-verb agreement and WSD by scoring contrastive translation pairs.
We test the theoretical claims that shorter path lengths make models better capture long-range dependencies. Our experimental results show that:
• There is no evidence that CNNs and Transformers, which have shorter paths through networks, are empirically superior to RNNs in modeling subject-verb agreement over long distances.
• The number of heads in multi-head attention affects the ability of a Transformer to model long-range dependencies in the subject-verb agreement task.
• Transformer models excel at another task, WSD, compared to the CNN and RNN architectures we tested.
Lastly, our findings suggest that assessing the performance of NMT architectures means finding their inherent trade-offs, rather than simply computing their overall BLEU score. A clear understanding of those strengths and weaknesses is important to guide further work. Specifically, given the idiosyncratic limitations of recurrent and self-attentional models, combining them is an exciting line of research. The apparent weakness of CNN architectures on long-distance phenomena is also a problem worth tackling, and we can find inspiration from related work in computer vision (Xu et al., 2014).