Transformer Dissection: An Unified Understanding for Transformer’s Attention via the Lens of Kernel

Transformer is a powerful architecture that achieves superior performance on various sequence learning tasks, including neural machine translation, language understanding, and sequence prediction. At the core of the Transformer is the attention mechanism, which concurrently processes all inputs in the streams. In this paper, we present a new formulation of attention via the lens of the kernel. To be more precise, we observe that attention can be seen as applying a kernel smoother over the inputs, with the kernel scores being the similarities between inputs. This new formulation gives us a better way to understand individual components of the Transformer’s attention, such as how best to integrate the positional embedding. Another important advantage of our kernel-based formulation is that it paves the way to a larger space for composing the Transformer’s attention. As an example, we propose a new variant of the Transformer’s attention which models the input as a product of symmetric kernels. This approach achieves performance competitive with the current state-of-the-art model with less computation. In our experiments, we empirically study different kernel construction strategies on two widely used tasks: neural machine translation and sequence prediction.


Introduction
Transformer (Vaswani et al., 2017) is a relatively new architecture which outperforms traditional deep learning models such as Recurrent Neural Networks (RNNs) (Sutskever et al., 2014) and Temporal Convolutional Networks (TCNs) (Bai et al., 2018) for sequence modeling tasks across neural machine translation (Vaswani et al., 2017), language understanding (Devlin et al., 2018), sequence prediction (Dai et al., 2019), image generation (Child et al., 2019), video activity classification, music generation (Huang et al., 2018a), and multimodal sentiment analysis (Tsai et al., 2019a). Instead of performing recurrence (e.g., RNN) or convolution (e.g., TCN) over the sequences, Transformer is a feed-forward model that concurrently processes the entire sequence. At the core of the Transformer is its attention mechanism, which is proposed to integrate the dependencies between the inputs. There are up to three types of attention within the full Transformer model, as exemplified by the neural machine translation application (Vaswani et al., 2017): 1) Encoder self-attention considers the source sentence as input, generating a sequence of encoded representations, where each encoded token has a global dependency with other tokens in the input sequence. 2) Decoder self-attention considers the target sentence (e.g., the predicted target sequence for translation) as input, generating a sequence of decoded representations, where each decoded token depends on previous decoded tokens. 3) Decoder-encoder attention considers both encoded and decoded sequences, generating a sequence with the same length as the decoded sequence. It should be noted that some applications, such as sequence prediction (Dai et al., 2019), have only the decoder self-attention. In all cases, the Transformer's attentions follow the same general mechanism.
At a high level, the attention can be seen as a weighted combination of the input sequence, where the weights are determined by the similarities between elements of the input sequence. We note that this operation is agnostic to permutations of the input sequence (order is encoded with an extra positional embedding (Vaswani et al., 2017; Dai et al., 2019)). The above observation inspires us to connect Transformer's attention to kernel learning (Scholkopf and Smola, 2001): they both concurrently and order-agnostically process all inputs by calculating the similarity between the inputs. Therefore, in this paper, we present a new formulation for Transformer's attention via the lens of kernel. To be more precise, the new formulation can be interpreted as a kernel smoother (Wasserman, 2006) over the inputs in a sequence, where the kernel measures how similar two different inputs are. The main advantage of connecting attention to kernel is that it opens up a new family of attention mechanisms that can relate to the well-established literature in kernel learning (Scholkopf and Smola, 2001). As a result, we develop a new variant of attention which simply considers a product of symmetric kernels when modeling non-positional and positional embedding.
Furthermore, our proposed formulation naturally highlights the main components of Transformer's attention, enabling a better understanding of this mechanism: recent variants of Transformers (Huang et al., 2018b; Dai et al., 2019; Child et al., 2019; Tsai et al., 2019a) can be expressed through these individual components. Among all the components, we argue that the most important one is the construction of the kernel function. We empirically study multiple kernel forms and the ways to integrate positional embedding in neural machine translation (NMT) using the IWSLT'14 German-English (De-En) dataset (Edunov et al., 2017) and sequence prediction (SP) using the WikiText-103 dataset (Merity et al., 2016).

Attention
This section aims at providing an understanding of attention in Transformer via the lens of kernel. The inspiration for connecting the kernel (Scholkopf and Smola, 2001) and attention stems from the observation that both operations concurrently process all inputs and calculate the similarity between the inputs. We first introduce the background (i.e., the original formulation) of attention and then provide a new reformulation within the class of kernel smoothers (Wasserman, 2006). Next, we show that this new formulation allows us to explore a new family of attention while at the same time offering a framework to categorize previous attention variants (Vaswani et al., 2017; Huang et al., 2018b; Dai et al., 2019; Child et al., 2019; Tsai et al., 2019a). Last, we present a new form of attention, which requires fewer parameters and empirically reaches performance competitive with the state-of-the-art models.
For notation, we use lowercase to represent a vector (e.g., $x$), bold lowercase to represent a matrix (e.g., $\mathbf{x}$), calligraphic letters to denote a space (e.g., $\mathcal{X}$), and $S$ to denote a set. To relate the notation to sequence-to-sequence learning (Vaswani et al., 2017), $x$ represents a specific element of a sequence, $\mathbf{x} = [x_1, x_2, \cdots, x_T]$ denotes a sequence of features, $S_{\mathbf{x}} = \{x_1, x_2, \cdots, x_T\}$ represents the set whose elements are the features in sequence $\mathbf{x}$, and we refer to the space of sets $S_{\mathbf{x}}$ as $\mathcal{S}$.

Technical Background
Unlike recurrent computation (Sutskever et al., 2014) (i.e., RNNs) and temporal convolutional computation (Bai et al., 2018) (i.e., TCNs), Transformer's attention is an order-agnostic operation with respect to the order of its inputs (Vaswani et al., 2017). Hence, in the presentation of the paper, we consider the inputs as a set instead of a sequence. When viewing a sequence as a set, we lose the temporal (positional) information in the inputs, which is often crucial for sequence modeling (Sutskever et al., 2014). As a result, Transformer (Vaswani et al., 2017) introduced positional embedding to indicate the positional relation between the inputs. Formally, a sequence $\mathbf{x} = [x_1, x_2, \cdots, x_T]$ defines each element as $x_i = (f_i, t_i)$, with $f_i \in \mathcal{F}$ being the non-temporal feature at time $i$ and $t_i \in \mathcal{T}$ being a temporal feature (which we call the positional embedding). Note that $f_i$ can be the word representation (in neural machine translation (Vaswani et al., 2017)), a pixel in a frame (in video activity recognition), or a music unit (in music generation (Huang et al., 2018b)). $t_i$ can be a mixture of sine and cosine functions (Vaswani et al., 2017) or parameters that are learned during back-propagation (Dai et al., 2019; Ott et al., 2019). The feature vectors are defined over a joint space $\mathcal{X} := (\mathcal{F} \times \mathcal{T})$. The resulting permutation-invariant set is:
$$S_{\mathbf{x}} = \{x_1, x_2, \cdots, x_T\} = \{(f_1, t_1), (f_2, t_2), \cdots, (f_T, t_T)\}.$$
Following the definition by Vaswani et al. (2017), we use queries (q), keys (k), and values (v) to represent the inputs for the attention. To be more precise, $x_{\{q/k/v\}}$ denotes a query/key/value element in the query/key/value sequence $\mathbf{x}_{\{q/k/v\}}$ ($x_{\{q/k/v\}} \in S_{\mathbf{x}_{\{q/k/v\}}}$), with $S_{\mathbf{x}_{\{q/k/v\}}}$ being its set representation. We note that the input sequences are the same ($\mathbf{x}_q = \mathbf{x}_k$) for self-attention and are different ($\mathbf{x}_q$ from the decoder and $\mathbf{x}_k$ from the encoder) for encoder-decoder attention.
Given the introduced notation, the attention mechanism in the original Transformer (Vaswani et al., 2017) can be presented as:
$$\mathrm{Attention}(x_q; S_{\mathbf{x}_k}) = \mathrm{softmax}\left(\frac{x_q W_q (\mathbf{x}_k W_k)^\top}{\sqrt{d_k}}\right) \mathbf{x}_k W_v \tag{1}$$
with $x_q = f_q + t_q$, $x_k = f_k + t_k$, $W_{q/k/v}$ being the weights, and $d_k$ being the feature dimension of $x_k W_k$. Decoder self-attention further introduces a mask to block the visibility of elements in $S_{\mathbf{x}_k}$ to $x_q$. In particular, decoder self-attention considers the decoded sequence as inputs ($\mathbf{x}_k = \mathbf{x}_q$), where the decoded token at time $t$ is not allowed to access the future decoded tokens (i.e., tokens decoded at a time greater than $t$). In contrast, encoder self-attention and decoder-encoder attention apply no additional mask to Eq. (1).
Recent work (Dai et al., 2019; Huang et al., 2018b; Child et al., 2019; Parmar et al., 2018; Tsai et al., 2019a) proposed modifications to the Transformer for the purpose of better modeling the positional relation of the inputs (Huang et al., 2018b; Dai et al., 2019), appending additional keys in $S_{\mathbf{x}_k}$ (Dai et al., 2019), modifying the mask applied to Eq. (1) (Child et al., 2019), or applying attention to distinct feature types (Parmar et al., 2018; Tsai et al., 2019a). These works adopt designs of attention that differ from the original form (Eq. (1)). In our paper, we aim at providing a unified view via the lens of kernel.
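As a minimal sketch of Eq. (1) for a single query (not the authors' implementation), the computation can be written in plain Python. The helper names `matvec`, `softmax`, and `attention`, the optional `visible` argument standing in for the decoder mask, and the toy identity weights are all assumptions made for illustration.

```python
import math

def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def softmax(scores):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x_q, keys, W_q, W_k, W_v, visible=None):
    """Eq. (1) for one query x_q over a list of key vectors.

    `visible` is a hypothetical optional list of booleans that plays the
    role of the decoder mask; at least one key must be visible.
    """
    q = matvec(x_q, W_q)
    projected = [matvec(x_k, W_k) for x_k in keys]
    d_k = len(projected[0])
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in projected]
    if visible is not None:
        scores = [s if ok else float("-inf") for s, ok in zip(scores, visible)]
    weights = softmax(scores)
    values = [matvec(x_k, W_v) for x_k in keys]
    output = [sum(w * v[j] for w, v in zip(weights, values))
              for j in range(len(values[0]))]
    return output, weights

# Toy usage with identity projections (illustrative values only).
identity = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
out, w = attention([1.0, 0.0], keys, identity, identity, identity)
masked_out, masked_w = attention([1.0, 0.0], keys, identity, identity, identity,
                                 visible=[True, False])
```

With the second key masked out, all of the attention weight falls on the first key, which is the behavior the decoder mask is meant to enforce.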

Reformulation via the Lens of Kernel
We now provide the intuition to reformulate Eq. (1) via the lens of kernel. First, the softmax function can be realized as a probability function for $x_q$ observing the keys $\{x_k\}$ in $S_{\mathbf{x}_k}$ ($S_{\mathbf{x}_k}$ is the set representation of sequence $\mathbf{x}_k$). The probability is determined by the dot product between $x_q$ and $x_k$ with additional mappings $W_q$ and $W_k$ and scaling by $\sqrt{d_k}$; we note that this dot-product operation is an instance of a kernel function. We also introduce a set filtering function $M(x_q, S_{\mathbf{x}_k}): \mathcal{X} \times \mathcal{S} \to \mathcal{S}$, which returns the set of elements that operate with (or are connected/visible to) $x_q$. The filtering function $M(\cdot, \cdot)$ plays the role of the mask in decoder self-attention (Vaswani et al., 2017). Putting these together, we re-represent Eq. (1) in the following definition.
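Definition 1. Given a non-negative kernel function $k(\cdot, \cdot): \mathcal{X} \times \mathcal{X} \to \mathbb{R}^+$, a set filtering function $M(\cdot, \cdot): \mathcal{X} \times \mathcal{S} \to \mathcal{S}$, and a value function $v(\cdot): \mathcal{X} \to \mathcal{Y}$, the Attention function taking the input of a query feature $x_q \in \mathcal{X}$ is defined as
$$\mathrm{Attention}\big(x_q; M(x_q, S_{\mathbf{x}_k})\big) = \sum_{x_k \in M(x_q, S_{\mathbf{x}_k})} \frac{k(x_q, x_k)}{\sum_{x_{k'} \in M(x_q, S_{\mathbf{x}_k})} k(x_q, x_{k'})}\, v(x_k). \tag{2}$$
Definition 1 is a class of linear smoothers (Wasserman, 2006) with kernel smoothing:
$$\sum_{x_k \in M(x_q, S_{\mathbf{x}_k})} \frac{k(x_q, x_k)}{\sum_{x_{k'} \in M(x_q, S_{\mathbf{x}_k})} k(x_q, x_{k'})}\, v(x_k) = \mathbb{E}_{p(x_k \mid x_q)}\big[v(x_k)\big],$$
where $v(x_k)$ outputs the "values" and $p(x_k \mid x_q) = \frac{k(x_q, x_k)}{\sum_{x_{k'} \in M(x_q, S_{\mathbf{x}_k})} k(x_q, x_{k'})}$ is a probability function that depends on $k$ and $M$ when $k(\cdot, \cdot)$ is always positive. In the prior work (Vaswani et al., 2017), $k(x_q, x_k) = \exp\big(\langle x_q W_q, x_k W_k \rangle / \sqrt{d_k}\big)$ and $v(x_k) = x_k W_v$.
The new formulation defines a larger space for composing attention by manipulating its individual components, and at the same time it is able to categorize different variants of attention in prior work (Huang et al., 2018b; Dai et al., 2019; Child et al., 2019; Tsai et al., 2019a). In the following, we study these components by dissecting Eq. (2) into: 1) the kernel feature space $\mathcal{X}$, 2) the kernel construction $k(\cdot, \cdot)$, 3) the value function $v(\cdot)$, and 4) the set filtering function $M(\cdot, \cdot)$.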
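To make the four components of Definition 1 concrete, the sketch below passes the kernel, the value function, and the set filtering function as ordinary Python callables. It is only an illustration: the function names are hypothetical, and the projections $W_q$, $W_k$, $W_v$ are taken as identity so that the example stays short.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def exp_kernel(x_q, x_k):
    """exp(<x_q, x_k> / sqrt(d)); W_q and W_k are taken as identity here."""
    return math.exp(dot(x_q, x_k) / math.sqrt(len(x_k)))

def identity_value(x_k):
    """v(x_k) = x_k, i.e. W_v taken as identity (an assumption)."""
    return list(x_k)

def keep_all(x_q, keys):
    """M(x_q, S) = S: every key is visible to the query."""
    return list(keys)

def kernel_attention(x_q, keys, kernel, value, set_filter):
    """Eq. (2): kernel-weighted average of values over the filtered key set."""
    visible = set_filter(x_q, keys)
    scores = [kernel(x_q, x_k) for x_k in visible]
    total = sum(scores)
    probs = [s / total for s in scores]  # p(x_k | x_q)
    values = [value(x_k) for x_k in visible]
    return [sum(p * v[j] for p, v in zip(probs, values))
            for j in range(len(values[0]))]

# Toy usage (illustrative values only).
query = [1.0, 2.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
smoothed = kernel_attention(query, keys, exp_kernel, identity_value, keep_all)
```

With the exponential kernel and no filtering, the normalized kernel scores coincide with the softmax weights of scaled dot-product attention, which is the correspondence the definition relies on. Swapping in a different `kernel`, `value`, or `set_filter` gives a different member of the same family.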

Kernel Feature Space X
In Eq. (2), to construct a kernel on $\mathcal{X}$, the first step is to identify the kernel feature space $\mathcal{X}$. In addition to modeling sequences like word sentences (Vaswani et al., 2017) or music signals (Huang et al., 2018b), the Transformer can also be applied to images (Parmar et al., 2018), sets, and multimodal sequences (Tsai et al., 2019a). Due to distinct data types, these applications admit various kernel feature spaces:
(i) Sequence Transformer (Vaswani et al., 2017; Dai et al., 2019): $\mathcal{X} := (\mathcal{F} \times \mathcal{T})$, with $\mathcal{F}$ being the non-positional feature space and $\mathcal{T}$ being the positional embedding space of the position in the sequence.
(ii) Image Transformer (Parmar et al., 2018): $\mathcal{X} := (\mathcal{F} \times \mathcal{H} \times \mathcal{W})$, with $\mathcal{F}$ being the non-positional feature space, $\mathcal{H}$ being the positional space of the height in an image, and $\mathcal{W}$ being the positional space of the width in an image.
(iii) Set Transformer and Non-Local Neural Networks: $\mathcal{X} := (\mathcal{F})$, with no positional information present.
(iv) Multimodal Transformer (Tsai et al., 2019a): $\mathcal{X} := (\mathcal{F}^{\ell} \times \mathcal{F}^{v} \times \mathcal{F}^{a} \times \mathcal{T})$, with $\mathcal{F}^{\ell}$ representing the language feature space, $\mathcal{F}^{v}$ representing the vision feature space, $\mathcal{F}^{a}$ representing the audio feature space, and $\mathcal{T}$ representing the temporal indicator space.
For the rest of the paper, we focus on the setting of the sequence Transformer, $\mathcal{X} = (\mathcal{F} \times \mathcal{T})$, and discuss the kernel construction on it.

Kernel Construction and the Role of Positional Embedding k(⋅, ⋅)
The kernel construction on $\mathcal{X} = (\mathcal{F} \times \mathcal{T})$ has distinct designs in variants of Transformers (Vaswani et al., 2017; Dai et al., 2019; Huang et al., 2018b; Child et al., 2019). Since the kernel feature space is now a joint space, we first discuss the kernel construction on $\mathcal{F}$ (the non-positional feature space) and then discuss how different variants integrate the positional embedding (with the positional feature space $\mathcal{T}$) into the kernel.
Kernel construction on $\mathcal{F}$. All the works considered the scaled asymmetric exponential kernel with the mappings $W_q$ and $W_k$ (Wilson et al., 2016; Li et al., 2017) for non-positional features $f_q$ and $f_k$:
$$k_{\exp}(f_q, f_k) = \exp\left(\frac{\langle f_q W_q, f_k W_k \rangle}{\sqrt{d_k}}\right). \tag{3}$$
Note that asymmetric kernels are also commonly used in various machine learning tasks (Yilmaz, 2007; Tsuda, 1999; Kulis et al., 2011), where it was observed that the kernel form can be flexible and even non-valid (i.e., a kernel that is not symmetric and positive semi-definite). In Section 3, we show that a symmetric design of the kernel has similar performance on various sequence learning tasks, and we also examine different kernel choices (i.e., linear, polynomial, and rbf kernels).
Kernel construction on $\mathcal{X} = (\mathcal{F} \times \mathcal{T})$. The designs for integrating the positional embeddings $t_q$ and $t_k$ are listed in the following.
(i) Absolute Positional Embedding (Vaswani et al., 2017; Dai et al., 2019; Ott et al., 2019): For the original Transformer (Vaswani et al., 2017), each $t_i$ is represented by a vector with each dimension being a sine or cosine function. For learned positional embedding (Dai et al., 2019; Ott et al., 2019), each $t_i$ is a learned parameter and is fixed for the same position across different sequences. These works define the feature space as the direct sum of its temporal and non-temporal spaces: $\mathcal{X} = \mathcal{F} \oplus \mathcal{T}$. Via the lens of kernel, the kernel similarity is defined as
$$k(x_q, x_k) := k_{\exp}(f_q + t_q, f_k + t_k). \tag{4}$$
(ii) Relative Positional Embedding in Transformer-XL (Dai et al., 2019): $t$ represents the indicator of the position in the sequence, and the kernel is chosen to be asymmetric, mixing sine and cosine functions:
$$k(x_q, x_k) := k_{\exp}(f_q, f_k) \cdot k_{f_q}(t_q, t_k), \tag{5}$$
with $k_{f_q}(t_q, t_k)$ being an asymmetric kernel. We refer readers to Dai et al. (2019) for more details.
(iii) Relative Positional Embedding of  and Music Transformer (Huang et al., 2018b): $t_{\cdot}$ represents the indicator of the position in the sequence, and the kernel is modified to be indexed by a look-up table:
$$k(x_q, x_k) := L_{t_q - t_k, f_q} \cdot k_{\exp}(f_q, f_k), \tag{6}$$
where $L_{t_q - t_k, f_q} = \exp(f_q W_q a_{t_q - t_k})$, with $a_{\cdot}$ being a learnable matrix whose width is the length of the sequence. We refer readers to  for more details.
Dai et al. (2019) showed that integrating positional embedding through Eq. (5) is better than through Eq. (6), which in turn is better than through Eq. (4). We argue the reason is that, when $f_i$ and $t_i$ are viewed as belonging to two distinct spaces ($\mathcal{X} := (\mathcal{F} \times \mathcal{T})$), the direct sum $x_i = f_i + t_i$ may not be optimal for computing the kernel score between $x_q$ and $x_k$. In contrast, Eq. (5) represents the kernel as a product of two kernels (one for $f_i$ and another for $t_i$), which is able to capture the similarities for both temporal and non-temporal components.
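The difference between the direct-sum kernel and a product of a feature kernel and a positional kernel can be checked numerically. The sketch below is illustrative only: it takes all projections as identity, uses a sinusoidal embedding in the style of Vaswani et al. (2017), and uses a simplified product form rather than the exact kernel of Eq. (5). Under these assumptions, expanding $\langle f_q + t_q, f_k + t_k \rangle$ into its four inner products shows that the direct-sum kernel equals the product kernel multiplied by two cross terms that mix features with positions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def k_exp(a, b):
    """Exponential kernel with identity projections (an assumption)."""
    return math.exp(dot(a, b) / math.sqrt(len(a)))

def sinusoidal_pe(pos, dim):
    """Sine/cosine positional embedding in the style of Vaswani et al. (2017)."""
    pe = []
    for i in range(dim):
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

def direct_sum_kernel(f_q, t_q, f_k, t_k):
    """Eq. (4): a single kernel applied to x = f + t."""
    x_q = [f + t for f, t in zip(f_q, t_q)]
    x_k = [f + t for f, t in zip(f_k, t_k)]
    return k_exp(x_q, x_k)

def product_kernel(f_q, t_q, f_k, t_k):
    """Feature kernel times positional kernel (simplified product form)."""
    return k_exp(f_q, f_k) * k_exp(t_q, t_k)

# Toy features and positions (illustrative values only).
f_q, f_k = [0.3, -0.1, 0.5, 0.2], [0.1, 0.4, -0.2, 0.6]
t_q, t_k = sinusoidal_pe(2, 4), sinusoidal_pe(5, 4)

# Cross terms that appear only in the direct-sum construction.
cross_terms = k_exp(f_q, t_k) * k_exp(t_q, f_k)
```

The cross terms compare a non-positional feature with a positional embedding, which is one way to read the argument that the direct sum may not be optimal when $\mathcal{F}$ and $\mathcal{T}$ are distinct spaces.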

Value Function v(⋅)
The current Transformers consider two different constructions of the value function:
(i) Original Transformer (Vaswani et al., 2017) and Sparse Transformer (Child et al., 2019):
$$v(x_k) = v((f_k, t_k)) := (f_k + t_k) W_v. \tag{7}$$
(ii) Transformer-XL (Dai et al., 2019), Music Transformer (Huang et al., 2018b), and Self-Attention with Relative Positional Embedding:
$$v(x_k) = v((f_k, t_k)) := f_k W_v. \tag{8}$$
Comparing Eq. (7) to Eq. (8), Eq. (7) takes the positional embedding into account when constructing the value function. In Section 3, we empirically observe that constructing the value function with Eq. (8) consistently outperforms the construction with Eq. (7), which suggests that we do not need positional embedding in the value function.

Set Filtering Function M (⋅, ⋅)
In Eq. (2), the set returned by the set filtering function $M(x_q, S_{\mathbf{x}_k})$ defines how many keys and which keys operate with $x_q$. In the following, we itemize the corresponding designs for the variants of Transformers:
(i) Encoder Self-Attention in original Transformer (Vaswani et al., 2017): For each query $x_q$ in the encoded sequence, $M(x_q, S_{\mathbf{x}_k}) = S_{\mathbf{x}_k}$ contains the keys being all the tokens in the encoded sequence. Note that encoder self-attention considers $\mathbf{x}_q = \mathbf{x}_k$ with $\mathbf{x}_q$ being the encoded sequence.
(ii) Encoder-Decoder Attention in original Transformer (Vaswani et al., 2017): For each query $x_q$ in the decoded sequence, $M(x_q, S_{\mathbf{x}_k}) = S_{\mathbf{x}_k}$ contains the keys being all the tokens in the encoded sequence. Note that encoder-decoder attention considers $\mathbf{x}_q \neq \mathbf{x}_k$ with $\mathbf{x}_q$ being the decoded sequence and $\mathbf{x}_k$ being the encoded sequence.
(iii) Decoder Self-Attention in original Transformer (Vaswani et al., 2017): For each query $x_q$ in the decoded sequence, $M(x_q, S_{\mathbf{x}_k})$ returns a subset of $S_{\mathbf{x}_k}$ ($M(x_q, S_{\mathbf{x}_k}) \subset S_{\mathbf{x}_k}$). Note that decoder self-attention considers $\mathbf{x}_q = \mathbf{x}_k$ with $\mathbf{x}_q$ being the decoded sequence. Since the decoded sequence is the output of the previous timestep, the query at position $i$ can only observe the keys being the tokens that are decoded with position $< i$. For convenience, let us define $S_1$ as the set returned by the original Transformer (Vaswani et al., 2017) from $M(x_q, S_{\mathbf{x}_k})$, which we will use later.
(iv) Decoder Self-Attention in Transformer-XL (Dai et al., 2019): For each query $x_q$ in the decoded sequence, $M(x_q, S_{\mathbf{x}_k})$ returns a set containing $S_1$ and additional memories ($M(x_q, S_{\mathbf{x}_k}) = S_1 + S_{mem}$, $M(x_q, S_{\mathbf{x}_k}) \supset S_1$). $S_{mem}$ refers to the additional memories.
(v) Decoder Self-Attention in Sparse Transformer (Child et al., 2019): For each query $x_q$ in the decoded sentence, $M(x_q, S_{\mathbf{x}_k})$ returns a subset of $S_1$ ($M(x_q, S_{\mathbf{x}_k}) \subset S_1$).
Comparing the various designs, we see that the computation time grows in proportion to the number of elements in $M(x_q, S_{\mathbf{x}_k})$, since the kernel is evaluated once per returned key. For performance-wise comparisons, Transformer-XL (Dai et al., 2019) showed that the additional memories in $M(x_q, S_{\mathbf{x}_k})$ are able to capture longer-term dependencies than the original Transformer (Vaswani et al., 2017) and hence result in better performance. Sparse Transformer (Child et al., 2019) showed that, despite having far fewer elements in $M(x_q, S_{\mathbf{x}_k})$, the attention can still reach the same performance as Transformer-XL (Dai et al., 2019) if the elements are carefully chosen.
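The set filtering designs (i) to (v) can be sketched as functions over key positions. Everything below is an illustrative assumption rather than any paper's implementation: keys are represented only by integer positions, memory positions are given negative indices, the causal filter keeps the query's own position visible (a common masking convention), and the strided rule is a hypothetical stand-in for a sparse pattern, not the exact pattern of Child et al. (2019).

```python
def encoder_filter(q_pos, key_positions):
    """(i)/(ii): M returns the whole key set."""
    return list(key_positions)

def causal_filter(q_pos, key_positions):
    """(iii): keep keys at or before the query position (the set S_1)."""
    return [p for p in key_positions if p <= q_pos]

def memory_filter(q_pos, key_positions, memory_positions):
    """(iv): S_1 together with additional memories S_mem."""
    return list(memory_positions) + causal_filter(q_pos, key_positions)

def strided_filter(q_pos, key_positions, stride=2):
    """(v): a subset of S_1; this stride rule is purely hypothetical."""
    return [p for p in causal_filter(q_pos, key_positions)
            if (q_pos - p) % stride == 0]

# Toy usage: five key positions and two memory slots.
positions = [0, 1, 2, 3, 4]
memory = [-2, -1]
s_1 = causal_filter(2, positions)
with_memory = memory_filter(2, positions, memory)
sparse = strided_filter(4, positions, stride=2)
```

The length of each returned list is the number of kernel evaluations needed for that query, so the three decoder variants trade context size against work per query.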

Exploring the Design of Attention
So far, we have seen how Eq. (2) connects to the variants of Transformers. By changing the kernel construction in Section 2.2.2, we can define a larger space for composing attention. In this paper, we present a new form of attention with a kernel that is 1) valid (i.e., a kernel that is symmetric and positive semi-definite) and 2) delicate in the sense of constructing a kernel on a joint space (i.e., $\mathcal{X} = (\mathcal{F} \times \mathcal{T})$):
$$k(x_q, x_k) := k_F(f_q, f_k) \cdot k_T(t_q, t_k), \quad \text{with } k_F(f_q, f_k) = \exp\left(\frac{\langle f_q W_F, f_k W_F \rangle}{\sqrt{d_k}}\right) \text{ and } k_T(t_q, t_k) = \exp\left(\frac{\langle t_q W_T, t_k W_T \rangle}{\sqrt{d_k}}\right), \tag{9}$$
where $W_F$ and $W_T$ are weight matrices. The new form considers a product of kernels, with the first kernel measuring similarity between non-temporal features and the second kernel measuring similarity between temporal features. Both kernels are symmetric exponential kernels. Note that $t_i$ here is chosen as the mixture of sine and cosine functions as in the prior work (Vaswani et al., 2017; Ott et al., 2019). In our experiments, we find that it reaches performance competitive with the current state-of-the-art design (Eq. (5) by Dai et al. (2019)). We fix the size of the weight matrices $W_{\cdot}$ in Eq. (9) and Eq. (5), which means that Eq. (9) uses 33% fewer attention parameters than Eq. (5) (Eq. (5) has weights $W_Q$, $W_K$, $W_R$ and Eq. (9) has weights $W_F$, $W_T$).
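A small sketch can illustrate why sharing one projection per kernel makes Eq. (9) symmetric, while the two-projection exponential kernel generally is not. The helper names and the toy weight values are assumptions for illustration; this is not the authors' code.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def symmetric_exp_kernel(a, b, W):
    """exp(<aW, bW> / sqrt(d_k)): the same projection on both arguments."""
    pa, pb = matvec(a, W), matvec(b, W)
    return math.exp(dot(pa, pb) / math.sqrt(len(pa)))

def asymmetric_exp_kernel(a, b, W_q, W_k):
    """exp(<aW_q, bW_k> / sqrt(d_k)): separate query and key projections."""
    pa, pb = matvec(a, W_q), matvec(b, W_k)
    return math.exp(dot(pa, pb) / math.sqrt(len(pa)))

def product_kernel(x_q, x_k, W_F, W_T):
    """Eq. (9): k_F(f_q, f_k) * k_T(t_q, t_k) for elements x = (f, t)."""
    (f_q, t_q), (f_k, t_k) = x_q, x_k
    return (symmetric_exp_kernel(f_q, f_k, W_F)
            * symmetric_exp_kernel(t_q, t_k, W_T))

# Toy weights and elements (illustrative values only).
W_F = [[0.5, -1.0], [2.0, 0.3]]
W_T = [[1.0, 0.2], [0.0, -0.7]]
x_a = ([1.0, 0.0], [0.0, 1.0])
x_b = ([0.0, 1.0], [1.0, 1.0])
forward = product_kernel(x_a, x_b, W_F, W_T)
backward = product_kernel(x_b, x_a, W_F, W_T)
```

Swapping the two arguments leaves the product kernel unchanged because each factor depends only on an inner product of identically projected inputs. Two weight matrices instead of three of the same size is also where the one-third parameter saving comes from.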

Experiments
By viewing the attention mechanism through Eq. (2), we aim at answering the following questions regarding the Transformer's designs:
Q1. What is the suggested way for incorporating positional embedding in the kernel function?
Q2. What forms of kernel are recommended to choose in the attention mechanism? Can we replace the asymmetric kernel with the symmetric version?
Q3. Is there any exception that the attention mechanism is not order-agnostic with respect to inputs? If so, can we downplay the role of positional embedding?
Q4. Is positional embedding required in value function?
We conduct experiments on neural machine translation (NMT) and sequence prediction (SP) tasks since these two tasks are commonly chosen for studying Transformers (Vaswani et al., 2017; Dai et al., 2019). Note that NMT has three different types of attention (i.e., encoder self-attention, decoder-encoder attention, and decoder self-attention) and SP has only one type of attention (i.e., decoder self-attention). For the choice of datasets, we pick the IWSLT'14 German-English (De-En) dataset for NMT and the WikiText-103 dataset (Merity et al., 2016) for SP, as suggested by Edunov et al. (2017) and Dai et al. (2019). For fairness of comparisons, we train five random initializations and report test accuracy with the highest validation score. We fix the position-wise operations in Transformer and only change the attention mechanism. Similar to prior work (Vaswani et al., 2017; Dai et al., 2019), we report BLEU score for NMT and perplexity for SP.

Incorporating Positional Embedding
In order to find the best way to integrate positional embedding (PE), we study different PE incorporation in the kernel function k(⋅, ⋅) in Eq. (2). Referring to Sections 2.2.2 and 2.3, we consider four cases: 1) PE as direct sum in the feature space (see Eq. (4)), 2) PE as a look-up table (see Eq. (6)), 3) PE in product kernel with asymmetric kernel (see Eq. (5)), and 4) PE in product kernel with symmetric kernel (see Eq. (9)). We present the results in Table 1.
First, we see that having PE as a look-up table outperforms having PE as a direct sum in the feature space, especially for the SP task. Note that the look-up table is indexed by the relative position (i.e., $t_q - t_k$) instead of the absolute position. Second, we see that PE in the product kernel proposed by Dai et al. (2019) may not consistently outperform the other integration types (it has a lower BLEU score for NMT). Our proposed product kernel reaches the best result in NMT and is competitive with the best result in SP.
Table 1: Incorporating Positional Embedding (PE). NMT stands for neural machine translation on the IWSLT'14 De-En dataset (Edunov et al., 2017) and SP stands for sequence prediction on the WikiText-103 dataset (Merity et al., 2016). ↑ means the higher the better and ↓ means the lower the better.
Approach | PE Incorporation | Kernel Form | NMT (BLEU ↑) | SP (Perplexity ↓)
⋯ | ⋯ | ⋯ | ⋯ | ⋯
Dai et al. (2019) (Eq. (5)) | Product Kernel | $k_{\exp}(f_q, f_k) \cdot k_{f_q}(t_q, t_k)$ | 33.62 | 24.10
Ours (Eq. (9)) | Product Kernel | $k_F(f_q, f_k) \cdot k_T(t_q, t_k)$ | 34.71 | 24.28

Kernel Types
To find the best kernel form in the attention mechanism, in addition to the exponential kernel (see Eq. (3)), we compare different kernel forms (i.e., linear, polynomial, and rbf kernels) for the non-positional features. We also provide the results for changing the asymmetric kernel to a symmetric one by forcing $W_q = W_k$, so that the resulting kernel is a valid kernel (Scholkopf and Smola, 2001). The numbers are shown in Table 2. Note that, for fairness, other than manipulating the kernel choice for the non-positional features, we fix the configuration by Vaswani et al. (2017) for NMT and the configuration by Dai et al. (2019) for SP. We first observe that the linear kernel does not converge for either NMT or SP. We argue the reason is that the linear kernel may take negative values and thus violates the assumption in the kernel smoother that the kernel score must be positive (Wasserman, 2006). Next, we observe that the kernels with an infinite-dimensional feature space (i.e., exponential and rbf kernels) outperform the kernel with a finite-dimensional feature space (i.e., the polynomial kernel). We also see that the rbf kernel performs the best for NMT and the exponential kernel performs the best for SP. We conclude that the choice of kernel matters for the design of attention in Transformer. Also, we see not much performance difference when comparing the asymmetric to the symmetric kernel. In the experiment, we fix the size of $W_{\cdot}$ in the kernel, and thus adopting the symmetric kernel has the benefit of saving parameters.
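The positivity argument for the linear kernel can be seen in a two-key toy example. The kernel forms below are common textbook forms with identity projections and an assumed $\sqrt{d}$ scaling; the exact parameterizations used in the experiments are not reproduced here, and the function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linear_kernel(a, b):
    return dot(a, b)

def polynomial_kernel(a, b, degree=2):
    return dot(a, b) ** degree

def exponential_kernel(a, b):
    return math.exp(dot(a, b) / math.sqrt(len(a)))

def rbf_kernel(a, b):
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / math.sqrt(len(a)))

def smoother_weights(query, keys, kernel):
    """Normalized kernel scores, as in the kernel smoother of Eq. (2)."""
    scores = [kernel(query, k) for k in keys]
    total = sum(scores)
    return [s / total for s in scores]

# Toy query and keys chosen so that one inner product is negative.
query = [1.0, 0.0]
keys = [[2.0, 0.0], [-1.0, 0.0]]
linear_w = smoother_weights(query, keys, linear_kernel)
exp_w = smoother_weights(query, keys, exponential_kernel)
rbf_w = smoother_weights(query, keys, rbf_kernel)
```

Here the linear kernel produces a negative weight, so the normalized scores no longer form a probability distribution over the keys, whereas the exponential and rbf kernels always yield positive weights that sum to one.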

Order-Invariance in Attention
The need for positional embedding (PE) in the attention mechanism is based on the argument that the attention mechanism is an order-agnostic (or, permutation equivariant) operation (Vaswani et al., 2017; Huang et al., 2018b; Dai et al., 2019; Child et al., 2019). However, we show that, for decoder self-attention, the operation is not order-agnostic. For clarification, we are not attacking the claim made by the prior work (Vaswani et al., 2017; Huang et al., 2018b; Dai et al., 2019; Child et al., 2019), but we aim at providing a new look at the order-invariance problem when considering the attention mechanism with masks (masks refer to the set filtering function in our kernel formulation). In other words, previous work did not consider the mask between queries and keys when discussing the order-invariance problem (Pérez et al., 2019).
Table 3: Order-Invariance in Attention. To save space, we denote Encoder Self-Attention / Encoder-Decoder Attention / Decoder Self-Attention as A/B/C. Note that SP only has decoder self-attention.
To put it formally, we first present the definition of a permutation equivariant function:
Definition 2. Denote $\Pi$ as the set of all permutations over $[n] = \{1, \cdots, n\}$. A function $func: \mathcal{X}^n \to \mathcal{Y}^n$ is permutation equivariant iff for any permutation $\pi \in \Pi$, $func(\pi x) = \pi\, func(x)$.
Prior work showed that the standard attention (encoder self-attention (Vaswani et al., 2017; Dai et al., 2019)) is permutation equivariant. Here, we present the non-permutation-equivariance problem for the decoder self-attention:
Proposition 1. Decoder self-attention (Vaswani et al., 2017; Dai et al., 2019) is not permutation equivariant.
To proceed with the proof, we need the following definition and propositions.
Definition 3. Denote $\Pi$ as the set of all permutations over $[n] = \{1, \cdots, n\}$ and $S^{\pi}_{\mathbf{x}_k}$ as performing permutation $\pi$ over $S_{\mathbf{x}_k}$. $\mathrm{Attention}(x_q; S_{\mathbf{x}_k})$ is said to be permutation equivariant w.r.t. $S_{\mathbf{x}_k}$ if and only if for any $\pi \in \Pi$, $\mathrm{Attention}(x_q; S^{\pi}_{\mathbf{x}_k}) = \mathrm{Attention}(x_q; S_{\mathbf{x}_k})$.
Proposition 2. Attention with the set filtering function $M(x_q, S_{\mathbf{x}_k}) = S_{\mathbf{x}_k}$ is permutation equivariant w.r.t. $S_{\mathbf{x}_k}$.
Proof. It is easy to show that if $M(x_q, S_{\mathbf{x}_k}) = S_{\mathbf{x}_k}$, Eq. (2) remains unchanged for any permutation $\pi$ performed on $S_{\mathbf{x}_k}$. ∎
Proposition 3. Attention with the set difference $S_{\mathbf{x}_k} \setminus M(x_q, S_{\mathbf{x}_k}) \neq \emptyset$ is not permutation equivariant w.r.t. $S_{\mathbf{x}_k}$.
Proof. First, we take one element $x \in S_{\mathbf{x}_k}$ such that $x \notin M(x_q, S_{\mathbf{x}_k})$. Then, we construct a permutation $\pi$ such that $x \in M(x_q, S^{\pi}_{\mathbf{x}_k})$. It is obvious that Eq. (2) changes after this permutation and thus $\mathrm{Attention}\big(x_q; M(x_q, S_{\mathbf{x}_k})\big)$ is not permutation equivariant w.r.t. $S_{\mathbf{x}_k}$. ∎
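Propositions 2 and 3 can be checked on a toy example. The sketch below is illustrative only: it uses scalar features, the kernel $k(x_q, x) = \exp(x_q \cdot x)$, the value function $v(x) = x$, and a filter that masks by storage order; all of these are simplifying assumptions.

```python
import math

def attention(x_q, visible_keys):
    """Eq. (2) with scalar features: k(q, x) = exp(q * x) and v(x) = x."""
    scores = [math.exp(x_q * x) for x in visible_keys]
    total = sum(scores)
    return sum(s / total * x for s, x in zip(scores, visible_keys))

def keep_all(position, keys):
    """M returns the whole key set (no mask)."""
    return list(keys)

def causal(position, keys):
    """M keeps only keys stored at or before `position` (mask by order)."""
    return keys[:position + 1]

# The same four keys in two different orders (illustrative values only).
keys = [0.1, 0.5, -0.3, 0.9]
permuted = [0.9, -0.3, 0.5, 0.1]
x_q, position = 0.5, 1

full_before = attention(x_q, keep_all(position, keys))
full_after = attention(x_q, keep_all(position, permuted))
masked_before = attention(x_q, causal(position, keys))
masked_after = attention(x_q, causal(position, permuted))
```

Without a mask the output is unaffected by the reordering, up to floating-point rounding. With the order-based filter, the permutation moves different elements into the visible set, so the output changes.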

Proof of Proposition 1. First, we have $x_q \sim S_{\mathbf{x}_k}$. Hence, showing that $\mathrm{Attention}(x_q; S_{\mathbf{x}_k})$ is not permutation equivariant w.r.t. $S_{\mathbf{x}_k}$ is equivalent to showing that Attention is not permutation equivariant. Then, since the decoder self-attention considers masking (i.e., $M(x_q, S_{\mathbf{x}_k})$ returns a subset of $S_{\mathbf{x}_k}$), by Proposition 3, the decoder self-attention is not permutation equivariant. ∎
In fact, beyond being a permutation-inequivariant process, the decoding process in the decoder self-attention already implies the order information from the data. To show this, take the decoded sequence $\mathbf{y} = [init, y_1, y_2, y_3, y_4]$ as an example, where $init$ stands for the initial token. When determining the output $y_1$ from $init$, the set filtering function is $M(init, S_{\mathbf{y}}) = \{init\}$. Similarly, $M(y_1, S_{\mathbf{y}})$, $M(y_2, S_{\mathbf{y}})$, and $M(y_3, S_{\mathbf{y}})$ are $\{init, y_1\}$, $\{init, y_1, y_2\}$, and $\{init, y_1, y_2, y_3\}$, respectively. This raises a question: do we require PE in decoder self-attention? We present the results of removing PE in decoder self-attention in Table 3. From the table, we can see that, for NMT, removing PE only in decoder self-attention results in a slight performance drop (from 34.71 to 34.49). However, removing PE in the entire model greatly degrades the performance (from 34.71 to 14.47). On the other hand, for SP, removing PE from our proposed attention variant dramatically degrades the performance (from 24.28 to 30.92). Nonetheless, the performance is slightly better than considering PE from the original Transformer (Vaswani et al., 2017).

Positional Embedding in Value Function
To determine the need for positional embedding (PE) in the value function, we conduct experiments adopting Eq. (7) or Eq. (8) in the attention mechanism. The results are presented in Table 4. From the table, we find that considering PE in the value function (Eq. (7)) does not improve performance compared to not considering PE in the value function (Eq. (8)).

Take-Home Messages
Based on the results and discussions, we can now answer the questions given at the beginning of this section. The answers are summarized into the take-home messages in the following.
A1. We show that integrating the positional embedding in the form of a product kernel (Eq. (5) or Eq. (9)) gives the best performance.
A2. The kernel form does matter. Adopting a kernel form with infinite feature dimension (i.e., the exponential kernel or the rbf kernel) gives the best results. The symmetric design of the kernel saves parameters and barely sacrifices performance compared to the non-symmetric one.
A3. The decoder self-attention is not an order-agnostic operation with respect to the order of inputs. However, incorporating positional embedding into the attention mechanism may still improve performance.
A4. We find that there is not much performance difference between considering and not considering the positional embedding in the value function.