DA-Transformer: Distance-aware Transformer

Transformer has achieved great success in the NLP field by composing various advanced models like BERT and GPT. However, Transformer and its existing variants may not be optimal in capturing token distances because the position or distance embeddings used by these methods usually cannot keep the precise information of real distances, which may not be beneficial for modeling the orders and relations of contexts. In this paper, we propose DA-Transformer, which is a distance-aware Transformer that can exploit the real distance. We propose to incorporate the real distances between tokens to re-scale the raw self-attention weights, which are computed by the relevance between attention query and key. Concretely, in different self-attention heads the relative distance between each pair of tokens is weighted by different learnable parameters, which control the different preferences on long- or short-term information of these heads. Since the raw weighted real distances may not be optimal for adjusting self-attention weights, we propose a learnable sigmoid function to map them into re-scaled coefficients that have proper ranges. We first clip the raw self-attention weights via the ReLU function to keep non-negativity and introduce sparsity, and then multiply them with the re-scaled coefficients to encode real distance information into self-attention. Extensive experiments on five benchmark datasets show that DA-Transformer can effectively improve the performance of many tasks and outperform the vanilla Transformer and its several variants.


Introduction
Transformer (Vaswani et al., 2017) has achieved huge success in the NLP field in recent years (Kobayashi et al., 2020). It serves as the basic architecture of various state-of-the-art models like BERT (Devlin et al., 2019) and GPT (Radford et al., 2019), and boosts the performance of many tasks like text generation (Koncel-Kedziorski et al., 2019), machine translation (Vaswani et al., 2017), and reading comprehension (Xu et al., 2019). Thus, the improvement on the Transformer architecture would be beneficial for many NLP-related fields (Wu et al., 2020a).
A core component of Transformer is multi-head self-attention, which is responsible for modeling the relations between contexts (Guo et al., 2019). However, self-attention is position-agnostic since it does not distinguish the order of its inputs. Thus, in the vanilla Transformer, position encoding is applied to the input to help Transformer capture position information. However, in contrast to recurrent and convolutional neural networks, it is difficult for vanilla Transformers to be aware of token distances (Shaw et al., 2018), which are usually important cues for context modeling. Thus, several works have explored incorporating token distance information into Transformer. For example, Shaw et al. (2018) proposed to combine the embeddings of relative positions with the attention key and value in the self-attention network. They restricted the maximum relative distance so that precise relative position information is kept only within a certain distance. Yan et al. (2019) proposed a variant of the self-attention network for named entity recognition, which incorporates sinusoidal embeddings of relative positions to compute attention weights in a direction- and distance-aware way. However, the distance or relative position embeddings used by these methods usually cannot keep the precise information of the real distance, which may not be beneficial for the Transformer to capture word orders and context relations.
In this paper, we propose a distance-aware Transformer (DA-Transformer), which can explicitly exploit real token distance information to enhance context modeling by leveraging the relative distances between different tokens to re-scale the raw attention weights before softmax normalization.
More specifically, since global and local context modeling usually have different distance preferences, we propose to learn a different parameter in different attention heads to weight the token distances, which control the preferences of attention heads on long or short distances. In addition, since the weighted distances may not have been restricted to a proper range, we propose a learnable sigmoid function to map the weighted distances into rescaled coefficients. They are further multiplied with the raw attention weights that are clipped by the ReLU function for keeping the non-negativity and introducing sparsity. We conduct extensive experiments on five benchmark datasets for different tasks, and the results demonstrate that our approach can effectively enhance the performance of Transformer and outperform its several variants with distance modeling.
The main contributions of this paper include:
• We propose a distance-aware Transformer that uses the real token distances to keep precise distance information in adjusting attention weights for accurate context modeling.
• We propose to use different parameters to weight real distances in different attention heads to control their diverse preferences on short-term or long-term information.
• We propose a learnable sigmoid function to map the weighted distances into re-scaled coefficients with proper ranges for better adjusting the attention weights.
• We conduct extensive experiments on five benchmark datasets and the results validate the effectiveness of our proposed method.
Related Work

Transformer
To make this paper self-contained, we first briefly introduce the architecture of Transformer, which was initially introduced to the machine translation task (Vaswani et al., 2017). It has become an important basic neural architecture of various state-of-the-art NLP models like BERT (Devlin et al., 2019) and GPT (Radford et al., 2019). Its core component is a multi-head self-attention network with $h$ attention heads. The $i$-th head uses three projection matrices $W_Q^{(i)}$, $W_K^{(i)}$ and $W_V^{(i)}$ to respectively transform the input matrix $H$ into the input query $Q^{(i)}$, key $K^{(i)}$ and value $V^{(i)}$, which is formulated as follows:
$$Q^{(i)} = H W_Q^{(i)}, \quad K^{(i)} = H W_K^{(i)}, \quad V^{(i)} = H W_V^{(i)}. \tag{1}$$
Then, it uses a scaled dot-product attention head to process its query, key and value, which is formulated as follows:
$$\mathrm{Attention}(Q^{(i)}, K^{(i)}, V^{(i)}) = \mathrm{softmax}\left(\frac{Q^{(i)} K^{(i)\top}}{\sqrt{d}}\right) V^{(i)}, \tag{2}$$
where $d$ is the dimension of the vectors in the query and key. The outputs of the $h$ attention heads are concatenated together and the final output is a linear projection of the concatenated representations, which is formulated as follows:
$$\mathrm{Multihead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W_O, \quad \mathrm{head}_i = \mathrm{Attention}(Q^{(i)}, K^{(i)}, V^{(i)}), \tag{3}$$
where $W_O$ is an output projection matrix. In the standard Transformer, a position-wise feed-forward neural network is further applied to the output of the multi-head self-attention network. Its function is formulated as follows:
$$\mathrm{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2, \tag{4}$$
where $W_1$, $W_2$, $b_1$, $b_2$ are kernel and bias parameters. Transformer also employs layer normalization (Ba et al., 2016) and residual connection techniques after the multi-head self-attention and feed-forward neural networks, which are also kept in our method. Since the self-attention network does not distinguish the order and position of input tokens, Transformer adds the sinusoidal embeddings of positions to the input embeddings to capture position information. However, position embeddings may not be optimal for distance modeling in Transformer because distances cannot be precisely recovered from the dot-product between two position embeddings.
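The multi-head self-attention described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the toy sizes, the random weights and the function names (`attention_head`, `multi_head_self_attention`) are hypothetical, and the feed-forward network, layer normalization and residual connections are omitted. The last lines check the position-agnostic property mentioned in the text: permuting the input rows merely permutes the output rows.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(H, W_q, W_k, W_v):
    # Project the input into query, key and value.
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    d = Q.shape[-1]
    # Scaled dot-product attention.
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def multi_head_self_attention(H, heads, W_o):
    # heads is a list of (W_q, W_k, W_v) tuples, one tuple per head.
    outputs = [attention_head(H, *params) for params in heads]
    # Concatenate the head outputs and apply the output projection.
    return np.concatenate(outputs, axis=-1) @ W_o

# Hypothetical toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
N, d_model, d_head, h = 6, 32, 8, 4
H = rng.normal(size=(N, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(h)]
W_o = rng.normal(size=(h * d_head, d_model))

out = multi_head_self_attention(H, heads, W_o)

# Without position information, self-attention is permutation-equivariant.
perm = rng.permutation(N)
out_perm = multi_head_self_attention(H[perm], heads, W_o)
```

Because `out_perm` matches `out[perm]`, the network by itself cannot tell where a token sits in the sequence, which is why position or distance information has to be injected separately.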

Distance-aware Transformer
Instead of directly using the sinusoidal position embedding (Vaswani et al., 2017) or the absolute position embedding (Devlin et al., 2019), several variants of the Transformer explore the use of relative positions to better model the distance between contexts (Shaw et al., 2018; Dai et al., 2019; Yan et al., 2019). For example, Shaw et al. (2018) proposed to add the embeddings of relative positions to the attention key and value to capture the relative distance between two tokens. They only kept the precise distance within a certain range by using a threshold to clip the maximum distance, which helps generalize to long sequences. Dai et al. (2019) proposed Transformer-XL, which uses another form of relative positional encodings that integrate content-dependent positional scores and a global positional score into the attention weights. Yan et al. (2019) proposed direction-aware sinusoidal relative position embeddings and used them in a way similar to Transformer-XL. In addition, they proposed to use un-scaled attention to better fit the NER task. However, relative position embeddings may not be optimal for modeling distance information because they usually cannot keep the precise information of real token distances. Different from these methods, we propose to directly re-scale the attention weights based on the mapped relative distances instead of using sinusoidal position embeddings, which can explicitly encode real distance information to achieve more accurate distance modeling.

DA-Transformer
In this section, we introduce our proposed distance-aware Transformer (DA-Transformer) approach, which can effectively exploit real token distance information to enhance context modeling. It uses a learnable parameter to weight the real distances between tokens in each attention head, and uses a learnable sigmoid function to map the weighted distances into re-scaled coefficients with proper ranges, which are further used to adjust the raw attention weights before softmax normalization. The details of DA-Transformer are introduced in the following sections.

Head-wise Distance Weighting
Similar to the standard Transformer, the input of our model is also a matrix that contains the representation of each token, which is denoted as $H = [h_1, h_2, \ldots, h_N]$, where $N$ is the length of the sequence. We denote the real relative distance between the $i$-th and $j$-th positions as $R_{i,j}$, which is computed by $R_{i,j} = |i - j|$. We can then obtain the relative distance matrix $R \in \mathbb{R}^{N \times N}$ that describes the relative distance between each pair of positions. In each attention head, we use a learnable parameter $w_i$ to weight the relative distance by $R^{(i)} = w_i R$, which will be further used to adjust the self-attention weights. In our method, we stipulate that a more positive $R^{(i)}$ will amplify the attention weights more strongly while a more negative $R^{(i)}$ will diminish them more intensively. Thus, a positive $w_i$ means that this attention head prefers to capture long-distance information, while a negative $w_i$ means that it focuses more on local contexts. By learning different values of $w_i$, different attention heads may have different preferences on capturing either short-term or long-term contextual information with different intensity.
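As a small illustration of the head-wise weighting, the following NumPy sketch builds the distance matrix with entries |i − j| and scales it by one scalar per head. The function names and the two example weights are hypothetical; in the model the weights are learned.

```python
import numpy as np

def relative_distance_matrix(N):
    # R[i, j] = |i - j| for every pair of positions.
    pos = np.arange(N)
    return np.abs(pos[:, None] - pos[None, :])

def weight_distances(R, w):
    # w holds one scalar per head; the result has shape (h, N, N).
    return w[:, None, None] * R[None, :, :]

R = relative_distance_matrix(4)
# Hypothetical values: a head preferring long distances (w > 0)
# and a head preferring local contexts (w < 0).
w = np.array([1.0, -0.5])
R_heads = weight_distances(R, w)
```

Broadcasting keeps the computation vectorized over heads, so only a single N × N integer matrix has to be built per sequence length.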

Weighted Distance Mapping
Since the raw weighted distances may not be in the proper range for adjusting the attention weights, we need to map them into the re-scaled coefficients via a function $\hat{R}^{(i)} = f(R^{(i)})$ that is suitable for adjusting the self-attention weights. However, it is not a trivial task to design the function $f(\cdot)$ because it needs to satisfy the following requirements:
(1) $f(0) = 1$. We stipulate that zero distances do not influence the self-attention weights.
(2) The value of $f(R^{(i)})$ should be zero when $R^{(i)} \to -\infty$. This requirement is to guarantee that if an attention head prefers to capture local information ($w_i < 0$), the long-distance information is suppressed.
(3) The value of $f(R^{(i)})$ should be bounded when $R^{(i)} \to +\infty$. This requirement is to ensure that the model is able to process long sequences without over-emphasizing distant contexts.
(4) The scale of $f(\cdot)$ needs to be tunable. This aims to help the model better adjust the intensity of distance information.
(5) The function $f(\cdot)$ needs to be monotone.
To satisfy the five requirements above, we propose a learnable sigmoid function to map the weighted relative distances $R^{(i)}$, which is formulated as follows:
$$f(R^{(i)}; v_i) = \frac{1 + \exp(v_i)}{1 + \exp(v_i - R^{(i)})}, \tag{5}$$
where $v_i$ is a learnable parameter in this head that controls the upper bound and ascending steepness of this function. The curves of our learnable sigmoid function under several different values of $v_i$ are plotted in Fig. 1. We can see that the proposed function satisfies all the requirements above. In addition, from this figure we find that if $v_i$ is larger, the upper bound of the curve is higher, which means that distance information is more intensive. When $v_i = 0$, it is in fact identical to the standard sigmoid function except for the scaling factor of 2. By mapping the weighted distances $R^{(i)}$ via the function $f(\cdot)$, we can obtain the final re-scaled coefficients $\hat{R}^{(i)}$ in a learnable way.
Several illustrative examples of the re-scaled coefficients under $w_i = \pm 1$ and $v_i = \pm 1$ are respectively shown in Figs. 2(a)-2(d). We can see that if $w_i$ is positive, long-distance contexts are preferred while short-term contexts are suppressed. The situation is reversed if $w_i$ becomes negative. In addition, the coefficients in Fig. 2(c) have larger dynamic ranges than the coefficients in Fig. 2(a), indicating that long-distance information is more dominant in Fig. 2(c). Moreover, the coefficients in Fig. 2(d) are "sharper" than those in Fig. 2(b), which indicates that the model tends to capture shorter distances.
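The requirements on the mapping function can be checked numerically. The sketch below assumes the closed form f(x; v) = (1 + exp(v)) / (1 + exp(v − x)), which is consistent with every property stated in this section (value 1 at zero, limit 0 at −∞, bound 1 + exp(v) at +∞, monotonicity, and twice the standard sigmoid at v = 0). The function name `learnable_sigmoid` and the probe values are illustrative choices, not part of the method description.

```python
import numpy as np

def learnable_sigmoid(x, v):
    # f(x; v) = (1 + exp(v)) / (1 + exp(v - x)); v is learned per head.
    return (1.0 + np.exp(v)) / (1.0 + np.exp(v - x))

xs = np.linspace(-10.0, 10.0, 201)
for v in (-1.0, 0.0, 1.0):
    ys = learnable_sigmoid(xs, v)
    # Report the value at zero and both ends of the probed range.
    print(f"v={v:+.0f}  f(0)={learnable_sigmoid(0.0, v):.3f}  "
          f"f(-10)={ys[0]:.5f}  f(10)={ys[-1]:.3f}  "
          f"upper bound={1.0 + np.exp(v):.3f}")
```

For long sequences and a negative head weight, the exponent `v - x` can become large; a production implementation would probably want a numerically stable formulation, which this sketch does not attempt.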

Attention Adjustment
Then, we use the re-scaled coefficients to adjust the raw attention weights that are computed by the dot-product between the query and key, i.e., $\frac{Q^{(i)} K^{(i)\top}}{\sqrt{d}}$. Different from existing methods that add position or distance representations to the query-key dot-product, in our approach we propose to multiply the re-scaled coefficients with the query-key dot-product. This is because if we simply add the re-scaled coefficients to the raw attention weights, tokens whose relations are very weak but whose re-scaled coefficients are large will have over-amplified final attention weights. This is not optimal for modeling contextual information because the attention weights of irrelevant contexts cannot be fully suppressed. However, there are also some problems if we directly multiply the re-scaled coefficients $\hat{R}^{(i)}$ and the raw attention weights $\frac{Q^{(i)} K^{(i)\top}}{\sqrt{d}}$. This is because the sign of the raw attention weights is indefinite and the multiplied results cannot accurately reflect the influence of distance information. Thus, we propose to apply a ReLU (Glorot et al., 2011) activation function to the raw attention weights to keep non-negativity. In this way, the final output $O^{(i)}$ of an attention head can be formulated as follows:
$$O^{(i)} = \mathrm{softmax}\left(\frac{\mathrm{ReLU}(Q^{(i)} K^{(i)\top}) * \hat{R}^{(i)}}{\sqrt{d}}\right) V^{(i)}, \tag{6}$$
where $*$ represents element-wise product. The ReLU function can also introduce sparsity to the self-attention because only the positive attention weights can be amplified by the re-scaled coefficients, which makes the attention weights in our method sharper. We concatenate the outputs from the $h$ independent attention heads, and project them into a unified output. In addition, we keep the same layer normalization and residual connection strategy as the standard Transformer.
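Putting the pieces together, one distance-aware attention head might look like the sketch below. It is a hedged reading of the description above rather than a reference implementation: the raw query-key products are clipped by ReLU, multiplied element-wise by the re-scaled coefficients, scaled by the square root of d, and normalized by softmax. The closed form used for the learnable sigmoid is an assumption consistent with the stated properties, and all names, sizes and parameter values are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def learnable_sigmoid(x, v):
    # Assumed form: equals 1 at x = 0 and is bounded by 1 + exp(v).
    return (1.0 + np.exp(v)) / (1.0 + np.exp(v - x))

def da_attention_head(H, W_q, W_k, W_v, w, v):
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    N, d = Q.shape
    pos = np.arange(N)
    R = np.abs(pos[:, None] - pos[None, :])        # real distances |i - j|
    R_hat = learnable_sigmoid(w * R, v)            # re-scaled coefficients
    # ReLU keeps the raw weights non-negative before the multiplication.
    scores = np.maximum(Q @ K.T, 0.0) * R_hat / np.sqrt(d)
    A = softmax(scores)
    return A @ V, A

# Hypothetical toy input and parameters.
rng = np.random.default_rng(0)
N, d_model, d_head = 5, 8, 4
H = rng.normal(size=(N, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, A = da_attention_head(H, W_q, W_k, W_v, w=-1.0, v=1.0)
```

With `w = 0` every coefficient equals 1, so the head falls back to ordinary attention over ReLU-clipped scores. Since the square root of d is positive, applying ReLU before or after the scaling gives the same result.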

Computational Complexity Analysis
Compared with the standard Transformer, the major additional time cost is brought by computing the re-scaled coefficients $\hat{R}^{(i)}$ and using them to adjust the attention weights. The theoretical time complexity of the two operations in each head is $O(N^2)$, which is much smaller than the time complexity of computing the attention weights, i.e., $O(N^2 \times d)$. In addition, both Eq. (5) and Eq. (6) in our approach can be computed in a vectorized manner. Thus, the additional time cost of our method is very small. Besides, the increase in parameters is also minimal because we only introduce $2h$ additional parameters, which are usually negligible compared with the projection matrices like $W_Q^{(i)}$. Thus, our approach inherits the efficiency of the Transformer architecture.
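To give a feel for the size of the parameter overhead, the following back-of-the-envelope sketch compares the 2h extra scalars with the query, key and value projections alone. The sizes used (300-dimensional inputs, 16 heads of dimension 16) mirror the experimental settings reported later in the paper; ignoring bias terms and the output projection is a simplifying assumption, and the helper name is hypothetical.

```python
def extra_parameter_ratio(d_in, d_head, h):
    # W_Q, W_K and W_V in every head (biases and W_O are ignored here).
    projection_params = 3 * d_in * d_head * h
    # One distance weight and one sigmoid parameter per head.
    extra_params = 2 * h
    return extra_params, projection_params, extra_params / projection_params

extra, proj, ratio = extra_parameter_ratio(d_in=300, d_head=16, h=16)
print(f"{extra} extra vs. {proj} projection parameters ({ratio:.4%})")
```

Under these assumptions the extra parameters amount to 32 scalars against 230,400 projection weights.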

Datasets and Experimental Settings
Our experiments are conducted on five benchmark datasets for different tasks. Four of them are benchmark NLP datasets. The first one is AG's News (denoted as AG), which is a news topic classification dataset. The second one is Amazon Electronics (He and McAuley, 2016) (denoted as Amazon), which is a dataset for review rating prediction. The third one is Stanford Sentiment Treebank (Socher et al., 2013) (denoted as SST). We use the binary classification version of this dataset. The fourth one is the Stanford Natural Language Inference (Bowman et al., 2015) (SNLI) dataset, which is a widely used natural language inference dataset. The detailed statistics of these datasets are summarized in Table 1. In addition, we also conduct experiments on a benchmark news recommendation dataset named MIND (Wu et al., 2020c), aiming to validate the effectiveness of our approach in both text and user modeling. It contains the news impression logs of 1 million users from Microsoft News from October 12 to November 22, 2019. The training set contains the logs in the first five weeks except those on the last day, which are used for validation. The remaining logs are used for test. The key statistics of this dataset are summarized in Table 2.
In our experiments, we use the 300-dimensional GloVe (Pennington et al., 2014) embeddings for word embedding initialization. The number of attention heads is 16, and the output dimension of each attention head is 16. We use one Transformer layer in all experiments. On the AG, SST and SNLI datasets, we directly apply Transformer-based methods to the sentences. On the Amazon dataset, since reviews are usually long documents, we use Transformers in a hierarchical way by first learning sentence representations from words via a word-level Transformer and then learning document representations from sentences via a sentence-level Transformer.
On the MIND dataset, following Wu et al. (2020b), we also use a hierarchical model architecture that first learns representations of historical clicked news and candidate news from their titles with a word-level Transformer, then learns user representations from the representations of clicked news with a news-level Transformer, and finally matches user and candidate news representations to compute click scores. We use the same model training strategy with negative sampling techniques as NRMS. On all datasets we use Adam (Kingma and Ba, 2015) as the optimization algorithm and the learning rate is 1e-3. On the AG, Amazon, SST and SNLI datasets, accuracy and macro-F score are used as the performance metrics. On the MIND dataset, we use the average AUC, MRR, nDCG@5 and nDCG@10 scores of all sessions as the metrics. Each experiment is repeated 5 times independently and the average results with standard deviations are reported.

Performance Evaluation
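We compare our proposed DA-Transformer method with several baseline methods, including: (1) Transformer (Vaswani et al., 2017), the vanilla Transformer architecture; (2) Transformer-RPR (Shaw et al., 2018), a variant of Transformer that adds relative position representations to the attention key and value; (3) Transformer-XL (Dai et al., 2019), a variant of Transformer that integrates content-dependent positional scores and a global positional score into the attention weights; (4) Adapted Transformer (Yan et al., 2019), a variant of Transformer that uses direction- and distance-aware position encoding. The results of our approach and these methods on the five datasets are respectively shown in Tables 4 and 5. From the results, we have several observations.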
First, compared with the vanilla Transformer, the compared methods that consider distance information consistently achieve better performance. It shows that distance information is very important in context modeling. Second, among the methods with distance information, the performance of Transformer-RPR is lower than the others. This may be because Transformer-RPR does not keep the precise long-distance information. Third, by comparing Transformer-XL and Adapted Transformer, we find that the performance of Adapted Transformer is better on the SST dataset, while Transformer-XL is better on other datasets. This is probably because Adapted Transformer is more suitable for modeling local contexts and the sentences in the SST dataset are usually short, while Transformer-XL may be more appropriate for modeling long sequences. Fourth, our method consistently achieves better performance on the five datasets, and its improvement over the second best method is statistically significant (t-test p<0.05). This is because our method can explicitly encode real distance information rather than using positional encoding, making the modeling of distance more accurate.
We further compare the performance of different methods in a rating regression task on the Amazon dataset. The results are shown in Fig. 3. From Fig. 3 we observe patterns similar to the results in classification tasks, which validates the generality of our DA-Transformer across different kinds of tasks.

Influence of Different Mapping Functions
Next, we study the influence of using different mapping functions $f(\cdot)$ for computing the re-scaled coefficients. We compare the performance of our method w.r.t. several different $f(\cdot)$, including: (1) $f(x) = \min(x, T)$ (clip), using a threshold $T$ to clip the weighted distance; (2) $f(x) = k_i x + b_i$ (linear), applying a linear transformation to the weighted distance; (3) $f(x) = \exp(x)$ (exponent), using an exponential function to map the weighted distance; (4) $f(x) = \frac{1}{1 + \exp(-x)}$ (sigmoid), using the sigmoid function to activate the weighted distance; and (5) $f(x) = \frac{1 + \exp(v_i)}{1 + \exp(v_i - x)}$, our learnable sigmoid function. Due to space limitation, we only present the results on the AG, Amazon and MIND datasets in Fig. 4. From these results, we find that clip is not optimal for mapping the weighted distance. This is because it cannot keep the precise distance information beyond a certain range. In addition, simply using the linear transformation is also insufficient. This may be because our attention adjustment method requires $f(\cdot)$ to be positive, but a linear transformation cannot guarantee this. Besides, we find that the sigmoid function and our proposed function are better than the exponential function. This may be because long sequences will lead to the problem of exponent explosion, which is harmful to context modeling. Moreover, our proposed learnable sigmoid function is better than the standard sigmoid function. It shows that adjusting the activation function in a learnable way can better map the raw distances into re-scaled coefficients.
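The qualitative differences between these candidates are easy to reproduce on a few probe values. In the sketch below, the threshold, slope, intercept and probe points are hypothetical, and the learnable sigmoid uses an assumed closed form that matches the properties listed for it; the snippet only illustrates the failure modes named in the discussion (negative outputs for the linear map, explosion for the exponential map) and says nothing about task performance.

```python
import numpy as np

def f_clip(x, T):
    return np.minimum(x, T)

def f_linear(x, k, b):
    return k * x + b

def f_exp(x):
    return np.exp(x)

def f_sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def f_learnable_sigmoid(x, v):
    # Assumed form: 1 at x = 0, bounded above by 1 + exp(v).
    return (1.0 + np.exp(v)) / (1.0 + np.exp(v - x))

# Weighted distances: a local head (negative), zero, and long ranges.
x = np.array([-3.0, 0.0, 3.0, 500.0])

results = {
    "clip": f_clip(x, T=5.0),                 # distances beyond T collapse
    "linear": f_linear(x, k=1.0, b=1.0),      # can become negative
    "exponent": f_exp(x),                     # explodes on long sequences
    "sigmoid": f_sigmoid(x),                  # fixed range (0, 1)
    "learnable sigmoid": f_learnable_sigmoid(x, v=1.0),
}
for name, values in results.items():
    print(f"{name:>18}: {values}")
```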

Influence of Different Attention Adjusting Methods
Then, we explore the influence of different methods for adjusting the raw attention weights. We consider four different kinds of methods, including: (1) adding the re-scaled coefficients to the attention weights normalized by softmax (late add); (2) multiplying the re-scaled coefficients with the attention weights normalized by softmax (late multiply); (3) adding the re-scaled coefficients to the raw attention weights before normalization (early add), which is widely used in existing methods like Transformer-XL; (4) multiplying the re-scaled coefficients with the raw attention weights activated by ReLU, which is the method used in our approach (early multiply). The results on the AG, Amazon and MIND datasets are shown in Fig. 5. According to these results, we find that early adjustment is better than late adjustment. This may be because the late adjustment methods will change the total amount of attention, which may not be optimal.
In addition, we find that multiplying is better than adding for both early and late adjustment. This may be because adding large re-scaled coefficients may over-amplify some attention weights. For example, if a raw attention weight is relatively small, it is not suitable to add large re-scaled coefficients to it because the corresponding contexts may not have close relations. In contrast, multiplying the re-scaled coefficients will not over-amplify the low attention weights. Moreover, in our early multiply method we further propose to use the ReLU function to introduce sparsity to make the Transformer more "focused". Thus, our method is better than the existing early add method in adjusting the attention weights.
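The four adjustment strategies can be written side by side. The exact formulation of the three ablation variants is not spelled out in the text, so the versions below are assumptions made for illustration, as are the function names, toy sizes and the closed form of the learnable sigmoid. The row sums make the remark about late adjustment concrete: only the early variants keep each row of attention weights summing to one.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def learnable_sigmoid(x, v):
    # Assumed form: 1 at x = 0, bounded above by 1 + exp(v).
    return (1.0 + np.exp(v)) / (1.0 + np.exp(v - x))

def adjusted_attention(Q, K, R_hat, method):
    d = Q.shape[-1]
    raw = Q @ K.T
    if method == "late_add":
        return softmax(raw / np.sqrt(d)) + R_hat
    if method == "late_multiply":
        return softmax(raw / np.sqrt(d)) * R_hat
    if method == "early_add":
        return softmax((raw + R_hat) / np.sqrt(d))
    if method == "early_multiply":
        return softmax(np.maximum(raw, 0.0) * R_hat / np.sqrt(d))
    raise ValueError(f"unknown method: {method}")

# Hypothetical toy queries and keys for a head with w = 1 and v = 1.
rng = np.random.default_rng(0)
N, d = 5, 4
Q, K = rng.normal(size=(N, d)), rng.normal(size=(N, d))
pos = np.arange(N)
R_hat = learnable_sigmoid(1.0 * np.abs(pos[:, None] - pos[None, :]), v=1.0)

methods = ("late_add", "late_multiply", "early_add", "early_multiply")
row_sums = {m: adjusted_attention(Q, K, R_hat, m).sum(axis=-1) for m in methods}
for m in methods:
    print(f"{m:>15}: {np.round(row_sums[m], 3)}")
```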

Model Interpretation
Finally, we interpret our proposed method by visualizing its key parameters and the attention weights. We first visualize the parameters $w_i$ and $v_i$ in our method, which control the preferences of attention heads on long-term or short-term information and the shape of the learnable sigmoid function, respectively. The visualization results on the AG and MIND datasets are respectively shown in Figs. 6 and 7. From Fig. 6, we find it interesting that half of the parameters $w_i$ are positive and the rest are negative. It indicates that half of the attention heads mainly aim to capture local contexts, while the remaining ones are responsible for modeling long-distance contexts. It may be because both short-term and long-term contexts are useful for understanding news topics. In addition, we find that most attention heads have negative $v_i$ while the rest are positive. It shows that on the AG dataset the intensity of attention adjustment is mild in most attention heads. From Fig. 7(a), we find long-term information is somewhat more important than local information in modeling news texts for news recommendation. However, from Fig. 7(b) we find an interesting phenomenon: only one head has a strongly negative $w_i$ while the values of $w_i$ in all the other heads are positive. It means that only one attention head tends to capture short-term user interests while all the other heads prefer to capture long-term user interests. This is intuitive because users usually tend not to intensively click very similar news and their long-term interests may have a more decisive influence on their news clicks. In addition, we find it interesting that on MIND all values of $v_i$ are positive. It may indicate that distance information has a strong impact on the attention weights.
These visualization results show that DA-Transformer can flexibly adjust its preference on short-term or long-term information and the intensity of attention adjustment by learning different values of $w_i$ and $v_i$ according to the task characteristics.
We then visualize the attention weights produced by the vanilla Transformer and the distance-aware attention weights in our DA-Transformer method. The attention weights of a sentence in the AG dataset computed by four different attention heads are respectively shown in Figs. 8(a) and 8(b). From Fig. 8(a), we find it is difficult to interpret the self-attention weights because they are too "soft". In addition, it is difficult for us to understand the differences between the information captured by different attention heads. Different from the vanilla Transformer, from Fig. 8(b) we find that the attention weights obtained by our method are more sparse, indicating that the attention mechanism in our method is more focused. In addition, it is easier for us to interpret the results by observing the attention heatmap. For example, the first two heatmaps in Fig. 8(b) are produced by the two attention heads with preferences on short-term contexts. We can see that they mainly capture the relations among local contexts, such as the relations between "biotech" and "sector". Differently, in the latter two heatmaps obtained by the two attention heads that prefer long-term contexts, we can observe that the model tends to capture the relations between a word (e.g., "biotech") and the global contexts. These results show that different attention heads in our method are responsible for capturing different kinds of information, and their differences can be directly observed from the self-attention weights. Thus, our method can be better interpreted than vanilla Transformers.
Fig. 8 panels: (a) Vanilla Transformer. (b) DA-Transformer. The first two heatmaps are produced by heads with $w_i < 0$ and the others are produced by heads with $w_i > 0$.

Conclusion
In this paper, we propose a distance-aware Transformer, which can leverage the real distance between contexts to adjust the self-attention weights for better context modeling. We propose to first use different learnable parameters in different attention heads to weight the real relative distance between tokens. Then, we propose a learnable sigmoid function to map the weighted distances into re-scaled coefficients with proper ranges. They are further multiplied with the raw attention weights that are activated by the ReLU function to keep non-negativity and produce sharper attention. Extensive experiments on five benchmark datasets show that our approach can effectively improve the performance of Transformer by introducing real distance information to facilitate context modeling.