Self-Attentive Residual Decoder for Neural Machine Translation

Neural sequence-to-sequence networks with attention have achieved remarkable performance for machine translation. One of the reasons for their effectiveness is their ability to capture relevant source-side contextual information at each prediction time step through an attention mechanism. However, the target-side context is based solely on the sequence model which, in practice, is prone to a recency bias and cannot effectively capture non-sequential dependencies among words. To address this limitation, we propose a target-side-attentive residual recurrent network for decoding, where attention over previous words contributes directly to the prediction of the next word. The residual learning facilitates the flow of information from the distant past and is able to emphasize any of the previously translated words, hence gaining access to a wider context. The proposed model outperforms a neural MT baseline as well as a memory and a self-attention network on three language pairs. The analysis of the attention learned by the decoder confirms that it emphasizes a wider context, and that it captures syntactic-like structures.


Introduction
Neural machine translation (NMT) has recently become the state-of-the-art approach to machine translation (Bojar et al., 2016). Several architectures have been proposed for this task (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014; Gehring et al., 2017; Vaswani et al., 2017), but the attention-based NMT model designed by Bahdanau et al. (2015) is still considered the de-facto baseline. This architecture is composed of two recurrent neural networks (RNNs), an encoder and a decoder, and an attention mechanism between them for modeling a soft word-alignment. First, the model encodes the complete source sentence, and then decodes one word at a time. The decoder has access to all the context on the source side through the attention mechanism. However, on the target side, the contextual information is represented only through a fixed-length vector, namely the hidden state of the decoder. As observed by Bahdanau et al. (2015), this creates a bottleneck which hinders the ability of the sequential model to learn longer-term information effectively.

[Figure 1: (a) Baseline NMT decoder; (b) Self-attentive residual decoder.]
As pointed out by Cheng et al. (2016), sequential models present two main problems for natural language processing. First, the memory of the encoder is shared across multiple words and is biased towards the recent past. Second, such models do not fully capture the structural composition of language. To address these limitations, several recent models have been proposed, namely memory networks (Cheng et al., 2016; Tran et al., 2016) and self-attention networks (Daniluk et al., 2016; Liu and Lapata, 2018). We experimented with these methods, applying them to NMT: the memory RNN (Cheng et al., 2016) and the self-attentive RNN (Daniluk et al., 2016). However, we observed no significant gains in performance over the baseline architecture.
In this paper, we propose a self-attentive residual recurrent decoder, presented in Figure 1b, which, if unfolded over time, represents a densely-connected residual network. The self-attentive residual connections focus selectively on previously translated words and propagate useful information to the output of the decoder, within an attention-based NMT architecture. The attention paid to the previously predicted words is analogous to a read-only memory operation, and enables the learning of syntactic-like structures which are useful for the translation task.
Our evaluation on three language pairs shows that the proposed model improves over several baselines, with only a small increase in computational overhead. In contrast, other similar approaches score lower while incurring a higher computational overhead. The contributions of this paper can be summarized as follows:
• We propose and compare several options for using self-attentive residual learning within a standard decoder, which facilitates the flow of contextual information on the target side.
• We demonstrate consistent improvements over a standard baseline, as well as over two advanced variants which make use of memory and self-attention, on three language pairs (English-to-Chinese, Spanish-to-English, and English-to-German).
• We perform an ablation study and analyze the learned attention function, providing additional insights on its actual contributions.

Related Work
Several approaches have been proposed to enhance sequential models by capturing longer contexts. Long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) is the most commonly used recurrent neural network (RNN), because its internal memory allows it to retain information from a more distant past than a vanilla RNN. Several studies attempt to increase the memory capacity of LSTMs by using memory networks (Sukhbaatar et al., 2015). For instance, Cheng et al. (2016) incorporate different memory cells for each previous output representation, which are later accessed by an attention mechanism. Tran et al. (2016) include a memory block to access recent input words in a selective manner. Both methods show improvements on language modeling. For NMT, a decoder enhanced with an external shared memory has also been presented. Memory networks extend the capacity of the network and have the potential to read, write, and forget information. Our method, which attends over previously predicted words, can be seen as a read-only memory, which is simpler and computationally more efficient because it does not require additional memory space.
Other studies aim to improve the modeling of source-side contextual information, for example through a context-aware encoder using self-attention (Zhang et al., 2017), or a recurrent attention NMT (Yang et al., 2017) that is aware of previously attended words on the source side in order to better predict which words will be attended to in the future. Additionally, variational NMT (Zhang et al., 2016a) introduces a latent variable to model the underlying semantics of source sentences. In contrast to these studies, we focus on the contextual information on the target side.
The application of self-attention mechanisms to RNNs has been studied previously, and in general such mechanisms seem to capture syntactic dependencies among distant words (Liu and Lapata, 2018; Soltani and Jiang, 2016; Lin et al., 2017). Daniluk et al. (2016) explore different approaches to self-attention for language modeling, leading to improvements over a baseline LSTM and over memory-augmented methods. However, these methods do not fully utilize a longer context. The main difference with our approach is that we apply attention to the output embeddings rather than the hidden states. Thus, the connections are independent of the recurrent layer representations, which is beneficial for NMT, as we show below.
Our model relies on residual connections, which have been shown to improve the learning process of deep neural networks by addressing the vanishing gradient problem (He et al., 2016). These connections create a direct path from previous layers, helping the transmission of information. Recently, several architectures using residual connections with LSTMs have been proposed for sequence prediction (Zhang et al., 2016b;Kim et al., 2017;Zilly et al., 2017;Wang and Tian, 2016). To our knowledge, our study is the first one to use self-attentive residual connections within residual RNNs for NMT. In parallel to our study, a similar method was recently proposed for sentiment analysis (Wang, 2017).

Background: Neural Machine Translation
Neural machine translation aims to model the conditional distribution of emitting a sentence in a target language given a sentence in a source language, denoted by p_Θ(y|x), where Θ is the set of parameters of the neural model, and x = (x_1, ..., x_m) and y = (y_1, ..., y_n) are respectively the source and target sentences, represented as sequences of words. The parameters Θ are learned by training a sequence-to-sequence neural model on a corpus of parallel sentences. In particular, the learning objective is to maximize the following conditional log-likelihood:

L(Θ) = Σ_{(x,y)} log p_Θ(y | x) = Σ_{(x,y)} Σ_{t=1}^{n} log p_Θ(y_t | y_{<t}, x)    (1)

The models typically use gated recurrent units (GRUs) (Cho et al., 2014) or LSTMs (Hochreiter and Schmidhuber, 1997). The architecture has three main components: an encoder, a decoder, and an attention mechanism. The goal of the encoder is to build meaningful representations of the source sentence. It consists of a bidirectional RNN which includes contextual information from past and future words into the vector representation h_i of a particular word vector x_i, formally defined as follows:

h_i = [→h_i ; ←h_i],    →h_i = f(x_i, →h_{i−1}),    ←h_i = f(x_i, ←h_{i+1})    (2)

Here, →h_i and ←h_i are the hidden states of the forward and backward passes of the bidirectional RNN respectively, and f is a non-linear function.
The decoder (see Figure 1a) is in essence a recurrent language model. At each time step, it predicts a target word y_t conditioned on the previous words and the information from the encoder, using the following posterior probability:

p(y_t | y_1, ..., y_{t−1}, x) = g(s_t, y_{t−1}, c_t)    (3)

where g is a non-linear multilayer function. The hidden state of the decoder s_t is defined as:

s_t = f(s_{t−1}, y_{t−1}, c_t)    (4)

and depends on a context vector c_t that is computed by the attention mechanism. The attention mechanism allows the decoder to select which parts of the source sentence are more useful for predicting the next output word. This goal is achieved by considering a weighted sum over all hidden states of the encoder as follows:

c_t = Σ_{i=1}^{m} α_{ti} h_i    (5)

where α_{ti} is a weight calculated using a normalized exponential function a, also known as the alignment function, which computes how good the match is between the input at position i ∈ {1, ..., m} and the output at position t:

α_{ti} = exp(a(s_{t−1}, h_i)) / Σ_{k=1}^{m} exp(a(s_{t−1}, h_k))    (6)

Different types of alignment functions have been used for NMT, as investigated by Luong et al. (2015). Here, we use the one originally defined by Bahdanau et al. (2015).
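To make the attention computation above concrete, the context vector can be sketched in a few lines of numpy. This is an illustration only, not the dl4mt implementation: the encoder states, decoder state, and parameter shapes below are random toy values, and the alignment function is the Bahdanau-style additive score.

```python
import numpy as np

def attention_context(h, s_prev, W_a, U_a, v_a):
    """Compute the context vector c_t as a weighted sum of encoder states.

    h      : (m, 2d) encoder hidden states, one row per source position
    s_prev : (d,)    previous decoder state s_{t-1}
    W_a, U_a, v_a : parameters of an additive (Bahdanau-style) alignment
    """
    # score for each source position: v^T tanh(W_a s_{t-1} + U_a h_i)
    scores = np.tanh(h @ U_a.T + s_prev @ W_a.T) @ v_a   # (m,)
    # normalized exponential over source positions (stable softmax)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    # weighted sum of encoder states
    return alpha @ h, alpha

rng = np.random.default_rng(0)
m, d = 5, 4
h = rng.normal(size=(m, 2 * d))
s_prev = rng.normal(size=d)
W_a = rng.normal(size=(2 * d, d))      # projects the decoder state
U_a = rng.normal(size=(2 * d, 2 * d))  # projects the encoder states
v_a = rng.normal(size=2 * d)

c_t, alpha = attention_context(h, s_prev, W_a, U_a, v_a)
```

The context vector keeps the encoder dimensionality (2d), and the weights form a probability distribution over the m source positions.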

Self-Attentive Residual Decoder
The decoder of the attention-based NMT model uses a skip connection from the previously predicted word to the output classifier in order to enhance the translation performance. As we can see in Eq. (3), the probability of a particular word is calculated by a function g which takes as input the hidden state of the recurrent layer s_t, the representation of the previously predicted word y_{t−1}, and the context vector c_t. Within g, these quantities are typically summed after going through simple linear transformations, hence the addition of y_{t−1} is indeed a skip connection as in residual networks (He et al., 2016). In theory, s_t should be sufficient for predicting the next word, given that it depends on the other two local-context components according to Eq. (4). However, the y_{t−1} term makes the model emphasize the last predicted word when generating the next one. How can we make the model consider a broader context?
To answer this question, we propose to include in the decoder's formula skip connections not only from the previous time step y_{t−1}, but from all previous time steps y_0 to y_{t−1}. This defines a residual recurrent network which, unfolded over time, can be seen as a densely connected residual network. These connections are applied to all previously predicted words, and reinforce the memory of the recurrent layer towards what has been translated so far. At each time step, the model decides which of the previously predicted words should be emphasized to predict the next one. In order to deal with the dynamic length of this new input, we use a target-side summary vector d_t that can be interpreted as the representation of the decoded sentence up to time t in the word embedding space. We therefore modify Eq. (3), replacing y_{t−1} with d_t:

p(y_t | y_1, ..., y_{t−1}, x) = g(s_t, d_t, c_t)    (7)

The replacement of y_{t−1} with d_t means that the number of parameters added to the model depends only on the calculation of d_t. Figure 1b illustrates the change made to the decoder. We define two methods for summarizing the context into d_t, described in the following sections.

Mean Residual Connections
One simple way to aggregate information from multiple word embeddings is to average them. This average can be seen as the representation of the sentence up to time t. We hypothesize that this representation is more informative than using only the embedding of the previous word. Formally:

d_t = (1/t) Σ_{i=0}^{t−1} y_i    (8)
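The mean summary can be computed incrementally during decoding by keeping a running sum, so each step costs only O(e). A minimal sketch (the toy embeddings are illustrative values, not trained ones):

```python
import numpy as np

def mean_summary(prev_embeddings):
    """d_t: average of the embeddings of all previously predicted words.

    prev_embeddings : (t, e) array, rows are y_0 .. y_{t-1}
    """
    return prev_embeddings.mean(axis=0)

class RunningMean:
    """Incremental form used while decoding: one O(e) update per step."""
    def __init__(self, e):
        self.sum = np.zeros(e)
        self.t = 0

    def update(self, y_emb):
        self.sum += y_emb
        self.t += 1
        return self.sum / self.t

E = np.arange(6, dtype=float).reshape(3, 2)  # toy embeddings of y_0 .. y_2
rm = RunningMean(2)
for row in E:
    d_t = rm.update(row)
print(d_t)  # [2. 3.], identical to mean_summary(E)
```

The incremental and batch forms agree exactly, which is what makes this summary cheap to maintain at decoding time.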

Self-Attentive Residual Connections
Averaging is a simple and cheap way to aggregate information from multiple words, but may not be sufficient for all kinds of dependencies. Instead, we propose a dynamic way to aggregate information in each sentence, such that different words have different importance according to their relation to the prediction of the next word. We propose to use a shared self-attention mechanism to obtain a summary representation of the translation so far, i.e. a weighted average of the representations of the translated words y_0 to y_{t−1}. This mechanism aims to model, in part, important non-sequential dependencies among words, and serves as a complementary memory to the recurrent layer.
The weights of the attention model are computed by a scoring function e_{ti} that predicts how important each previous word (y_0, ..., y_{t−1}) is for the current prediction y_t.
We experiment with two different scoring functions:

e_{ti} = v^⊤ tanh(W_y y_i + W_s s_t)    (content+scope)    (9)

e_{ti} = v^⊤ tanh(W_y y_i)    (content)    (10)

where v ∈ R^{e}, W_y ∈ R^{e×e}, and W_s ∈ R^{e×d} are weight matrices, and e and d are the dimensions of the embeddings and hidden states respectively. The weights are normalized over the previous positions, and the summary vector is the corresponding weighted average:

α_{ti} = exp(e_{ti}) / Σ_{k=0}^{t−1} exp(e_{tk}),    d_t = Σ_{i=0}^{t−1} α_{ti} y_i    (11)

The first scoring function, noted content+scope, follows the one proposed by Bahdanau et al. (2015) for NMT. The second, noted content, is calculated based only on the representations of the previously predicted words, as proposed by Pappas and Popescu-Belis (2017).
In contrast to the first attention function, which makes use of the hidden vector s_t, the second one is based only on the previous word representations, and is therefore independent of the representation of the current prediction. However, the normalization of this function still depends on t.
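The content scoring variant can be sketched as follows (random toy parameters; an illustration of the technique, not the trained model). Because the scores depend only on the words themselves, tanh(W_y y_i) can be cached across decoding steps; only the softmax normalization changes with t:

```python
import numpy as np

def content_summary(Y_prev, v, W_y):
    """Self-attentive summary d_t with the content scoring function.

    Y_prev : (t, e) embeddings of previously predicted words y_0 .. y_{t-1}
    Scores: e_ti = v^T tanh(W_y y_i); weights: softmax over i; then
    d_t = sum_i alpha_ti y_i.
    """
    scores = np.tanh(Y_prev @ W_y.T) @ v   # (t,), cacheable per word
    alpha = np.exp(scores - scores.max())  # softmax normalization depends on t
    alpha /= alpha.sum()
    return alpha @ Y_prev, alpha

rng = np.random.default_rng(1)
t, e = 4, 3
Y = rng.normal(size=(t, e))
v = rng.normal(size=e)
W_y = rng.normal(size=(e, e))
d_t, alpha = content_summary(Y, v, W_y)
```

The summary d_t stays in the word embedding space, which is what allows it to replace y_{t−1} in the output layer without further projection.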

Other Self-Attentive Networks
To compare our approach with similar studies, we adapted two representative self-attentive networks for application to NMT.

Memory RNN
The Memory RNN decoder is based on the proposal by Cheng et al. (2016) to modify an LSTM layer to include a memory with different cells for each previous output representation. Thus, at each time step, the hidden layer can select past information dynamically from the memory. To adapt it to our framework, we modify Eq. (4) as:

s_t = f(s̄_t, y_{t−1}, c_t)

where s̄_t = Σ_{i=1}^{t−1} β_{ti} s_i is a summary of the previous hidden states, with weights β_{ti} computed by an attention mechanism (see Appendix A for the full formulation).
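The read step of this adaptation can be sketched as follows. This is a simplified illustration with random toy parameters (the function and variable names are ours); it mirrors the attention over stored hidden states, with the result fed to the recurrent update in place of s_{t−1}:

```python
import numpy as np

def memory_read(S_prev, s_query, v, W_m, W_s):
    """Summarize the stored decoder states s_1 .. s_{t-1} into one vector.

    S_prev  : (t-1, d) previous hidden states kept as memory cells
    s_query : (d,)     state used to address the memory (here s_{t-1})
    """
    # additive attention scores over the memory cells
    scores = np.tanh(S_prev @ W_m.T + s_query @ W_s.T) @ v
    beta = np.exp(scores - scores.max())
    beta /= beta.sum()
    # s_bar_t: weighted read, used by the GRU gates instead of s_{t-1}
    return beta @ S_prev

rng = np.random.default_rng(2)
d = 4
S = rng.normal(size=(3, d))           # three stored states
s_bar = memory_read(S, S[-1], rng.normal(size=d),
                    rng.normal(size=(d, d)), rng.normal(size=(d, d)))
```

Unlike our read-only residual connections, this memory sits inside the recurrence, which is what makes it more expensive per step.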

Self-Attentive RNN
The Self-Attentive RNN is the simplest variant proposed by Daniluk et al. (2016), and incorporates a summary vector of past predictions calculated with an attention mechanism. Here, the attention is applied over the previous hidden states. This decoder is formulated as follows:

p(y_t | y_1, ..., y_{t−1}, c_t) ≈ softmax(W_o tanh(W_st s_t + W_yt y_{t−1} + W_ct c_t + W_mt s̄_t))

where s̄_t is an attention-weighted sum of the previous hidden states s_1, ..., s_{t−1}. Additional details of the formulations in Sections 3, 4, and 5 are described in Appendix A.
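The output layer shared by these decoders, which sums linearly transformed inputs before a tanh and a softmax, can be sketched as follows (random toy parameters and sizes; the parameter names follow the formulation above, but the values are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # numerically stable
    p = np.exp(z)
    return p / p.sum()

def output_distribution(s_t, y_prev, c_t, s_bar, P):
    """p(y_t|...) = softmax(W_o tanh(W_st s_t + W_yt y_prev
                                     + W_ct c_t + W_mt s_bar))."""
    pre = (P['W_st'] @ s_t + P['W_yt'] @ y_prev
           + P['W_ct'] @ c_t + P['W_mt'] @ s_bar)
    return softmax(P['W_o'] @ np.tanh(pre))

rng = np.random.default_rng(4)
V, e, d = 10, 3, 4  # toy vocabulary, embedding, and hidden sizes
P = {'W_o': rng.normal(size=(V, e)), 'W_st': rng.normal(size=(e, d)),
     'W_yt': rng.normal(size=(e, e)), 'W_ct': rng.normal(size=(e, 2 * d)),
     'W_mt': rng.normal(size=(e, d))}
p = output_distribution(rng.normal(size=d), rng.normal(size=e),
                        rng.normal(size=2 * d), rng.normal(size=d), P)
```

In our model, the y_prev term of this readout is replaced by the summary vector d_t, which is the only change to the output layer.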
6 Experimental Settings

Datasets
To evaluate the proposed MT models in different conditions, we select three language pairs with increasing amounts of training data: English-Chinese (0.5M sentence pairs), Spanish-English (2.1M), and English-German (4.5M).
For English-to-Chinese, we use a subset of the UN parallel corpus (Rafalovitch and Dale, 2009), with 0.5M sentence pairs for training, 2K for development, and 2K for testing. For training Spanish-to-English MT, we use a subset of WMT 2013 (Bojar et al., 2013), corresponding to Europarl v7 and News Commentary v11 with ca. 2.1M sentence pairs. Newstest2012 and Newstest2013 were used for development and testing respectively. Finally, we use the complete English-to-German set from WMT 2016 (Bojar et al., 2016) with a total of ca. 4.5M sentence pairs. The development set is Newstest2013, and the testing set is Newstest2014. Additionally, we include Newstest2015 and Newstest2016 as testing sets, for comparison with the state of the art. We report translation quality using (a) BLEU over tokenized and truecased texts, and (b) NIST BLEU over detokenized and detruecased texts.

Model Configuration
We use the implementation of the attention-based NMT baseline provided in the dl4mt-tutorial, developed in Python using Theano (Theano Development Team, 2016). The system implements the attention-based NMT model described above, using one layer of GRUs (Cho et al., 2014). The vocabulary size is 25K for English-to-Chinese NMT, and 50K for Spanish-to-English and English-to-German. We use the byte pair encoding (BPE) strategy for out-of-vocabulary words (Sennrich et al., 2016b). For all cases, the maximum sentence length of the training samples is 50, the dimension of the word embeddings is 500, and the dimension of the hidden layers is 1,024. We use dropout with a probability of 0.5 after each layer. The parameters of the models are initialized randomly from a standard normal distribution scaled by a factor of 0.01. The loss function is optimized using Adadelta (Zeiler, 2012) with ε = 10^−6 and ρ = 0.95 as in the original paper. The systems were trained for 7-12 days per model on a Tesla K40 GPU, at a speed of about 1,000 words/sec.

[Table 1: BLEU score (multi-bleu) on tokenized text. The highest score per dataset is marked in bold. The self-attentive residual connections make use of the content attention function. |Θ| indicates the number of parameters per model.]

Table 1 shows the BLEU scores and the number of parameters used by the different NMT models. Along with the NMT baseline, we include a statistical machine translation (SMT) model based on Moses (Koehn et al., 2007) with the same training/tuning/test data as the NMT systems. The performance of the memory RNN is similar to the baseline and, as confirmed later, its focus of attention is mainly on the prediction at t − 1. The self-attentive RNN is inferior to the baseline, which can be attributed to the overhead on the hidden vectors, which have to learn the recurrent representations and the attention simultaneously.
The proposed models outperform the baseline, and the best scores are obtained by the NMT model with self-attentive residual connections. Despite their simplicity, the mean residual connections already improve the translation, without increasing the number of parameters. Tables 2 and 3 show further experiments with the proposed methods on various English-German test sets, compared to several previous systems.

Models                                        NT14    NT15
NMT (unk. word repl.) (Luong et al., 2015)    20.9    -
Context-aware NMT (Zhang et al., 2017)        22.57   -
Recurrent attention NMT (Yang et al., 2017)

Table 2 shows BLEU values calculated by multi-bleu, and includes the NMT system proposed by Luong et al. (2015), which replaces unknown predicted words with the most strongly aligned word in the source sentence. The table also includes other systems described in Section 2. Additionally, Table 3 shows values calculated by the NIST BLEU scorer, as well as results reported by the winning WMT systems for each test set respectively: UEDIN-SYNTAX (Williams et al., 2014), UEDIN-SYNTAX (Williams et al., 2015), and UEDIN-NMT (Sennrich et al., 2016a). We also include the results reported by Sennrich et al. (2016b) for a baseline encoder-decoder NMT with BPE for unknown words, similar to our configuration, and finally the system proposed by Nadejde et al. (2017), an explicitly syntax-aware NMT that introduces combinatory categorial grammar (CCG) supertags on the target side by predicting words and tags alternately. The comparison with this work is relevant for the analysis described later in Section 8.2. The results confirm that the self-attentive residual connections significantly improve the translations. To evaluate the significance of the improvements over the NMT baseline, we performed a one-tailed paired t-test.

Impact of the Attention Function
We now examine the two scoring functions that can be used for the self-attentive residual connections model, considering English-to-Chinese and Spanish-to-English. The BLEU scores are presented in Table 4: the best option is the content matching function, which depends only on the word embeddings. The content+scope function, which additionally depends on the hidden representation of the current prediction, is better than the baseline but scores lower than content. The idea that the importance of the context depends on the current prediction is appealing, because it can be interpreted as learning internal dependencies among words. However, the experimental results show that it does not necessarily lead to the best translation. On the contrary, the content attention function may be extracting representations of the whole sentence which are easier to learn and generalize.

Performance According to Human Evaluation
Manual evaluation on samples of 50 sentences for each language pair helped to corroborate the conclusions obtained from the BLEU scores, and to provide a qualitative understanding of the improvements brought by our model. For each language, we employed one evaluator who was a native speaker of the target language and had good knowledge of the source language. The evaluators ranked three translations of the same source sentence, one from each of our models (baseline, mean residual connections, and self-attentive residual connections), according to their translation quality. The three translations were presented in a random order, so that the system that had generated them could not be identified. To integrate the judgments, we proceed in pairs, and count the number of times each system was ranked higher than, equal to, or lower than another competing system. The results shown in Table 5 indicate that the self-attentive residual connections model outperforms the one with mean residual connections, and both outperform the baseline, for all three language pairs. The rankings are thus identical to those obtained using BLEU in Tables 1 and 3.

Systems                                       d    Perplexity
LSTM (Daniluk et al., 2016)                   300  85.2
LSTM + Attention (Daniluk et al., 2016)       296  82.0
LSTM + 4-gram (Daniluk et al., 2016)          968  75.9
LSTM + Mean residual connections              296  80.2
LSTM + Self-attentive residual connections    296  80.4

[Table 6: Language modeling perplexity on the Wikipedia corpus.]

Performance on Language Modeling
To examine whether language modeling (LM) can benefit from the proposed method, we incorporate the residual connections into a neural LM. We use the same setting as Daniluk et al. (2016) for a corpus of Wikipedia articles (22.5M words), and we compare with two methods proposed in the same paper, namely the attention LSTM and the 4-gram LSTM. As shown in Table 6, the proposed models outperform the LSTM baseline as well as the self-attention model, but not the 4-gram LSTM. Experiments using the 4-gram LSTM for NMT showed poor performance (13.9 BLEU points for English-Chinese), which can be attributed to the difference between the LM and NMT tasks. Both tasks predict one word at a time conditioned on previous words; however, in NMT the previous target-word inputs are not given but have to be generated by the decoder. Thus, the output can be conditioned on previous erroneous predictions, which affects the 4-gram LSTM model to a greater extent. This shows that even if a model improves language modeling, it does not necessarily improve machine translation.

[Figure 2: Percentage of words that received maximum attention at a given relative position, ranging from −1 to −50 (maximum length).]
8 Qualitative Analysis

8.1 Distribution of Attention

Figure 2 shows a comparison of the distribution of attention of the different self-attentive models described in this paper, on Spanish-to-English NMT (the other two language pairs exhibit similar distributions). The values correspond to the number of words which received maximal attention at each relative position (x-axis). We selected, at each prediction, the preceding word with the maximal weight, and counted its relative position. We normalized the count by the number of previous words available at the time of each prediction. We observe that the memory RNN almost always selects the immediately previous word (t−1) and ignores the rest of the context. On the contrary, the other two models distribute attention more evenly among all previous words. In particular, the self-attentive RNN uses a longer context than the self-attentive residual connections but, as the BLEU scores show, this does not necessarily mean better translation.

Figure 3 shows the attention to previous words generated by each model for one sentence translated from Spanish to English. The matrices present the target-side attention weights, with the vertical axis indicating the previous words, and the color shades at each position (cell) representing the attention weights. The weights of the memory RNN are concentrated on the diagonal, indicating that the attention is generally located on the previous word, which makes the model almost equivalent to the baseline. The weights of the self-attentive RNN show that attention is more distributed towards the distant past, and they vary for each word because the attention function depends on the current prediction. This model tries to find dependencies among words, although complex relations seem difficult to learn.
On the contrary, the proposed self-attentive residual connections model strongly focuses on particular words, and we present a wider analysis of it in the following section.
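The position statistics plotted in Figure 2 can be reproduced with a short counting routine like the following (a sketch with toy attention rows, not model outputs; the function name is ours):

```python
import numpy as np

def max_attention_positions(weight_rows, max_len=50):
    """Fraction of predictions whose maximal attention fell on each relative
    position (index 0 = position -1, the immediately preceding word),
    normalized by how often that position was available at all."""
    counts = np.zeros(max_len)
    avail = np.zeros(max_len)
    for w in weight_rows:            # w holds weights over y_0 .. y_{t-1}
        t = len(w)
        rel = t - int(np.argmax(w))  # distance back: 1 .. t
        counts[rel - 1] += 1
        avail[:t] += 1               # positions -1 .. -t were available
    out = np.zeros(max_len)
    np.divide(counts, avail, out=out, where=avail > 0)
    return out

# Three decoding steps: the first two attend to the previous word,
# the third jumps back to the sentence start.
rows = [np.array([1.0]), np.array([0.2, 0.8]), np.array([0.5, 0.3, 0.2])]
dist = max_attention_positions(rows)
print(dist[:3])  # approximately [0.667, 0., 1.]
```

Normalizing by availability matters: distant positions exist only for long prefixes, so raw counts alone would understate attention to the distant past.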

8.2 Structures Learned by the Model
When visualizing the matrix of attention weights generated by our model (Figure 3c), we observed the formation of sub-phrases which are grouped depending on their attention to previous words. To build the sub-phrases in a deterministic fashion, we implemented Algorithm 1, which iteratively splits the sentence into two sub-phrases every time the focus of attention changes to a new word, from left to right. The results are binary tree structures containing the sub-phrases, exemplified in Figure 4.
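A possible reading of Algorithm 1, reconstructed from the description above (the pseudocode itself is not reproduced in this copy, so this sketch is one plausible rendering, not the authors' exact code):

```python
def split_by_attention(words, focus):
    """Build a binary tree of sub-phrases: open a new right branch whenever
    the maximally attended previous word (the focus) changes, left to right.

    words : list of target words
    focus : list of the same length; focus[t] identifies the word that
            received maximal attention when predicting words[t]
    Returns a nested [left, right] structure whose leaves are word lists.
    """
    if len(words) <= 1:
        return words
    for t in range(1, len(words)):
        if focus[t] != focus[t - 1]:  # attention moved: split here
            return [words[:t], split_by_attention(words[t:], focus[t:])]
    return words  # focus never changed: a single sub-phrase

sent = ["the", "black", "cat", "sleeps"]
foc = [0, 0, 0, 2]  # attention stays on word 0, then jumps to word 2
print(split_by_attention(sent, foc))
# [['the', 'black', 'cat'], ['sleeps']]
```

Because splits are introduced only at focus changes, a sentence whose attention never moves stays a single chunk, which matches the chunk-like structures discussed below.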
We formally evaluate the syntactic properties of the binary tree structures by comparing them with the output of an automatic constituent parser (Manning et al., 2014), using the ParsEval approach (Black et al., 1991), i.e. by counting the precision and recall of constituents, excluding single words. Our model reaches a precision of 0.56, which is better than the precision of 0.45 obtained by a trivial right-branching tree model. Note that these structures were neither optimized for parsing nor learned using part-of-speech tagging as most parsers do. Our interpretation of the results is that they are "syntactic-like" structures. However, given the simplicity of the model, they could also be viewed as more limited structures, similar to sentence chunks.

Table 7 shows examples of translations produced with the baseline and the self-attentive residual connections model. The first part shows examples for which the proposed model reached a higher BLEU score than the baseline. Here, the structure of the sentences, or at least the word order, is improved. The second part contains examples where the baseline achieved a better BLEU score than our model. In the first example, the structure of the sentence is different but the content and quality are similar, while in the second one the lexical choices differ from the reference.
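The ParsEval-style precision used above can be sketched as follows, on nested-list trees whose leaves are words (an illustration under our own simplified span-matching convention, not the official evaluation script):

```python
def constituent_spans(tree, start=0):
    """Collect (start, end) spans of all multi-word constituents in a
    nested list-of-lists tree whose leaves are word strings."""
    if isinstance(tree, str):
        return set(), 1
    spans, length = set(), 0
    for child in tree:
        child_spans, child_len = constituent_spans(child, start + length)
        spans |= child_spans
        length += child_len
    if length > 1:                    # skip single words, as in the text
        spans.add((start, start + length))
    return spans, length

def parseval_precision(predicted, gold):
    """Fraction of predicted constituent spans that also occur in gold."""
    p, _ = constituent_spans(predicted)
    g, _ = constituent_spans(gold)
    return len(p & g) / len(p) if p else 0.0

pred = [[["the", "black"], "cat"], "sleeps"]
gold = [["the", ["black", "cat"]], "sleeps"]
print(round(parseval_precision(pred, gold), 3))  # 0.667
```

Here pred and gold share the spans (0,3) and (0,4) but disagree on the inner bracketing, giving a precision of 2/3.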

Conclusion
We presented a novel decoder which uses self-attentive residual connections to previously translated words in order to enrich the target-side contextual information in NMT. To cope with the variable lengths of previous predictions, we proposed two methods for context summarization: mean residual connections and self-attentive residual connections. Additionally, we showed how similar previous proposals, designed for language modeling, can be adapted to NMT. We evaluated the methods on three language pairs: English-to-Chinese, Spanish-to-English, and English-to-German. In each case, we improved the BLEU score compared to the NMT baseline and two variants with memory-augmented decoders. A manual evaluation over a small set of sentences for each language pair confirmed the improvement. Finally, a qualitative analysis showed that the proposed model distributes weights throughout an entire sentence, and learns structures resembling syntactic ones.
As future work, we plan to enrich the present attention mechanism with the key-value-prediction technique (Daniluk et al., 2016; Miller et al., 2016), which was shown to be useful for language modeling. Moreover, we will incorporate relative positional information into the attention function. To encourage further research in self-attentive residual connections for NMT and other similar tasks, our code is made publicly available.

A.3 Decoder
The inputs of the decoder are the previous word y_{t−1} and the context vector c_t; the objective is to predict y_t. The hidden states of the decoder s = (s_1, ..., s_n) are initialized with the mean of the context vectors:

s_0 = tanh(W_init · (1/m) Σ_{i=1}^{m} h_i)

where W_init ∈ R^{d×2d} is a weight matrix and m is the size of the source sentence. The following hidden states are calculated with a GRU conditioned on the context vector at time t, as follows:

s_t = z_t ⊙ s_{t−1} + (1 − z_t) ⊙ s̃_t

where

s̃_t = tanh(W E y_{t−1} + U [r_t ⊙ s_{t−1}] + C c_t)
z_t = σ(W_z E y_{t−1} + U_z s_{t−1} + C_z c_t)
r_t = σ(W_r E y_{t−1} + U_r s_{t−1} + C_r c_t)

Here, E ∈ R^{e×V} is the embedding matrix for the target language, and W, W_z, W_r ∈ R^{d×e}, U, U_z, U_r ∈ R^{d×d}, and C, C_z, C_r ∈ R^{d×2d} are weight matrices.

In the attention-based NMT model, the probability of a target word y_t is given by:

p(y_t | s_t, y_{t−1}, c_t) = softmax(W_o tanh(W_st s_t + W_yt y_{t−1} + W_ct c_t))

Here, W_o ∈ R^{V×e}, W_st ∈ R^{e×d}, W_yt ∈ R^{e×e}, and W_ct ∈ R^{e×2d} are weight matrices.
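As a sanity check on the gate equations, the conditional GRU update can be written out directly. This sketch uses random toy parameters, and y_prev stands for the already embedded previous word (i.e. E y_{t−1}):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cond_gru_step(y_prev, s_prev, c_t, P):
    """One step of the context-conditioned GRU; * is the elementwise
    (Hadamard) product written as a circle-dot in the text."""
    z = sigmoid(P['Wz'] @ y_prev + P['Uz'] @ s_prev + P['Cz'] @ c_t)
    r = sigmoid(P['Wr'] @ y_prev + P['Ur'] @ s_prev + P['Cr'] @ c_t)
    s_tilde = np.tanh(P['W'] @ y_prev + P['U'] @ (r * s_prev) + P['C'] @ c_t)
    return z * s_prev + (1.0 - z) * s_tilde

rng = np.random.default_rng(3)
e, d = 3, 4
P = {'W':  rng.normal(size=(d, e)), 'U':  rng.normal(size=(d, d)),
     'C':  rng.normal(size=(d, 2 * d)),
     'Wz': rng.normal(size=(d, e)), 'Uz': rng.normal(size=(d, d)),
     'Cz': rng.normal(size=(d, 2 * d)),
     'Wr': rng.normal(size=(d, e)), 'Ur': rng.normal(size=(d, d)),
     'Cr': rng.normal(size=(d, 2 * d))}
s_t = cond_gru_step(rng.normal(size=e), rng.normal(size=d),
                    rng.normal(size=2 * d), P)
```

The update gate z interpolates between the previous state and the candidate state, which is what lets the recurrence retain or overwrite information per dimension.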

A.3.1 Self-Attentive Residual Connections
In our model, the probability of a target word y_t is given by:

p(y_t | s_t, d_t, c_t) = softmax(W_o tanh(W_st s_t + W_dt d_t + W_ct c_t))

Here, W_o ∈ R^{V×e}, W_st ∈ R^{e×d}, W_dt ∈ R^{e×e}, and W_ct ∈ R^{e×2d} are weight matrices. The summary vector d_t can be calculated in different manners from the previous words y_0 to y_{t−1}. The first is a simple average:

d_t = (1/t) Σ_{i=0}^{t−1} y_i

The second uses an attention mechanism:

d_t = Σ_{i=0}^{t−1} α_{ti} y_i,    α_{ti} = exp(e_{ti}) / Σ_{k=0}^{t−1} exp(e_{tk}),    e_{ti} = v^⊤ tanh(W_y y_i)

where v ∈ R^{e} and W_y ∈ R^{e×e} are weight matrices.

A.3.2 Memory RNN
This model modifies the recurrent layer of the decoder as follows:

s_t = z_t ⊙ s̄_t + (1 − z_t) ⊙ s̃_t

where

s̃_t = tanh(W E y_{t−1} + U [r_t ⊙ s̄_t] + C c_t)
z_t = σ(W_z E y_{t−1} + U_z s̄_t + C_z c_t)
r_t = σ(W_r E y_{t−1} + U_r s̄_t + C_r c_t)

Here, E ∈ R^{e×V} is the embedding matrix for the target language, and W, W_z, W_r ∈ R^{d×e}, U, U_z, U_r ∈ R^{d×d}, and C, C_z, C_r ∈ R^{d×2d} are weight matrices. The memory vector s̄_t is calculated with an attention mechanism over the previous hidden states:

s̄_t = Σ_{i=1}^{t−1} β_{ti} s_i,    β_{ti} = exp(e_{ti}) / Σ_{k=1}^{t−1} exp(e_{tk}),    e_{ti} = v^⊤ tanh(W_m s_i + W_s s_{t−1})

where v ∈ R^{d}, W_m ∈ R^{d×d}, and W_s ∈ R^{d×d} are weight matrices.

A.3.3 Self-Attentive RNN
The formulation of this decoder is as follows:

p(y_t | y_1, ..., y_{t−1}, c_t) ≈ softmax(W_o tanh(W_st s_t + W_yt y_{t−1} + W_ct c_t + W_mt s̄_t))

Here, W_o ∈ R^{V×e}, W_st ∈ R^{e×d}, W_yt ∈ R^{e×e}, W_ct ∈ R^{e×2d}, and W_mt ∈ R^{e×d} are weight matrices. The vector s̄_t is calculated with an attention mechanism over the previous hidden states, as in A.3.2:

s̄_t = Σ_{i=1}^{t−1} β_{ti} s_i,    e_{ti} = v^⊤ tanh(W_m s_i + W_s s_t)

where v ∈ R^{d}, W_m ∈ R^{d×d}, and W_s ∈ R^{d×d} are weight matrices.