Recurrent Positional Embedding for Neural Machine Translation

In the Transformer network architecture, positional embeddings are used to encode order dependencies into the input representation. However, this input representation only involves static order dependencies based on discrete numerical information; that is, they are independent of word content. To address this issue, this work proposes a recurrent positional embedding approach based on word vectors. In this approach, recurrent positional embeddings are learned by a recurrent neural network, encoding word-content-based order dependencies into the input representation. They are then integrated into the existing multi-head self-attention model as independent heads or as part of each head. The experimental results revealed that the proposed approach improved translation performance over the state-of-the-art Transformer baseline on the WMT'14 English-to-German and NIST Chinese-to-English translation tasks.


Introduction
Transformer translation systems (Vaswani et al., 2017), without recurrent and convolutional neural networks, rely on a positional embedding (PE) approach to encode order information into the input representation. A PE is typically learned from the position index of each word and is added to the corresponding word embedding. This allows the Transformer to encode order dependencies between words in addition to the words themselves. Finally, the Transformer uses these combined vectors as the input to self-attention networks (SANs), achieving state-of-the-art translation performance on several language pairs (Vaswani et al., 2017; Dou et al., 2018; Marie et al., 2018, 2019). In spite of this success, the input representation only involves static order dependencies based on discrete numerical information. That is, any word in the entire vocabulary has the same PE at the same position index. As a result, the dependencies encoded by the original PEs are independent of word content, which may hinder further improvement of translation quality. Recently, Chen et al. (2018) and Hao et al. (2019) introduced an additional source representation learned by an RNN-based encoder into the Transformer to alleviate this issue, and reported improvements on the WMT'14 English-to-German translation task.
Inspired by their work (Chen et al., 2018; Hao et al., 2019), we propose a simple and efficient recurrent positional embedding (RPE) approach to capture order dependencies based on word content in a sentence, thus learning a more effective sentence representation for the Transformer. In addition, we designed two simple multi-head self-attention variants to introduce the learned RPEs and the original input representation into the existing Transformer model, enhancing its sentence representation.
Experimental results on WMT'14 English-to-German and NIST Chinese-to-English translation tasks show that our models significantly improved translation performance over a strong Transformer baseline.

Input Representation
In the Transformer network architecture (Vaswani et al., 2017), given a sentence of length J, the positional embedding of each word is first computed based on its position:

pe_{j,2i} = sin(j / 10000^{2i/d_model}),  pe_{j,2i+1} = cos(j / 10000^{2i/d_model}),  (1)

where j is the word's numerical position index in the sentence and i is the dimension of the position index. The word embedding x_j is then added to pe_j to obtain a combined embedding v_j:

v_j = x_j + pe_j.  (2)

As a result, the sequence of vectors {v_1, ..., v_J} serves as the input to the encoder or decoder of the Transformer to learn the source or target sentence representations.
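As a concrete illustration, Eqs. (1)-(2) can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's code; the helper name `sinusoidal_pe` and the toy dimensions are our own.

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Fixed sinusoidal positional embeddings (Vaswani et al., 2017):
    pe[j, 2i]   = sin(j / 10000^(2i/d_model)),
    pe[j, 2i+1] = cos(j / 10000^(2i/d_model))."""
    pe = np.zeros((max_len, d_model))
    pos = np.arange(max_len)[:, None]                        # position j
    div = np.power(10000.0, np.arange(0, d_model, 2) / d_model)
    pe[:, 0::2] = np.sin(pos / div)
    pe[:, 1::2] = np.cos(pos / div)
    return pe

# Eq. (2): the combined input v_j = x_j + pe_j
J, d_model = 5, 8
x = np.random.randn(J, d_model)      # toy word embeddings
v = x + sinusoidal_pe(J, d_model)    # input to the encoder/decoder
```

Note that the PE table depends only on positions, never on the words themselves, which is exactly the static behavior the paper sets out to address.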

Self-Attention Mechanism
Following the input layer, the self-attention layer is used to learn the sentence representation for the Transformer (Vaswani et al., 2017). The combined vectors {v_1, ..., v_J} are packed into a query matrix Q and the corresponding key and value matrices K and V. The output of the self-attention layer can be computed by Eq. (3):

Att(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.  (3)

Moreover, self-attention can be further refined into multi-head self-attention to jointly attend to information from different representation subspaces at different positions. Specifically, Q, K, and V are linearly projected H times with different, learned linear projections to d_k and d_v dimensions, respectively. On each of these projected versions of queries, keys, and values, the attention function is performed in parallel, yielding d_v-dimensional output values. Taking a single head as an example, the output of the h-th head O_h is computed by Eq. (4):

O_h = Att(Q W_h^Q, K W_h^K, V W_h^V),  (4)

where the parameter matrices are W_h^Q, W_h^K in R^{d_model x d_k} and W_h^V in R^{d_model x d_v}. For example, if there are H = 8 heads and d_model is 512, then d_k = d_v = 512/8 = 64. Finally, the outputs of the H heads are concatenated to serve as the sentence representation S:

S = Concat(O_1, ..., O_H).  (5)
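The computation in Eqs. (3)-(5) can be sketched as follows; the function names, the per-head weight lists, and the toy sizes are our own, and a production implementation would use a single fused projection per Q/K/V.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (3)."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_self_attention(v, W_q, W_k, W_v):
    """v: (J, d_model); W_q/W_k/W_v: lists of H projection matrices of
    shape (d_model, d_k). Each head computes Eq. (4); the H outputs are
    concatenated into the sentence representation S, Eq. (5)."""
    heads = [attention(v @ wq, v @ wk, v @ wv)
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(0)
J, d_model, H = 4, 16, 4
d_k = d_model // H                      # 16/4 = 4 dims per head
v = rng.standard_normal((J, d_model))
W_q = [0.1 * rng.standard_normal((d_model, d_k)) for _ in range(H)]
W_k = [0.1 * rng.standard_normal((d_model, d_k)) for _ in range(H)]
W_v = [0.1 * rng.standard_normal((d_model, d_k)) for _ in range(H)]
S = multi_head_self_attention(v, W_q, W_k, W_v)   # (J, d_model)
```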

Recurrent Positional Embedding
We propose a recurrent positional embedding approach based on part of each word embedding, instead of the numerical indices of words, to capture order dependencies between words in a sentence. Specifically, the embedding of each word x_j is divided into two parts, x^p_j and x^r_j, whose dimensions are d_p and d_r (d_model = d_p + d_r), respectively. As a result, the sequence of word vectors {x_1, ..., x_J} is split into {x^p_1, ..., x^p_J} and {x^r_1, ..., x^r_J}. An RNN with a nonlinear projection layer is then designed to learn a recurrent state over {x^r_1, ..., x^r_J} and a recurrent positional embedding r_j for each word:

h_j = RNN(h_{j-1}, x^r_j),  r_j = tanh(W_r h_j + b_r),  (6)

where W_r in R^{d_r x d_r} is a parameter matrix and b_r in R^{d_r} is a bias term. Note that x^r_j is derived from part of the word embedding x_j. Finally, the resulting sequence R = {r_1, ..., r_J} is referred to as the recurrent positional embeddings (RPEs). In this work, a bidirectional RNN and a forward RNN (Bahdanau et al., 2015) are used to learn the source RPEs and target RPEs, respectively. Note that the RNN can also be replaced by other recurrent networks for learning order-dependency information, such as the GRU (Cho et al., 2014) and the SRU (Li et al., 2018a).
In addition, the other sub-sequence {x^p_1, ..., x^p_J} is used to obtain the reduced-dimension input representation P = {p_1, ..., p_J} following Section 2.1. R and P together serve as the input to the encoder (or decoder) to learn a more effective source (or target) representation for the Transformer.
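The split-and-encode step above can be sketched as follows. This is a simplified illustration: the paper does not specify which dimensions form x^r or the exact recurrent cell, so the slicing convention and the toy cell below are our assumptions, and we omit the positional-embedding step of Section 2.1 that P additionally receives.

```python
import numpy as np

def split_and_rpe(x, d_r, W_r, b_r, rnn_step):
    """Split each word vector x_j into x^p_j (first d_model - d_r dims)
    and x^r_j (last d_r dims), run a recurrent cell over the x^r
    sequence, and apply the tanh projection of Eq. (6) to obtain the
    RPEs r_1..r_J. The dimension split and the cell are assumptions."""
    d_model = x.shape[1]
    x_p, x_r = x[:, : d_model - d_r], x[:, d_model - d_r:]
    h = np.zeros(d_r)
    rpes = []
    for t in range(x.shape[0]):
        h = rnn_step(h, x_r[t])                # recurrent state over x^r
        rpes.append(np.tanh(W_r @ h + b_r))    # nonlinear projection
    return x_p, np.stack(rpes)

rng = np.random.default_rng(0)
J, d_model, d_r = 5, 8, 4
x = rng.standard_normal((J, d_model))
W_r = 0.1 * rng.standard_normal((d_r, d_r))
b_r = np.zeros(d_r)
cell = lambda h, xr: np.tanh(0.5 * h + 0.5 * xr)   # toy vanilla-RNN cell
P, R = split_and_rpe(x, d_r, W_r, b_r, cell)       # P: (J, 4), R: (J, 4)
```

Unlike the sinusoidal table of Eq. (1), R depends on the word content of the sentence, which is the central point of the approach.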

Neural Machine Translation with RPE
To make use of the learned RPEs, we propose two simple methods: RPE head (RPEHead) self-attention and mixed positional representation head (MPRHead) self-attention. Both RPEHead and MPRHead utilize RPEs to learn the sentence representation for the Transformer.

RPEHead Self-Attention
For RPEHead self-attention, the learned RPEs are integrated into multi-head self-attention (Fig. 1(a)) as several independent heads to learn the sentence representation, as shown in Fig. 1(b).
To perform the attention function in Eq. (3), the reduced-dimension input representation P and the RPE sequence R are first concatenated into a combined representation T, whose two parts have dimensions d_model - d_r and d_r from the process of learning RPEs, respectively. This guarantees that there are two types of heads: one type contains only RPEs, and the other contains only the original reduced-dimension input representation. Second, T is mapped to a new query matrix Q_T and the corresponding key and value matrices K_T and V_T. According to Eq. (4), the output of each head is computed by Eq. (7):

O^T_h = Att(Q_T W_h^Q, K_T W_h^K, V_T W_h^V).  (7)

Therefore, the final sentence representation S_T is formally represented as:

S_T = Concat(O^T_1, ..., O^T_H).  (8)
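The head partitioning in RPEHead can be sketched as follows; this is an illustration under the assumption that P comes first in the concatenation and that both parts are multiples of the head size, which is what makes some heads RPE-only.

```python
import numpy as np

def rpehead_concat(P, R, H):
    """RPEHead input: T = [P : R]. When d_p and d_r are both multiples
    of the head size d_model/H, splitting T into H heads yields heads
    that contain either only the reduced input representation or only
    RPEs, i.e. the RPEs act as independent heads."""
    T = np.concatenate([P, R], axis=-1)
    d_model = T.shape[-1]
    assert d_model % H == 0, "d_model must divide evenly into H heads"
    d_head = d_model // H
    assert P.shape[-1] % d_head == 0 and R.shape[-1] % d_head == 0
    heads = np.split(T, H, axis=-1)
    n_input_heads = P.shape[-1] // d_head   # remaining heads hold RPEs
    return T, heads, n_input_heads

# Paper setting: d_model=512, H=8 (64-dim heads) and d_r=320 would give
# 192/64 = 3 input-only heads and 320/64 = 5 RPE-only heads.
```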

MPRHead Self-Attention
Compared with the RPEHead model, the MPRHead model applies the multi-head mechanism to the RPEs. In other words, RPEs are encoded into the sentence representation from different vector sub-spaces, as shown in Fig 1(c).
To this end, each vector of P is divided into H heads {p^1_j, ..., p^H_j}. Similarly, each vector of R is divided into H heads {r^1_j, ..., r^H_j}. The corresponding heads of p_j and r_j are then concatenated in turn as a combined sequence pr_j:

pr_j = {[p^1_j : r^1_j], ..., [p^H_j : r^H_j]},  (9)

where ":" is the concatenation operation. All heads in pr_j are further concatenated in turn as a mixed embedding m_j in R^{d_model}. As a result, there is a new sequence of mixed embeddings:

M = {m_1, ..., m_J}.  (10)

M is mapped to a query matrix Q_M and the corresponding key and value matrices K_M and V_M.
According to Eq. (4) and Eq. (5), the final sentence representation S_M is represented as:

S_M = Concat(O^M_1, ..., O^M_H).  (11)

Note that for both the RPEHead and MPRHead models, the RPEs are jointly learned with the existing Transformer architecture.
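The per-head interleaving of Eqs. (9)-(10) can be sketched as follows (a minimal sketch with toy inputs; the function name is ours):

```python
import numpy as np

def mprhead_mix(P, R, H):
    """MPRHead input, Eqs. (9)-(10): split P and R into H per-head
    slices and interleave them, so that every head of the mixed
    embedding m_j contains both an input-representation slice and an
    RPE slice."""
    p_heads = np.split(P, H, axis=-1)   # {p^1_j, ..., p^H_j}
    r_heads = np.split(R, H, axis=-1)   # {r^1_j, ..., r^H_j}
    mixed = []
    for ph, rh in zip(p_heads, r_heads):
        mixed.extend([ph, rh])          # [p^h_j : r^h_j], in turn
    return np.concatenate(mixed, axis=-1)   # (J, d_p + d_r) = (J, d_model)

# With H=2, m_j = [p^1_j : r^1_j : p^2_j : r^2_j]:
M = mprhead_mix(np.array([[1., 2., 3., 4.]]), np.array([[5., 6.]]), H=2)
# M is [[1, 2, 5, 3, 4, 6]]
```

This contrasts with RPEHead, where the RPE dimensions stay together and occupy whole heads on their own.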

Experimental Setup
The proposed methods were evaluated on the WMT'14 English-to-German (EN-DE) and NIST Chinese-to-English (ZH-EN) translation tasks.
The ZH-EN training set includes 1.28 million bilingual sentence pairs from the LDC corpora, where the NIST06 and the NIST02/NIST03/NIST04 data sets were used as the development and test sets, respectively. The EN-DE training set includes 4.43 million bilingual sentence pairs of the WMT'14 corpora, where the newstest2013 and newstest2014 data sets were used as the development and test sets, respectively.
BPE (Sennrich et al., 2016) was adopted and the vocabulary size was set to 32K. The dimension of all input and output layers was set to 512, and that of the inner feed-forward neural network layer was set to 2048. The total number of heads in all multi-head modules was set to 8 in both the encoder and decoder layers. Each training batch contained a set of sentence pairs with approximately 4096*4 source tokens and 4096*4 target tokens. For settings not mentioned here, we followed Vaswani et al. (2017).
Baseline systems included a vanilla Transformer (Vaswani et al., 2017), Relative PEs (Shaw et al., 2018), and directional SAN (DiSAN) (Shen et al., 2018). We reimplemented the baseline Transformer, Relative PEs, and DiSAN models on the OpenNMT toolkit (Klein et al., 2017). All models were trained for 200k batches and evaluated on a single V100 GPU. The multi-bleu.perl script was used to obtain case-sensitive 4-gram BLEU scores for the EN-DE and ZH-EN tasks.

Table 1: Results for the EN-DE translation task. The mark "*" after a score indicates that the model was significantly better than the baseline Transformer (base or big) at the significance level p < 0.01 (Collins et al., 2005).

Effect of RPEs
In this work, we extracted d_r dimensions of each word vector to learn the recurrent embeddings. To explore the relation between d_r and translation performance, Figure 2 shows the translation performance for different values of d_r. For +RPEHead (or +MPRHead), BLEU scores gradually increase as d_r grows, but begin to decrease when d_r exceeds 320 (or 256). In particular, +RPEHead and +MPRHead achieve their highest BLEU scores at d_r = 320 and d_r = 256, respectively. This means that the original partial input representation and our RPEs complement each other to improve translation performance.

Main Results
According to the results in Fig. 2, d_r of RPEHead is set to 320 and d_r of MPRHead is set to 256. The main translation results are shown in Table 1. 1) For the proposed methods, both RPEHead (base) and MPRHead (base) outperformed Transformer (base), and in particular were better than +RPE and +DiSAN. This indicates that the learned RPEs are beneficial for the Transformer system.
2) Moreover, +MPRHead (base) performed better than RPEHead/RPE (base). The reason may be that adding RPEs to different heads encodes word-content-based order dependencies from different vector sub-spaces, which is one of the advantages of the multi-head mechanism. In particular, +MPRHead (base/big) is slightly better than Transformer (base/big)+BiARN, which suggests that RPEs improve the performance of the Transformer more effectively.
3) MPRHead (big) significantly outperformed Transformer (big). MPRHead (base) achieved performance comparable to that of Transformer (big), which contains approximately three times as many parameters, indicating that the proposed RPE is efficient.
In addition, Table 2 shows that the proposed models gave similar improvements over the baseline system and the compared methods on the NIST ZH-EN task. These results indicate that our approach is a universal method for improving the translation of other language pairs.

Table 2: Results for the ZH-EN translation task. The mark "*" after a score indicates that the model was significantly better than the baseline Transformer (base or big) at the significance level p < 0.01 (Collins et al., 2005).

Ablation Experiments
To further explore the effect of position information, PEs and RPEs were applied on the encoder (Enc) and decoder (Dec) sides, respectively.

Table 3: Ablation experiments on position information. "#Speed" denotes the training speed (tokens/second).
1) For Enc/Dec/Enc&Dec, the proposed RPE-Head/MPRHead outperformed the Transformer in corresponding ablation settings, indicating that our methods worked well on both the encoder and decoder.
2) For PE/RPEs, performance became slightly lower when PE was removed from the decoder. However, when PE was removed from the encoder, the translation performance drastically decreased. We think that the source sentence representation is more sensitive to position information than the target sentence representation. It is possible that each hidden state in the decoder takes the previous hidden state into consideration, which would be somewhat similar to the proposed RPEs. In contrast, the encoder would not be expected to contain this mechanism. Therefore, the position information would be more important in the encoder than the decoder.
3) The training speeds of the proposed RPEHead (base) and MPRHead (base) were only slightly slower than that of the vanilla Transformer (base), because they introduce almost no additional parameters (less than 2%).

Conclusion and Future Work
In this paper, we presented a recurrent positional embedding approach to capture word-content-based order dependencies in a sentence. Empirical results show that this method can efficiently improve the performance of NMT. In future work, we plan to extend this method to unsupervised NMT and other natural language processing tasks, such as dependency parsing (Li et al., 2018b) and reading comprehension (Zhang et al., 2018b).