Neural Machine Translation with Reordering Embeddings

The reordering model plays an important role in phrase-based statistical machine translation. However, there are few works that exploit the reordering information in neural machine translation. In this paper, we propose a reordering mechanism to learn the reordering embedding of a word based on its contextual information. These learned reordering embeddings are stacked together with self-attention networks to learn sentence representation for machine translation. The reordering mechanism can be easily integrated into both the encoder and the decoder in the Transformer translation system. Experimental results on WMT’14 English-to-German, NIST Chinese-to-English, and WAT Japanese-to-English translation tasks demonstrate that the proposed methods can significantly improve the performance of the Transformer.


Introduction
The reordering model plays an important role in phrase-based statistical machine translation (PBSMT), especially for translation between distant language pairs with large differences in word order, such as Chinese-to-English and Japanese-to-English (Galley and Manning, 2008; Goto et al., 2013). Typically, PBSMT learns large-scale reordering rules from parallel bilingual sentence pairs in advance to form a reordering model. This reordering model is then integrated into the translation decoding process to ensure a reasonable order of translations of the source words (Chiang, 2005; Xiong et al., 2006; Galley and Manning, 2008). In contrast to the explicit reordering model for PBSMT, RNN-based NMT (Sutskever et al., 2014; Bahdanau et al., 2015) depends on neural networks to implicitly encode order dependencies between words in a sentence to generate a fluent translation. Inspired by a distortion method originating in SMT (Brown et al., 1993; Koehn et al., 2003; Al-Onaizan and Papineni, 2006), a recent preliminary study explored reordering for NMT. It distorted the existing content-based attention with an additional position-based attention inside a fixed-size window, and reported a considerable improvement over classical RNN-based NMT. This suggests that word reordering information is also beneficial to NMT.
The Transformer (Vaswani et al., 2017) translation system relies on self-attention networks (SANs), and has attracted growing interest in the machine translation community.
The Transformer generates an ordered sequence of positional embeddings by a positional encoding mechanism (Gehring et al., 2017a) to explicitly encode the order of dependencies between words in a sentence.
The Transformer performs (multi-head) and stacks (multi-layer) SANs in parallel to learn the sentence representation used to predict translations, and has delivered state-of-the-art performance on various translation tasks (Bojar et al., 2018; Marie et al., 2018). However, these positional embeddings focus on sequentially encoding order relations between words, and do not explicitly consider reordering information in a sentence, which may degrade the performance of Transformer translation systems. To date, the reordering problem in NMT has not been studied extensively, especially in the Transformer.
In this paper, we propose a reordering mechanism for the Transformer translation system. We dynamically penalize the given positional embedding of a word depending on its contextual information, thus generating a reordering embedding for each word. The reordering mechanism is then stacked together with the existing SANs to learn the final sentence representation with word reordering information.
The proposed method can be easily integrated into both the encoder and the decoder in the Transformer. Experimental results on the WMT14 English-to-German, NIST Chinese-to-English, and WAT ASPEC Japanese-to-English translation tasks verify the effectiveness and universality of the proposed approach. This paper primarily makes the following contributions:
• We propose a reordering mechanism to learn the reordering embedding of a word based on its contextual information. These learned reordering embeddings are added to the sentence representation to achieve the reordering of words. To the best of our knowledge, this is the first work to introduce reordering information to the Transformer translation system.
• The proposed reordering mechanism can be easily integrated into the Transformer to learn a reordering-aware sentence representation for machine translation. The proposed translation models outperform state-of-the-art NMT baseline systems with a similar number of parameters, and achieve results comparable to those of NMT systems with many more parameters.
Related Work

Reordering Model for PBSMT
In PBSMT, there has been a substantial amount of research on the reordering model, which is a key component for ensuring the generation of fluent target translations. Bisazza and Federico (2016) divided these reordering models into four groups:
Phrase orientation models (Tillman, 2004; Collins et al., 2005; Nagata et al., 2006; Zens and Ney, 2006; Galley and Manning, 2008; Cherry, 2013), simply known as lexicalized reordering models, predict whether the next translated source span should be placed on the right (monotone), the left (swap), or anywhere else (discontinuous) of the last translated one.
Jump models (Al-Onaizan and Papineni, 2006; Green et al., 2010) predict the direction and length of the jump that is performed between consecutively translated words or phrases, with the goal of better handling long-range reordering.
Source decoding sequence models (Feng et al., 2010, 2013) address this issue by directly modeling the reordered sequence of input words, as opposed to the reordering operations that generated it.
Operation sequence models are n-gram models that include lexical translation operations and reordering operations in a single generative story, thereby combining elements from the previous three model families (Durrani et al., 2011, 2013, 2014). These methods were further extended with source syntax information (Chen et al., 2017c, 2018b) to improve the performance of SMT.
Moreover, to address the data sparsity (Guta et al., 2015) caused by a large number of reordering rules, Li et al. (2013, 2014) modeled ITG-based reordering rules in translation by using neural networks. In particular, NN-based reordering models can capture not only semantic similarity but also ITG reordering constraints (Wu, 1996, 1997) in the translation context. This neural network modeling method was further applied to capture reordering information and syntactic coherence.

Modeling Ordering for NMT
Attention-based NMT relies on the neural networks themselves to implicitly capture order dependencies between words (Sutskever et al., 2014; Bahdanau et al., 2015; Wang et al., 2017a,b, 2018). The coverage model can partially model word order information (Tu et al., 2016; Mi et al., 2016). Inspired by a distortion method (Brown et al., 1993; Koehn et al., 2003; Al-Onaizan and Papineni, 2006) originating in SMT, an additional position-based attention was proposed to enable the existing content-based attention to attend to the source words with regard to both the semantic requirement and a word reordering penalty.
Pre-reordering, a pre-processing step that makes the source-side word order close to that of the target side, has proven very helpful for improving translation quality in SMT. Moreover, neural networks have been used to pre-reorder the source-side words so that their order is close to that of the target side (Du and Way, 2017; Zhao et al., 2018b; Kawara et al., 2018), and the pre-reordered sentences were then input to an existing RNN-based NMT system to improve translation performance. However, Du and Way (2017) and Kawara et al. (2018) reported that the pre-reordering method had a negative impact on NMT for the ASPEC JA-EN translation task. In particular, Kawara et al. (2018) assumed that one reason is the isolation between the pre-ordering and NMT models, where both models are trained using independent optimization functions.
In addition, several studies have explicitly introduced syntax structure into RNN-based NMT to encode syntactic ordering dependencies into sentence representations (Eriguchi et al., 2016; Li et al., 2017; Chen et al., 2017a,b; Wang et al., 2017b; Chen et al., 2018a). Recently, the neural Transformer translation system (Vaswani et al., 2017), which relies solely on self-attention networks, used a fixed-order sequence of positional embeddings to encode order dependencies between words in a sentence.

Positional Encoding Mechanism
The Transformer (Vaswani et al., 2017) typically uses a positional encoding mechanism to encode order dependencies between words in a sentence. Formally, given an embedding sequence of a source sentence of length $J$, $X = \{x_1, \cdots, x_J\}$, the positional embedding is computed based on the position of each word by Eq. (1):
$$pe_{(j,2i)} = \sin\left(j / 10000^{2i/d_{model}}\right), \qquad pe_{(j,2i+1)} = \cos\left(j / 10000^{2i/d_{model}}\right), \tag{1}$$
where $j$ is the word's position index in the sentence, $i$ indexes the dimensions of the positional embedding, and $d_{model}$ is the embedding dimension. As a result, there is a sequence of positional embeddings:
$$PE = \{pe_1, \cdots, pe_J\}. \tag{2}$$
Each $pe_j$ is then added to the corresponding word embedding $x_j$ to form a combined embedding $v_j$:
$$v_j = x_j + pe_j. \tag{3}$$
Finally, the sequence of embeddings $\{v_1, \cdots, v_J\}$ is the initialized sentence representation $H^0$. Later, $H^0$ is input to the self-attention layer to learn the sentence representation.
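As an illustration of the positional encoding described above, the following minimal NumPy sketch builds the sinusoidal positional embeddings and adds them to the word embeddings. The function names `positional_embeddings` and `initial_representation` are hypothetical, positions are counted from 1 to match the notation $x_1, \cdots, x_J$ (many implementations count from 0), and an even embedding dimension is assumed.

```python
import numpy as np

def positional_embeddings(J, d_model):
    """Return a (J, d_model) matrix of sinusoidal positional embeddings."""
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    # Assumption: positions run from 1 to J, following the x_1..x_J notation.
    positions = np.arange(1, J + 1, dtype=float)[:, None]   # shape (J, 1)
    i = np.arange(d_model // 2, dtype=float)[None, :]       # shape (1, d_model/2)
    angles = positions / np.power(10000.0, 2.0 * i / d_model)
    pe = np.zeros((J, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)   # odd dimensions 2i+1
    return pe

def initial_representation(X):
    """Add positional embeddings to word embeddings X of shape (J, d_model)."""
    J, d_model = X.shape
    return X + positional_embeddings(J, d_model)
```

For example, `initial_representation(np.zeros((5, 8)))` returns the positional embeddings themselves, since the word embeddings contribute nothing in that case.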

Self-Attention Mechanism
Following the positional embedding layer, the self-attention mechanism is used to learn a sentence representation over the $H^0$ obtained in the previous section. In the Transformer architecture, the self-attention mechanism is a stack of $N$ identical layers. Each identical layer consists of two sub-layers: a self-attention network and a position-wise fully connected feed-forward network. A residual connection (He et al., 2016) is employed around each of the two sub-layers, followed by layer normalization (Ba et al., 2016). Formally, the stack that learns the final sentence representation is organized as follows:
$$\left[\begin{array}{l} \overline{H}^n = LN(SelfAtt^n(H^{n-1}) + H^{n-1}), \\ H^n = LN(FFN^n(\overline{H}^n) + \overline{H}^n) \end{array}\right]_N, \tag{4}$$
where $SelfAtt^n(\cdot)$, $LN(\cdot)$, and $FFN^n(\cdot)$ are the self-attention network, layer normalization, and feed-forward network of the n-th identical layer, respectively, $\overline{H}^n$ is the intermediate representation produced by the self-attention sub-layer, and $[\cdots]_N$ denotes the stack of $N$ identical layers. In the encoder and decoder of the Transformer, $SelfAtt^n(\cdot)$ computes attention over the output $H^{n-1}$ of the (n-1)-th layer:
$$SelfAtt^n(H^{n-1}) = softmax\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, \tag{5}$$
where $\{Q, K, V\}$ are the query, key, and value vectors that are transformed from the input representation $H^{n-1}$, and $d_k$ is the dimension of the query and key vectors. As a result, the output of the N-th layer, $H^N$, is the final sentence representation for machine translation.
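The layer structure described above can be sketched in NumPy as follows. This is a simplified, single-head illustration rather than the full multi-head Transformer: the parameter names (`Wq`, `Wk`, `Wv`, `W1`, `W2`), the random initialization, and the layer normalization without learned gain and bias are all assumptions made for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Simplified LN without learned gain/bias (assumption of this sketch).
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))   # subtract max for stability
    return e / e.sum(-1, keepdims=True)

def self_attention(H, Wq, Wk, Wv):
    """Scaled dot-product attention over H of shape (J, d_model)."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (J, J), each row sums to 1
    return weights @ V, weights

def feed_forward(x, W1, W2):
    # Position-wise feed-forward network with a ReLU in between.
    return np.maximum(0.0, x @ W1) @ W2

def transformer_layer(H_prev, p):
    """One identical layer: two sub-layers, each with residual + LN."""
    att, _ = self_attention(H_prev, p["Wq"], p["Wk"], p["Wv"])
    H_bar = layer_norm(att + H_prev)                     # intermediate state
    return layer_norm(feed_forward(H_bar, p["W1"], p["W2"]) + H_bar)

def init_layer(d_model, d_ff, rng):
    # Hypothetical initializer: small Gaussian weights.
    shapes = {"Wq": (d_model, d_model), "Wk": (d_model, d_model),
              "Wv": (d_model, d_model), "W1": (d_model, d_ff),
              "W2": (d_ff, d_model)}
    return {k: rng.normal(0.0, 0.1, size=s) for k, s in shapes.items()}
```

Stacking $N$ such layers amounts to calling `transformer_layer` repeatedly, feeding each output to the next layer.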

Reordering Mechanism
Intuitively, when a human translates a sentence, he or she often adjusts the word order based on the global meaning of the original sentence or its context, thus obtaining a synonymous sentence that is easier to understand and translate. The reordering of a given word therefore relies heavily on the global or contextual meaning of the sentence. Motivated by this, we use the word and the global contextual information of the sentence to obtain a Reordering Embedding for each word (as shown in Figure 1), thus modeling the above human reordering process. The reordering mechanism is then stacked with the SAN layer to learn a reordering-aware sentence representation.
Figure 1: Learning reordering embeddings for the n-th layer in the stack. The inputs shown are the output $H^{n-1}$ of the (n-1)-th layer in the stack and the original positional embeddings PE.

Reordering Embeddings
To capture reordering information, we first learn a positional penalty vector based on the given word and the global context of the sentence. The positional penalty vector is then used to penalize the given positional embedding of the word to generate a new reordering embedding. Finally, these reordering embeddings are added to the intermediate sentence representation to achieve the reordering of words. We divide the process into the following three steps:
Positional Penalty Vectors: The self-attention mechanism focuses on global dependencies between words to learn an intermediate sentence representation $\overline{H}^n$, which is regarded as the expected global context of the sentence as reordered by a human translator. Therefore, given a sentence of $J$ words, we use the output $H^{n-1}$ of the previous layer in the stack together with the new intermediate global context representation $\overline{H}^n$ to learn positional penalty vectors $PP^n$ for the n-th layer of the stack $[\cdots]_N$. Each element of $PP^n$ is a real value between zero and one.
Reordering Embeddings: $PP^n$ is used to penalize the original positional embeddings $PE$:
$$RE^n = PE \cdot PP^n, \tag{7}$$
where $\cdot$ denotes element-wise multiplication and $RE^n$ is called the reordering embedding (RE), because each element of $PE$ is multiplied by a probability between zero and one.
Achieving Reordering: The learned $RE^n$ is further added to $\overline{H}^n$ to achieve reordering operations for the current sentence hidden state $\overline{H}^n$:
$$C^n = LN(\overline{H}^n + RE^n), \tag{8}$$
where LN is layer normalization. As a result, there is a reordering-aware sentence hidden state representation $C^n$.
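The three steps above can be illustrated with a short NumPy sketch. The text specifies only that the penalty vectors are learned from the previous layer's output and the intermediate representation and that their elements lie between zero and one; the particular gate used here (a sigmoid over a tanh projection, with hypothetical parameters `W`, `W_bar`, and `V_bar`) is an assumption chosen to satisfy those constraints, not a confirmed reproduction of the method.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer_norm(x, eps=1e-6):
    # Simplified LN without learned gain/bias (assumption of this sketch).
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def positional_penalty(H_prev, H_bar, W, W_bar, V_bar):
    """Step 1: penalty vectors in (0, 1), shape (J, d_model).

    ASSUMPTION: the sigmoid(tanh(...)) gate and its parameters are
    illustrative; the source only fixes the inputs and the output range.
    """
    return sigmoid(np.tanh(H_prev @ W + H_bar @ W_bar) @ V_bar)

def reordering_embeddings(PE, PP):
    """Step 2: scale each element of the positional embeddings by its penalty."""
    return PE * PP

def reordering_aware_state(H_bar, RE):
    """Step 3: add the reordering embeddings and apply layer normalization."""
    return layer_norm(H_bar + RE)
```

Because every penalty lies strictly between zero and one, each element of the reordering embedding has the same sign as, and a smaller magnitude than, the corresponding positional embedding element.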

Stacking SANs with Reordering Embeddings
The original positional embeddings of a sentence allow the Transformer to avoid having to recurrently capture the order of dependencies between words, so that it relies entirely on the stacked SANs to learn sentence representations in parallel. The learned REs are similar to the original positional embeddings. This means that these learned reordering embeddings can also be easily stacked together with the existing SANs to learn the final reordering-aware sentence representation for machine translation. Following Eq. (4), stacking SANs with reordering embeddings is formalized as Eq. (9), where $H^0$ is the initialized sentence representation described in the Positional Encoding Mechanism section. Finally, there is a reordering-aware sentence representation $H^N$ for predicting translations.
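A possible end-to-end sketch of stacking SANs with reordering embeddings is given below in NumPy. It is a single-head, encoder-side simplification under several assumptions that go beyond the text: the sigmoid/tanh gate for the penalty vectors, the hypothetical parameter names, layer normalization without learned parameters, and taking the residual around the feed-forward sub-layer from the reordering-aware state.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_layer(d_model, d_ff, rng):
    # Hypothetical initializer: small Gaussian weights for every matrix.
    def mat(rows, cols):
        return rng.normal(0.0, 0.1, size=(rows, cols))
    p = {name: mat(d_model, d_model)
         for name in ("Wq", "Wk", "Wv", "W", "W_bar", "V_bar")}
    p["W1"], p["W2"] = mat(d_model, d_ff), mat(d_ff, d_model)
    return p

def layer_with_re(H_prev, PE, p):
    """One layer: self-attention, reordering embedding, feed-forward."""
    Q, K, V = H_prev @ p["Wq"], H_prev @ p["Wk"], H_prev @ p["Wv"]
    att = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V
    H_bar = layer_norm(att + H_prev)                 # intermediate state
    # ASSUMPTION: illustrative gate producing penalties in (0, 1).
    PP = sigmoid(np.tanh(H_prev @ p["W"] + H_bar @ p["W_bar"]) @ p["V_bar"])
    C = layer_norm(H_bar + PE * PP)                  # reordering-aware state
    ffn = np.maximum(0.0, C @ p["W1"]) @ p["W2"]
    # ASSUMPTION: the residual around the FFN sub-layer is taken from C.
    return layer_norm(ffn + C)

def encode(H0, PE, layers):
    """Apply the N stacked layers; the same PE is reused in every layer."""
    H = H0
    for p in layers:
        H = layer_with_re(H, PE, p)
    return H
```

Note that the original positional embeddings are passed unchanged to every layer, while the penalty applied to them is recomputed from that layer's own hidden states.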

Neural Machine Translation with Reordering Mechanism
Based on the proposed approach to learning sentence representation, we design three Transformer translation models: Encoder REs, Decoder REs, and Both REs, all of which enable reordering knowledge to improve the translation performance of the Transformer.
Encoder REs: The proposed reordering mechanism is only applied to the encoder of the Transformer to learn the representation of the source sentence, as shown in the Encoder of Figure 2.
Decoder REs: Similarly, the proposed reordering mechanism is only introduced into the SAN layer of the Transformer related to the representation of the target sentence, as shown in the Decoder of Figure 2.
Both REs: To further enhance translation performance, we simultaneously apply the proposed method to the source and target sentences to learn their sentence representations, as shown in Figure 2.
Note that the reordering model in PBSMT is an independent model and therefore needs to consider information concerning both the source and target. In NMT, the reordering embedding is jointly trained with the entire NMT model. Although it is only applied to the encoder (or decoder), it can still obtain information about the target (or source) from the decoder (or encoder) by neural network feedback. Therefore, the proposed reordering mechanism makes use of information concerning both the source and the target.

Datasets
The proposed method was evaluated on three tasks from the WMT14 English-to-German (EN-DE), NIST Chinese-to-English (ZH-EN), and WAT ASPEC Japanese-to-English (JA-EN) benchmarks.
1) For the EN-DE translation task, 4.43 million bilingual sentence pairs of the WMT14 dataset were used as training data, including Common Crawl, News Commentary, and Europarl v7. The newstest2013 and newstest2014 datasets were used as the dev set and test set, respectively.
2) For the ZH-EN translation task, the training dataset consisted of 1.28 million bilingual sentence pairs from LDC corpus consisting of LDC2002E18, LDC2003E07, LDC2003E14, and Hansard's portions of LDC2004T07, LDC2004T08, and LDC2005T06. The MT06 and the MT02/MT03/MT04/MT05/MT08 datasets were used as the dev set and test set, respectively.
3) For the JA-EN translation task, the training dataset consisted of two million bilingual sentence pairs from the ASPEC corpus (Nakazawa et al., 2016). The dev set consisted of 1,790 sentence pairs and the test set of 1,812 sentence pairs.

Baseline Systems
These baseline systems included: Transformer: a vanilla Transformer with absolute positional embedding (Vaswani et al., 2017), for example Transformer (base) and Transformer (big) models.
Relative PE (Shaw et al., 2018): incorporates relative positional embeddings into the self-attention mechanism of Transformer.
Additional PE (control experiment): uses original absolute positional embeddings to enhance the position information of each SAN layer instead of the proposed reordering embeddings.
Pre-reordering: a pre-ordering method (Goto et al., 2013) for the JA-EN translation task was used to adjust the order of Japanese words in the training, dev, and test datasets, so that each source sentence was reordered into an order similar to that of its target sentence.

System Setting
Table 1: Comparison with existing NMT systems on the WMT14 EN-DE translation task. "#Speed1" and "#Speed2" denote the training and decoding speed measured in source tokens per second, respectively. In Tables 1, 2, and 3, "++/+" after a score indicates that the proposed method was significantly better than the corresponding baseline Transformer (base or big) at significance level p<0.01/0.05.
For all models (base), the byte pair encoding algorithm (Sennrich et al., 2016) was adopted and the size of the vocabulary was set to 32,000. The number of dimensions of all input and output layers was set to 512, and that of the inner feed-forward neural network layer was set to 2048. The number of heads of all multi-head modules was set to eight in both encoder and decoder layers. In each training batch, a set of sentence pairs contained approximately 4096×4 source tokens and 4096×4 target tokens. During training, the value of label smoothing was set to 0.1, and the attention dropout and residual dropout were p = 0.1. The Adam optimizer (Kingma and Ba, 2014) was used to tune the parameters of the model. The learning rate was varied under a warm-up strategy with 8,000 warm-up steps. For evaluation, we validated the model on the dev set at intervals of 1,000 batches. After training for 200,000 batches, the model with the highest BLEU score on the dev set was selected for evaluation on the test sets. During decoding, the beam size was set to four. All models were trained and evaluated on a single P100 GPU. SacreBLEU (Post, 2018) was used as the evaluation metric for EN-DE, and multi-bleu.perl was used as the evaluation metric for the ZH-EN and JA-EN tasks. The sign test (Collins et al., 2005) was used as the statistical significance test.
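The warm-up strategy mentioned above is not spelled out in the text. A common choice, assumed here, is the inverse-square-root schedule of Vaswani et al. (2017), which increases the learning rate linearly during warm-up and decays it afterwards. The sketch below uses the stated model dimension (512) and warm-up steps (8,000); the function name `learning_rate` is hypothetical.

```python
def learning_rate(step, d_model=512, warmup_steps=8000):
    """Learning rate at a given training step (1-based).

    ASSUMPTION: inverse-square-root schedule from Vaswani et al. (2017);
    the source states only that 8,000 warm-up steps were used.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    # Linear growth until warmup_steps, then decay proportional to step^-0.5.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

Under this assumed schedule the rate peaks exactly at the end of warm-up, is half the peak at step 4,000, and falls back to half the peak at step 32,000.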

Main Results
To validate the effectiveness of our methods, the proposed models were first evaluated on the WMT14 EN-DE translation task, as in the original Transformer translation system (Vaswani et al., 2017). The main translation results are shown in Table 1. We made the following observations: 1) The baseline Transformer (base) in this work outperformed GNMT, CONVS2S, and Transformer (base)+Relative PEs, and achieved performance comparable to the original Transformer (base). This indicates that it is a strong baseline NMT system.
2) The three proposed models significantly outperformed the baseline Transformer (base). This indicates that the learned reordering embeddings were beneficial for the Transformer. Meanwhile, our models outperformed the comparison system +Additional PEs (control experiment), which means that these improvements in translation derived from the learned REs instead of the original PEs. +Encoder REs and +Both REs were superior to +Relative PEs, which means that the REs better captured reordering information than +Relative PEs.
3) Of the proposed models, +Encoder REs performed slightly better than +Decoder REs. This indicates that the reordering information of the source sentence was slightly more useful than that of the target sentence. +Both REs, which combined reordering information for both source and target, further improved performance and was significantly better than +Encoder REs and +Decoder REs. This indicates that the reordering information of source and target can be used together to improve the predicted translation.
4) We also evaluated the best performing method (+Both REs) in the big Transformer model setting (Vaswani et al., 2017). Compared with Transformer (base), Transformer (big) contains approximately three times as many parameters and obtained an improvement of about one BLEU point. Transformer (big)+Both REs achieved a further improvement of 0.77 BLEU points.
5) The proposed models contain approximately 5%∼10% additional parameters and reduced training speed by 10%∼15%, compared to the corresponding baselines. Transformer (base)+Both REs achieved results comparable to those of Transformer (big), which has many more parameters. This indicates that the improvement of the proposed methods does not come from having more parameters.
6) In Table 3, +Pre-reordering performed worse than the baseline Transformer (base) on the WAT JA-EN translation task. We assume that the simple pre-reordering strategy has a negative impact on the translation performance of the NMT model, which is in line with the findings in (Du and Way, 2017; Kawara et al., 2018). Conversely, the proposed methods performed better than Transformer (base), and by an even larger margin than +Pre-reordering. This suggests that, because the pre-ordering operation is isolated from the NMT model, the generated pre-ordered data are not conducive to modeling source translation knowledge within the NMT framework.
In addition, Tables 2 and 3 show that the proposed models yielded similar improvements over the baseline system and the compared methods on the NIST ZH-EN and WAT JA-EN translation tasks. These results indicate that our method is also effective on the NIST ZH-EN and WAT JA-EN translation tasks. In other words, our approach is not specific to one language pair and can improve translation for other language pairs.

Effect of Reordering Embeddings
Unlike the reordering model in PBSMT, which can be illustrated explicitly, it is challenging to explicitly show the effect of the reordering embedding. To further analyze this effect, we simulated a scenario in which the word order of a sentence was partially incorrect and reordering was needed for NMT. We randomly swapped words of each source sentence in the test set according to different percentages of incorrectly swapped words in a sentence. For example, "10%" indicates that 10% of the words in each source sentence of the test set were randomly swapped. We evaluated Transformer (base) and +Both REs (base) on these test sets for the three translation tasks, and the results are shown in Figures 3, 4, and 5. 1) We observed that as the ratio of swapped words gradually increased, the performance of Transformer (base) and +Both REs (base) degraded significantly. This indicates that correct ordering information has an important effect on the Transformer system. 2) When the percentage of swapped words was less than 40%, the NMT systems still delivered reasonable performance.
The gap between +Both REs (base) and Transformer (base) was approximately 2-3 BLEU points. This indicates that +Both REs (base) dealt with this scenario better than the vanilla baseline. In other words, the learned REs retained part of the reordering information in a sentence.
3) When the percentage of swapped words was greater than 40%, Transformer (base) and +Both REs (base) yielded poor translation performance. We infer that excessive exchanges of word order may increase the ambiguity of the source sentence such that Transformer (base) and +Both REs (base) struggled to convert the original meaning of the source sentence into the target translation.

Case Analysis
For the first sample, +Both REs (base) [...] (base) translated the Chinese phrase into "continued reform efforts". Although both translations covered the meanings of the main words, the word order of the former is closer to natural English word order. For the second sample, Transformer (base) generated a puzzling translation, "the incident killed nine people", which reads like an English sentence in Chinese word order. In comparison, +Both REs (base) translated it into "nine people were killed in the incident", which is the same as the reference.
These two examples show that the proposed model with reordering embeddings was conducive to generating a translation in line with the target language word order.

Conclusion and Future Work
Word ordering is an important issue in translation. However, it has not been extensively studied in NMT. In this paper, we proposed a reordering mechanism to capture knowledge of reordering. A reordering embedding was learned by considering the relationship between the positional embedding of a word and that of the entire sentence. The proposed reordering embedding can be easily introduced to the existing Transformer translation system to predict translations. Experiments showed that our method can significantly improve the performance of Transformer.
In future work, we will further explore the effectiveness of the reordering mechanism and apply it to other natural language processing tasks, such as dependency parsing and semantic role labeling (Li et al., 2019).