Incorporating Word Reordering Knowledge into Attention-based Neural Machine Translation

This paper proposes three distortion models that explicitly incorporate word reordering knowledge into attention-based Neural Machine Translation (NMT) to further improve translation performance. Our proposed models enable the attention mechanism to attend to source words regarding both the semantic requirement and the word reordering penalty. Experiments on Chinese-English translation show that our approaches improve word alignment quality and achieve significant translation improvements over a basic attention-based NMT system by large margins. Compared with previous work on identical corpora, our system achieves the state-of-the-art translation quality.


Introduction
The word reordering model is one of the most crucial sub-components of Statistical Machine Translation (SMT) (Brown et al., 1993; Koehn et al., 2003; Chiang, 2005): it provides word reordering knowledge to ensure a reasonable translation order of source words. It is separately trained and then incorporated into the SMT framework in a pipeline style.
The attention mechanism evaluates the distribution of to-be-translated source words in a content-based addressing fashion (Graves et al., 2014), tending to attend to source words according to their content relation with the current translation status. The lack of explicit models to exploit word reordering knowledge may lead to attention faults and generate fluent but inaccurate or inadequate translations. Table 1 shows a translation instance and Figure 1 depicts the corresponding word alignment matrix produced by the attention mechanism. In this example, even though the word "zuixin (latest)" is a common adjective in Chinese and its following word should be translated soon in the Chinese-to-English direction, the word "yiju (evidence)" does not obtain appropriate attention, which leads to an incorrect translation.

src: youguan(related) baodao(report) shi(is) zhichi(support) tamen(their) lundian(arguments) de('s) zuixin(latest) yiju(evidence) .
ref: the report is the latest evidence that supports their arguments .
NMT: the report supports their perception of the latest .
count: zuixin yiju {0}

Table 1: An instance in the Chinese-English translation task. The row "count" represents the frequency of the word collocation in the training corpus. The collocation "zuixin yiju" does not appear in the training data.

Figure 1: The source word "yiju" does not obtain appropriate attention and its word sense is completely neglected.
To enhance the attention mechanism, implicit word reordering knowledge needs to be incorporated into attention-based NMT. In this paper, we introduce three distortion models that originate from SMT (Brown et al., 1993; Koehn et al., 2003; Och et al., 2004; Tillmann, 2004; Al-Onaizan and Papineni, 2006), modeling the word reordering knowledge as the probability distribution of the relative jump distance between the newly translated source word and the to-be-translated source word. Our focus is to extend the attention mechanism to attend to source words regarding both the semantic requirement and the word reordering penalty.
Our models have three merits:

1. Extended word reordering knowledge. Our models capture explicit word reordering knowledge to guide the attending process of the attention mechanism.
2. Convenient to incorporate into attention-based NMT. Our distortion models are differentiable and can be trained in the end-to-end style. The interpolation approach ensures that the proposed models can work in coordination with the original attention mechanism.
3. Flexible to utilize variant context for computing the word reordering penalty. In this paper, we exploit three categories of information as the distortion context condition, but other context information can also be utilized due to our model's flexibility.
We validate our models on the Chinese-English translation task and achieve notable improvements:

• On 16K vocabularies, NMT models are usually inferior to phrase-based SMT, but our model surpasses phrase-based Moses by 4.43 BLEU points on average and outperforms the attention-based NMT baseline system by 5.09 BLEU points.
• On 30K vocabularies, the improvements over phrase-based Moses and the attention-based NMT baseline system are 6.06 and 1.57 BLEU points on average, respectively.
• Compared with previous work on identical corpora, we achieve the state-of-the-art translation performance on average.
The word alignment quality evaluation shows that our model can effectively improve the word alignment quality, which is crucial for improving translation quality.

Background
We aim to capture word reordering knowledge for the attention-based NMT by incorporating distortion models. This section briefly introduces attention-based NMT and distortion models in SMT.

Attention-based Neural Machine Translation
Formally, given a source sentence x = x_1, ..., x_m and a target sentence y = y_1, ..., y_n, NMT models the translation probability as

P(y|x) = ∏_{t=1}^{n} P(y_t | y_{<t}, x),    (1)

where y_{<t} = y_1, ..., y_{t−1}. The generation probability of y_t is

P(y_t | y_{<t}, x) = g(y_{t−1}, c_t, s_t),    (2)

where g(·) is a softmax regression function, y_{t−1} is the newly translated target word, and s_t is the hidden state of the decoder, which represents the translation status.

Figure 2: The general architecture of our proposed models. The dashed line represents that variant context can be utilized to determine the word reordering penalty.
The attention c_t denotes the related source words for generating y_t and is computed as the weighted sum of the source representations h upon an alignment vector α_t:

c_t = ∑_{i=1}^{m} α_{t,i} h_i,   α_{t,i} = align(s_{t−1}, h_i),    (3)

where the align(·) function is a feed-forward network with softmax normalization.
The hidden state s_t is updated as

s_t = f(s_{t−1}, y_{t−1}, c_t),    (4)

where f(·) is a recurrent function. We adopt a varietal attention mechanism in our in-house RNNsearch model, which is implemented as

s̃_{t−1} = f_1(s_{t−1}, y_{t−1}),   s_t = f_2(s̃_{t−1}, c_t),    (5)

where f_1(·) and f_2(·) are recurrent functions. As shown in Eq.(3), the attention mechanism attends to source words in a content-based addressing way without considering any explicit word reordering knowledge. We introduce distortion models to capture explicit word reordering knowledge, enhancing the attention mechanism and improving translation quality.
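As a concrete sketch of the content-based addressing in Eq.(3), the following NumPy code scores each source annotation against the previous decoder state with an additive feed-forward scorer, normalizes the scores with softmax, and takes the weighted sum as the context. The parameter names W_a, U_a and v_a are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention(h, s_prev, W_a, U_a, v_a):
    """Content-based addressing: score each source annotation h_i
    against the previous decoder state, normalize, and return the
    alignment vector alpha_t and the context c_t."""
    # h: (m, d_h) source annotations; s_prev: (d_s,) decoder state
    scores = np.tanh(h @ W_a + s_prev @ U_a) @ v_a  # (m,) unnormalized energies
    alpha = softmax(scores)                         # alignment vector, sums to 1
    c = alpha @ h                                   # context vector, (d_h,)
    return alpha, c
```

Any feed-forward scorer with softmax normalization fits the align(·) role; the additive form above is one common choice.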

Distortion Models in SMT
In SMT, distortion models are linearly combined with other features as follows:

b̂ = argmax_b { λ_d d(b) + ∑_{r=1}^{R} λ_r h_r(b) },    (6)

where d(·) is the distortion feature, h_r(·) represents the other features, λ_d and λ_r are the corresponding weights, b is the latent variable that represents translation knowledge, and R is the number of features.
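The linear combination above can be sketched in a few lines of Python; the feature names and hypothesis labels below are hypothetical, chosen only to illustrate that the distortion feature enters the score as one more weighted term.

```python
def loglinear_score(features, weights):
    # score(b) = lambda_d * d(b) + sum_r lambda_r * h_r(b):
    # the distortion feature is just one more weighted feature
    return sum(weights[name] * value for name, value in features.items())

def best_hypothesis(hypotheses, weights):
    # pick the latent translation hypothesis b with the highest combined score
    return max(hypotheses, key=lambda b: loglinear_score(hypotheses[b], weights))
```

In a real SMT decoder the weights λ are tuned (e.g. by MERT) rather than fixed by hand; here they are given constants for illustration.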
IBM Models (Brown et al., 1993) depicted the word reordering knowledge as positional relations between source and target words. Koehn et al. (2003) proposed a distortion model for phrase-based SMT based on jump distances between the newly translated phrase and the to-be-translated phrase, without considering specific lexical information. Och et al. (2004) and Tillmann (2004) proposed orientation-based distortion models that consider translation orientations. Al-Onaizan and Papineni (2006) proposed a distortion model that estimates a probability distribution over possible relative jumps conditioned on source words.
These models were proposed for SMT and separately trained as sub-components. Inspired by this previous work, we introduce distortion models into the NMT model to capture word reordering knowledge. Our proposed models are designed for NMT and can be trained in the end-to-end style.

Distortion Models for attention-based NMT
The basic idea of our proposed distortion models is to estimate the probability distribution of the possible relative jump distances between the newly translated source word and the to-be-translated source word upon the context condition. Figure 2 shows the general architecture of our proposed model.

General Architecture
We employ an interpolation approach to incorporate distortion models into attention-based NMT:

α_t = λ d_t + (1 − λ) α̂_t,    (7)

Figure 3: Illustration of the shift actions of the alignment vector α_{t−1}. If α_t is the left shift of α_{t−1}, the translation orientation of the source sentence is backward; if α_t is the right shift of α_{t−1}, the translation orientation is forward.
where α_t is the ultimate alignment vector for computing the related source context c_t, d_t is the alignment vector calculated by the distortion model, α̂_t is the alignment vector computed by the basic attention mechanism, and λ is a hyper-parameter that controls the weight of the distortion model.
In the proposed distortion model, relative jumps on source words are depicted as "shift" actions of the alignment vector α_{t−1}, as shown in Figure 3. A right shift of α_{t−1} indicates that the translation orientation of the source words is forward, and a left shift indicates that it is backward. The extent of a shift action measures the word reordering distance. The alignment vector d_t produced by the distortion model is the expectation over all possible shifts of α_{t−1} conditioned on certain context. Formally, the proposed distortion model is

d_t = ∑_{k=−l}^{l} P(k|Ψ) Γ(α_{t−1}, k),    (8)

where k ∈ [−l, l] is the possible relative jump distance, l is the window size parameter, and P(k|Ψ) stands for the probability of jump distance k conditioned on the context Ψ. The function Γ(·) for shifting the alignment vector is defined as

Γ(α_{t−1}, k)_i = α_{t−1, i−k},    (9)

with positions outside [1, m] zero-padded, which can be implemented as matrix multiplication computations.
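The shift operation Γ, the expected alignment d_t, and the interpolation with the attention output can be sketched in NumPy as follows. This is an illustrative re-implementation under the stated definitions, not the authors' code; it uses explicit slicing where a real system would use matrix multiplications.

```python
import numpy as np

def shift(alpha, k):
    """Gamma: shift an alignment vector by k positions (k > 0 shifts
    right/forward, k < 0 left/backward), zero-padding positions that
    fall outside the sentence. Assumes |k| < len(alpha)."""
    out = np.zeros_like(alpha)
    m = len(alpha)
    if k >= 0:
        out[k:] = alpha[:m - k]
    else:
        out[:m + k] = alpha[-k:]
    return out

def distortion_alignment(alpha_prev, p_jump, l):
    # d_t = sum_{k=-l..l} P(k | Psi) * Gamma(alpha_{t-1}, k)
    return sum(p * shift(alpha_prev, k)
               for k, p in zip(range(-l, l + 1), p_jump))

def interpolate(d_t, alpha_att, lam):
    # alpha_t = lam * d_t + (1 - lam) * alpha_att
    return lam * d_t + (1 - lam) * alpha_att
```

Because each shifted vector keeps only probability mass that stays inside the sentence, d_t may sum to slightly less than 1 near sentence boundaries; the interpolation with the normalized attention vector keeps the ultimate alignment well-behaved.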
We respectively exploit the source context, the target context and the translation status context (hidden states of the decoder) as Ψ, deriving three distortion models: the Source-based Distortion (S-Distortion) model, the Target-based Distortion (T-Distortion) model and the Translation-status-based Distortion (H-Distortion) model. Our framework is capable of utilizing arbitrary context as the condition Ψ to predict the relative jump distances.

S-Distortion model
The S-Distortion model adopts the previous source context c_{t−1} as the context Ψ, with the intuition that a certain source word indicates a certain jump distance: the to-be-translated source word has strong positional relations with the newly translated one.
The underlying linguistic intuition is that synchronous grammars (Yamada and Knight, 2001; Galley et al., 2004) can be extracted from language pairs. Word categories such as verbs, adjectives and prepositions carry general word reordering knowledge, while individual words carry specific word reordering knowledge.
To further illustrate this idea, we present a common synchronous grammar rule that can be extracted from the example in Table 1:

X → ⟨zuixin X_1, latest X_1⟩    (10)

From the above grammar, we can conjecture that after the word "zuixin (latest)" is translated, the translation orientation is forward with shift distance 1. The probability function in the S-Distortion model is defined as

P(k | Ψ = c_{t−1}) = softmax(W_c c_{t−1} + b_c),    (11)

where W_c ∈ R^{(2l+1)×dim(c_{t−1})} and b_c ∈ R^{2l+1} are the weight matrix and bias parameters.

T-Distortion Model
T-Distortion model exploits the embedding of the previous generated target word y t−1 as the context condition to predict the probability distribution of distortion distances. It focuses on the word reordering knowledge upon target word context. As illustrated in Eq.(10), the target word "latest" possesses word reordering knowledge that is identical with source word "zuixin". The probability function in T-Distortion model is defined as follows, where emb(y t−1 ) is the embedding of y t−1 , W y ∈ R (2l+1)×dim(emb(y t−1 )) and b y ∈ R 2l+1 are weight matrix and bias parameters.

H-Distortion Model
The hidden state s_{t−1} reflects the translation status and contains both source and target context information. Therefore, we exploit s_{t−1} as the context Ψ in the H-Distortion model to predict shift distances. The probability function in the H-Distortion model is defined as

P(k | Ψ = s_{t−1}) = softmax(W_s s_{t−1} + b_s),    (13)

where W_s ∈ R^{(2l+1)×dim(s_{t−1})} and b_s ∈ R^{2l+1} are the weight matrix and bias parameters.
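All three distortion models share the same softmax form over the 2l+1 jump distances and differ only in which vector plays the role of Ψ. A minimal sketch, with illustrative parameter shapes:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def jump_distribution(context, W, b):
    """P(k | Psi) = softmax(W @ context + b) over the 2l+1 jump
    distances k in [-l, l]. `context` is c_{t-1} for S-Distortion,
    emb(y_{t-1}) for T-Distortion, or s_{t-1} for H-Distortion."""
    return softmax(W @ context + b)
```

Since the softmax is differentiable in W, b and the context vector, each distortion head trains end-to-end with the rest of the network, as the paper emphasizes.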

Experiments
We carry out the translation task on the Chinese-English direction to evaluate the effectiveness of our models. To investigate the word alignment quality, we conduct a word alignment evaluation on the Tsinghua manually aligned corpus (Liu and Sun, 2015), which contains 900 manually aligned sentence pairs. We also conduct experiments to observe the effects of hyper-parameters and training strategies.

Metrics: The translation quality evaluation metric is the case-insensitive 4-gram BLEU (Papineni et al., 2002). The sign-test (Collins et al., 2005) is exploited for statistical significance testing. The alignment error rate (AER) (Och and Ney, 2003) is calculated to assess the word alignment quality.

Comparison Systems
We compare our approaches with three baseline systems:

Moses (Koehn et al., 2007): An open source phrase-based SMT system with default settings. Words are aligned with GIZA++ (Och and Ney, 2003). The 4-gram language model with modified Kneser-Ney smoothing is trained on the target portion of the training data with SRILM (Stolcke et al., 2002).

Groundhog: An open source attention-based NMT system with default settings.

RNNsearch*: Our in-house implementation of an NMT system with the varietal attention mechanism and the other settings presented in section 4.3.

Training
Hyper-parameters: The maximum sentence length for training the NMT systems is 50, while the SMT model exploits the whole training data without any length restriction. Following Bahdanau et al. (2015), we use a bi-directional Gated Recurrent Unit (GRU) network as the encoder. The forward and backward representations are concatenated at the corresponding position as the ultimate representation of a source word. The word embedding dimension is set to 620 and the hidden layer size to 1000. The interpolation parameter λ is 0.5 and the window size l is set to 3.

Training details:
Square matrices are initialized in a random orthogonal way. Non-square matrices are initialized by sampling each element from the Gaussian distribution with mean 0 and variance 0.01². All biases are initialized to 0. Parameters are updated by mini-batch gradient descent and the learning rate is controlled by the AdaDelta (Zeiler, 2012) algorithm with decay constant ρ = 0.95 and denominator constant ϵ = 1e−6. The batch size is 80. The dropout strategy (Srivastava et al., 2014) is applied to the output layer with a dropout rate of 0.5 to avoid over-fitting. Gradients of the cost function whose L2 norm is larger than a predefined threshold of 1.0 are rescaled to the threshold to avoid gradient explosion (Pascanu et al., 2013). We exploit length normalization (Cho et al., 2014a) on candidate translations, and the beam size for decoding is 12. For NMT with distortion models, we use the trained RNNsearch* model to initialize the parameters, except for those related to distortion.
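The gradient rescaling step described above can be sketched as follows; this is a generic global-norm clipping routine in the style of Pascanu et al. (2013), not the authors' training code.

```python
import numpy as np

def clip_gradients(grads, threshold=1.0):
    """Rescale the whole gradient (a list of parameter-gradient
    arrays) when its global L2 norm exceeds the threshold, leaving
    smaller gradients untouched."""
    norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]
    return grads
```

Rescaling by the global norm preserves the gradient's direction while bounding its magnitude, which is what prevents the explosion rather than per-element clipping.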

Results
The translation quality results are shown in Table 2. We conduct experiments with different vocabulary sizes because different vocabulary sizes cause different degrees of rare word collocations. In this way, we can validate the effect of our proposed models in alleviating the rare word collocation problem that leads to incorrect word alignments.

On 16K vocabularies: The phrase-based Moses performs better than the basic NMT systems, including Groundhog and RNNsearch*. Besides the differences between model architectures, restricted vocabularies and sentence lengths also affect the performance of the NMT systems. However, RNNsearch* with distortion models surpasses phrase-based Moses by 3.60, 4.27 and 4.43 BLEU points on average. RNNsearch* outperforms Groundhog by 1.96 BLEU points on average due to the varietal attention mechanism, length normalization and dropout strategies. The distortion models bring remarkable improvements of 4.26, 4.92 and 5.09 BLEU points over the RNNsearch* model.

On 30K vocabularies: RNNsearch* with distortion models yields average gains of 1.57, 1.21 and 1.45 BLEU points over RNNsearch*, outperforms phrase-based Moses by 6.06, 5.70 and 5.94 BLEU points on average, and surpasses Groundhog by 5.56, 5.20 and 5.44 BLEU points on average. RNNsearch* (16K) with distortion models achieves performance close to RNNsearch* (30K). The improvements on 16K vocabularies are larger than those on 30K vocabularies, matching the intuition that more "UNK" words lead to more rare word collocations, which results in serious attention ambiguities.
The fact that RNNsearch* with distortion models yields tremendous improvements in BLEU scores proves the effectiveness of the proposed approaches in improving translation quality.

Comparison with previous work: We present the performance comparison with previous work that employs identical training corpora in Table 3. Our work evidently outperforms previous work on average. Although we restrict the maximum sentence length to 50, our model achieves the state-of-the-art BLEU scores on almost all test sets except NIST2006.

Table 3: Comparison with previous work on identical training corpora. Coverage (Tu et al., 2016) is a basic RNNsearch model with a coverage model to alleviate the over-translation and under-translation problems. MEMDEC improves translation quality with external memory. NMT IA (Meng et al., 2016) exploits a readable and writable attention mechanism to keep track of interactive history in decoding. Our work is NMT with the H-Distortion model. The vocabulary sizes of all works are 30K and the maximum sentence lengths differ.

Table 4: BLEU-4 scores (%) and AER scores on the Tsinghua manually aligned Chinese-English evaluation set. The lower the AER score, the better the alignment quality.

Analysis
We investigate the effects of our models on alignment quality and conduct experiments to evaluate the influence of the hyper-parameter settings and the training strategies.

Alignment Quality
Distortion models concentrate on attending to to-be-translated words based on word reordering knowledge, which can intuitively enhance word alignment quality. To investigate this effect, we apply the BLEU and AER evaluations on the Tsinghua manually aligned data set. Table 4 lists the BLEU and AER scores of Chinese-English translation with the 30K vocabulary. RNNsearch* (30K) with distortion models achieves significant improvements in BLEU scores and an obvious decrease in AER scores. The results show that the proposed models can effectively improve the word alignment quality. Figure 4 shows the output of the distortion model and the ultimate alignment matrix for the above-mentioned instance. Compared with Figure 1, the alignment matrix produced by NMT with distortion models is more concentrated and accurate. The output of the distortion model demonstrates its capacity for modeling word reordering knowledge.

Effect of Hyper-parameters
To investigate the effects of the weight hyper-parameter λ and the window hyper-parameter l in the proposed model, we conduct experiments on the H-Distortion model with various hyper-parameter settings. We fix l = 3 when exploring the effect of λ, and fix λ = 0.5 when observing the effect of l. Figure 5 presents the translation performance with respect to these hyper-parameters.
With the increase of the weight λ, the BLEU scores first rise and then drop, which shows that the distortion model provides additional helpful information but cannot fully replace the attention mechanism, since it lacks sufficient content-based searching ability. For the window size l, the experiments show that larger windows bring only slight further improvements, which indicates that the distortion model mainly captures short-distance reordering knowledge.

Pre-training VS No Pre-training
We conduct an experiment without the pre-training strategy to observe the effect of initialization. As shown in Table 5, the model without pre-training achieves improvements consistent with the pre-trained one, which verifies the stable effectiveness of our approach. Initialization with the pre-training strategy provides a faster way to obtain the final model, as it requires fewer training iterations.

Related Work
Our work is inspired by the distortion models widely used in SMT. The most related work in SMT is the distortion model proposed by Al-Onaizan and Papineni (2006). Their model is identical to our S-Distortion model in that it captures relative jump distance knowledge on source words. However, our approach is deliberately designed for the attention-based NMT system and is capable of exploiting variant context information to predict the relative jump distances.
Our work is also related to work (Luong et al., 2015a; Feng et al., 2016; Tu et al., 2016; Cohn et al., 2016; Meng et al., 2016) that concentrates on improving the attention mechanism. To reduce the computing cost of the attention mechanism when dealing with long sentences, Luong et al. (2015a) proposed the local attention mechanism, which focuses on only a subscope of source positions. Cohn et al. (2016) incorporated structural alignment biases into the attention mechanism and obtained improvements across several challenging language pairs in low-resource settings. Feng et al. (2016) passed the previous attention context to the attention mechanism by adding recurrent connections as an implicit distortion model. Tu et al. (2016) maintained a coverage vector to keep the attention history and acquire accurate translations. Meng et al. (2016) proposed the interactive attention with attentive read and attentive write operations to keep track of the interaction history. Other work utilized an external memory to store additional information for guiding the attention computation. These works differ from ours in that our distortion models explicitly capture word reordering knowledge by estimating the probability distribution of relative jump distances on source words, thereby incorporating word reordering knowledge into the attention-based NMT.

Conclusions
We have presented three distortion models that enhance attention-based NMT by incorporating word reordering knowledge. The basic idea of the proposed distortion models is to enable the attention mechanism to attend to source words regarding both the semantic requirement and the word reordering penalty. Experiments show that our models can evidently improve the word alignment quality and translation performance. Compared with previous work on identical corpora, our model achieves the state-of-the-art performance on average. Our model can be conveniently applied to attention-based NMT and trained in the end-to-end style. We also investigated the effects of the hyper-parameters and the pre-training strategy, further demonstrating the stable effectiveness of our model. In the future, we plan to validate the effectiveness of our model on more language pairs.