LSTM Neural Reordering Feature for Statistical Machine Translation

Artificial neural networks are powerful models that have been widely applied to many aspects of machine translation, such as language modeling and translation modeling. Though notable improvements have been made in these areas, reordering remains a challenge in statistical machine translation. In this paper, we present a novel neural reordering model that directly models word pairs and alignment. By utilizing LSTM recurrent neural networks, much longer context can be learned for reordering prediction. Experimental results on the NIST OpenMT12 Arabic-English and Chinese-English 1000-best rescoring tasks show that our LSTM neural reordering feature is robust and achieves significant improvements over various baseline systems.


Introduction
In statistical machine translation, the language model, translation model, and reordering model are the three most important components. Among these, the reordering model plays an important role in phrase-based machine translation, and reordering remains a major challenge in current research.
In recent years, various phrase reordering methods have been proposed for phrase-based SMT systems, which can be classified into two broad categories: (1) Distance-based RM: penalizes phrase displacements with respect to the degree of non-monotonicity.
Furthermore, some researchers proposed a reordering model that conditions on both current and previous phrase pairs by utilizing recursive autoencoders (Li et al., 2014).
In this paper, we propose a novel neural reordering feature that includes longer context for predicting orientations. We utilize a long short-term memory recurrent neural network (LSTM-RNN) (Graves, 1997) and directly model word pairs to predict their most probable orientations. Experimental results on NIST OpenMT12 Arabic-English and Chinese-English translation show that our neural reordering model achieves significant improvements over various baselines in a 1000-best rescoring task.

Related Work
Recently, various neural network models have been applied to machine translation.
The feed-forward neural language model was first proposed by Bengio et al. (2003), which was a breakthrough in language modeling. Mikolov et al. (2011) proposed using a recurrent neural network for language modeling, which can include a much longer context history when predicting the next word. Experimental results show that RNN-based language models significantly outperform standard feed-forward language models. Devlin et al. (2014) proposed a neural network joint model (NNJM) that conditions on both source and target language context to predict the target word. Though the network architecture is a simple feed-forward neural network, the results showed significant improvements over state-of-the-art baselines. Sundermeyer et al. (2014) also put forward a neural translation model based on LSTM RNNs and bidirectional RNNs. With bidirectional RNNs, the target word is conditioned not only on the history but also on the future source context, so that the full source sentence is available when predicting target words. Li et al. (2013) proposed using a recursive autoencoder (RAE) to map each phrase pair into a continuous vector and handling reordering with a classifier. They also suggested that including both the current and previous phrase pairs when determining phrase orientations can further improve reordering accuracy (Li et al., 2014).
To the best of our knowledge, this is the first work to use an LSTM-RNN in a reordering model. The RNN architecture lets us include much longer context information when determining phrase orientations. Furthermore, with LSTM units, the network is able to capture much longer-range dependencies than standard RNNs.
Because the SMT decoding step requires the history information to have a fixed length, we use our LSTM-RNN reordering model only as a feature in the 1000-best rescoring step. As word alignments are known once the n-best list has been generated, the LSTM-RNN reordering model can be used to score each hypothesis.
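The rescoring step can be illustrated with a minimal sketch. It assumes the usual log-linear setup in which a hypothesis score is a weighted sum of feature values; the `Hypothesis` layout, the feature names, and the toy numbers are hypothetical and only stand in for a real n-best list and a real LSTM-RNN score.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Hypothesis:
    # Hypothetical n-best entry: a translation plus its baseline feature values.
    text: str
    features: Dict[str, float]


def model_score(hyp: Hypothesis, weights: Dict[str, float],
                reorder_logprob: Callable[[Hypothesis], float],
                reorder_weight: float) -> float:
    """Weighted sum of baseline features plus the weighted reordering feature."""
    base = sum(weights[name] * value for name, value in hyp.features.items())
    return base + reorder_weight * reorder_logprob(hyp)


def rescore(nbest: List[Hypothesis], weights: Dict[str, float],
            reorder_logprob: Callable[[Hypothesis], float],
            reorder_weight: float) -> Hypothesis:
    """Return the hypothesis with the highest rescored model score."""
    return max(nbest, key=lambda hyp: model_score(hyp, weights, reorder_logprob, reorder_weight))


# Toy 2-best list with made-up feature values.
nbest = [
    Hypothesis("a b", {"lm": -2.0, "tm": -1.0}),
    Hypothesis("b a", {"lm": -2.5, "tm": -1.0}),
]
weights = {"lm": 1.0, "tm": 1.0}
toy_logprob = {"a b": -3.0, "b a": -0.5}  # stand-in for the neural reordering score
best = rescore(nbest, weights, lambda hyp: toy_logprob[hyp.text], 1.0)
```

In this toy case the baseline features alone prefer "a b", while adding the reordering feature moves "b a" to the top. In practice the feature weights would be retuned rather than fixed by hand.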

Lexicalized Reordering Model
In traditional statistical machine translation, lexicalized reordering models have been widely used (Koehn et al., 2007). They consider the alignments of the current and previous phrase pairs to determine the orientation.
Formally, given a source language sentence $f = \{f_1, \ldots, f_n\}$, a target language sentence $e = \{e_1, \ldots, e_n\}$, and a phrase alignment $a = \{a_1, \ldots, a_n\}$, the lexicalized reordering model can be written as in Equation 1, which conditions only on $a_{i-1}$ and $a_i$, i.e. the previous and current alignments.
In Equation 1, $o_i$ ranges over the set of phrase orientations. For example, in the most commonly used MSD-based orientation type, $o_i$ takes three values: M stands for monotone, S for swap, and D for discontinuous. The definition of MSD-based orientation is given in Equation 2.
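As an illustration of the MSD orientation type, the sketch below uses the definition commonly given for lexicalized reordering: monotone when the source position advances by one, swap when it moves back by one, and discontinuous otherwise. It assumes each $a_i$ is a single source phrase index and that the position before the first phrase is -1; both are simplifications made here, and the function names are hypothetical.

```python
from typing import List


def msd_orientation(prev_pos: int, cur_pos: int) -> str:
    """Classify one step as monotone (M), swap (S), or discontinuous (D)."""
    diff = cur_pos - prev_pos
    if diff == 1:
        return "M"
    if diff == -1:
        return "S"
    return "D"


def msd_sequence(alignment: List[int]) -> List[str]:
    """Orientation of every target phrase, visited in target order.

    alignment[i] is the source phrase index aligned to the i-th target phrase.
    The start position -1 is an assumption of this sketch.
    """
    orientations = []
    prev = -1
    for cur in alignment:
        orientations.append(msd_orientation(prev, cur))
        prev = cur
    return orientations


example = msd_sequence([0, 1, 3, 2])
```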
Other orientation types, such as LR and MSLR, are also widely used; their definitions can be found on the official Moses website.
Recent studies on reordering models suggest that also conditioning on previous phrase pairs can improve context sensitivity and reduce reordering ambiguity.

LSTM Neural Reordering Model
In order to include more context information for determining reordering, we propose to use a recurrent neural network, which has been shown to perform considerably better than standard feed-forward architectures in sequence prediction (Mikolov et al., 2011). However, an RNN trained with conventional backpropagation suffers from the vanishing gradient problem (Bengio et al., 1994).
Later, long short-term memory was proposed to solve the vanishing gradient problem, and it can capture longer context than standard RNNs with sigmoid activation functions. In this paper, we adopt the LSTM architecture for training our neural reordering model.

Training Data Processing
To reduce model complexity and simplify implementation, our neural reordering model is purely lexicalized and trained at the word level.
We take the LR orientation type as an example; the other orientation types (MSD, MSLR) can be derived similarly. Given a sentence pair and its alignment information, we derive the word-based reordering information by the following steps. Note that we always evaluate the model in the order of the target sentence.
(1) If the current target word has a one-to-one alignment, we can directly determine its orientation, i.e. left or right.
(2) If the current source/target word has a one-to-many alignment, we judge its orientation by considering its first aligned target/source word, and the other aligned target/source words are annotated with the follow reordering type, which means that these word pairs inherit the orientation of the previous word pair.
(3) If the current source/target word is not aligned to any target/source word, we introduce a null token on the opposite side and annotate this word pair with the follow reordering type. Figure 1 shows an example of data processing.
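The three steps above can be sketched as follows for the LR orientation type. This is a minimal illustration under stated assumptions, not the exact procedure used in our experiments: a word pair is labeled right when its source position lies after that of the previously labeled pair and left otherwise, the first word pair is compared against position -1, unaligned source words are skipped, and the `<null>` token and function name are hypothetical.

```python
from typing import Iterable, List, Tuple

NULL = "<null>"  # hypothetical placeholder for the null token


def label_word_pairs(src: List[str], tgt: List[str],
                     alignment: Iterable[Tuple[int, int]]) -> List[Tuple[str, str, str]]:
    """Return (target word, source word, label) triples in target order.

    alignment holds (source index, target index) links.
    """
    links = sorted(alignment)
    labeled = []
    seen_src = set()   # source positions that already received an orientation
    prev_src = -1      # assumed start position before the first source word
    for j, e in enumerate(tgt):
        src_ids = [i for (i, jj) in links if jj == j]
        if not src_ids:
            # Step (3): unaligned target word, paired with a null token.
            labeled.append((e, NULL, "follow"))
            continue
        first = src_ids[0]
        if first in seen_src:
            # Step (2): later target word of a source word aligned to many targets.
            labeled.append((e, src[first], "follow"))
        else:
            # Step (1), or the first link of a one-to-many alignment.
            label = "right" if first > prev_src else "left"
            labeled.append((e, src[first], label))
            prev_src = first
            seen_src.add(first)
        # Step (2): remaining source words of a target word aligned to many sources.
        for i in src_ids[1:]:
            labeled.append((e, src[i], "follow"))
            seen_src.add(i)
    return labeled


pairs = label_word_pairs(["f0", "f1", "f2"], ["e0", "e1", "e2", "e3"],
                         [(0, 0), (2, 1), (1, 2)])
```

Here `e2` is aligned to a source word that precedes the one aligned to `e1`, so it is labeled left, and the unaligned `e3` is paired with the null token and labeled follow.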

LSTM Network Architecture
After processing the training data, we can directly use the word pairs and their orientations to train a neural reordering model. Given a word pair and its orientation, the neural reordering model can be written as in Equation 3.
Here, $e_1^i = \{e_1, \ldots, e_i\}$ and $f_1^{a_i} = \{f_1, \ldots, f_{a_i}\}$. History word pairs are incorporated through the recurrent neural network, which is known for its capability of learning history information.
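For reference, the contrast between the lexicalized model and the neural model can be written out. The following is a reconstruction from the surrounding definitions, not a verbatim copy of Equations 1 and 3; the phrase symbols $\bar{e}_i$ and $\bar{f}_{a_i}$ are notation assumed here for the current phrase pair.

```latex
% Lexicalized model: conditions only on the current phrase pair
% and on the previous and current alignments.
p(o \mid e, f) = \prod_{i=1}^{n} p(o_i \mid \bar{e}_i, \bar{f}_{a_i}, a_{i-1}, a_i)

% Neural model: conditions on the whole history of word pairs up to position i.
p(o \mid e, f) = \prod_{i=1}^{n} p(o_i \mid e_1^{i}, f_1^{a_i}, a_{i-1}, a_i)
```

The only difference under this reading is the conditioning context: a single phrase pair in the first case, the full word-pair history in the second.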
The architecture of the LSTM-RNN reordering model is depicted in Figure 2, and the corresponding equations are given in Equations 4 to 6.
The input layer consists of both the source and target language words, each in one-hot representation. We then apply a linear transformation from the input layer to a projection layer, also called the embedding layer. We adopt the extended LSTM as our hidden layer implementation, which consists of three gating units, i.e. input, forget and output gates. We omit the rather extensive LSTM equations here; they can be found in Graves and Schmidhuber (2005). The output layer consists of one unit per orientation type. For example, in the LR condition, the output layer contains two units: the left and right orientations. Finally, we apply the softmax function to obtain normalized probabilities for each orientation.
(Koehn et al., 2007). Word alignment and phrase extraction are done by GIZA++ (Och and Ney, 2000) with L0-normalization (Vaswani et al., 2012) and the grow-diag-final refinement rule. The monolingual part of the training data is used to train a 5-gram language model using SRILM (Stolcke, 2002). Parameter tuning is done by K-best MIRA (Cherry and Foster, 2012). To ensure stable results, we tune every system 5 times independently and report the average BLEU score (Clark et al., 2011). Translation quality is evaluated by the case-insensitive BLEU-4 metric (Papineni et al., 2002). Statistical significance is tested with the paired bootstrap resampling method at the p < 0.001 level. Our models are evaluated in a 1000-best rescoring step, and all features in the 1000-best list, as well as the LSTM-RNN reordering feature, are retuned via the K-best MIRA algorithm.
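The idea behind paired bootstrap resampling can be sketched as follows. This is a simplified illustration, not the exact test we ran: it resamples per-sentence statistics with replacement and counts how often one system beats the other, and it uses a plain mean of made-up sentence-level scores as a stand-in for corpus-level BLEU. The function names are hypothetical.

```python
import random
from typing import Callable, List, Sequence


def paired_bootstrap(stats_a: Sequence[float], stats_b: Sequence[float],
                     metric: Callable[[List[float]], float],
                     samples: int = 1000, seed: int = 0) -> float:
    """Fraction of resampled test sets on which system A scores higher than B."""
    if len(stats_a) != len(stats_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(stats_a)
    wins = 0
    for _ in range(samples):
        # Draw the same sentence indices for both systems (the "paired" part).
        idx = [rng.randrange(n) for _ in range(n)]
        if metric([stats_a[i] for i in idx]) > metric([stats_b[i] for i in idx]):
            wins += 1
    return wins / samples


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


# Made-up per-sentence scores; system A is better on every sentence.
system_a = [0.5, 0.6, 0.7, 0.8]
system_b = [0.4, 0.5, 0.6, 0.7]
win_rate = paired_bootstrap(system_a, system_b, mean)
```

Under this scheme, system A would be called significantly better at level p when it wins on at least a fraction 1 - p of the resampled sets.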
For neural network training, we use all parallel text from the baseline training. As a trade-off between computational cost and performance, the projection layer and hidden layer sizes are both set to 100, which is enough for our task (we did not see significant gains when increasing the dimensions beyond 100). We use an initial learning rate of 0.01 with standard SGD optimization without momentum. We train the model for a total of 10 epochs with the cross-entropy criterion. The input and output vocabularies are set to 100K and 50K words respectively, and all out-of-vocabulary words are mapped to an unk token.
2 https://catalog.ldc.upenn.edu/LDC2010L01
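To make the data flow through the network concrete, here is a minimal forward-pass sketch in plain Python. It is not our implementation: weights are random and untrained, biases and peephole connections are omitted, and the projection is treated as one linear map over the concatenated one-hot vectors, which amounts to adding a source and a target embedding. The class name and the tiny sizes are hypothetical (our experiments use 100 units for the projection and hidden layers).

```python
import math
import random
from typing import Dict, List, Sequence, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores: List[float]) -> List[float]:
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]


def rand_matrix(rows: int, cols: int, rng: random.Random) -> List[List[float]]:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix: List[List[float]], vec: List[float]) -> List[float]:
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class TinyReorderingLSTM:
    def __init__(self, src_vocab: int, tgt_vocab: int, dim: int,
                 labels: Sequence[str], seed: int = 0) -> None:
        rng = random.Random(seed)
        self.dim = dim
        self.labels = list(labels)
        # Projection layer: one embedding row per source / target word id.
        self.src_emb = rand_matrix(src_vocab, dim, rng)
        self.tgt_emb = rand_matrix(tgt_vocab, dim, rng)
        # One weight matrix per gate, applied to [projection; previous hidden state].
        self.gates = {g: rand_matrix(dim, 2 * dim, rng) for g in ("i", "f", "o", "c")}
        self.out = rand_matrix(len(self.labels), dim, rng)

    def step(self, src_id: int, tgt_id: int, h: List[float],
             c: List[float]) -> Tuple[Dict[str, float], List[float], List[float]]:
        x = [s + t for s, t in zip(self.src_emb[src_id], self.tgt_emb[tgt_id])]
        xh = x + h  # list concatenation
        i_gate = [sigmoid(v) for v in matvec(self.gates["i"], xh)]
        f_gate = [sigmoid(v) for v in matvec(self.gates["f"], xh)]
        o_gate = [sigmoid(v) for v in matvec(self.gates["o"], xh)]
        cand = [math.tanh(v) for v in matvec(self.gates["c"], xh)]
        c = [f * cv + i * g for f, cv, i, g in zip(f_gate, c, i_gate, cand)]
        h = [o * math.tanh(cv) for o, cv in zip(o_gate, c)]
        probs = softmax(matvec(self.out, h))
        return dict(zip(self.labels, probs)), h, c

    def score(self, pairs: Sequence[Tuple[int, int]]) -> List[Dict[str, float]]:
        """Orientation distribution for each (source id, target id) pair, in target order."""
        h = [0.0] * self.dim
        c = [0.0] * self.dim
        dists = []
        for src_id, tgt_id in pairs:
            dist, h, c = self.step(src_id, tgt_id, h, c)
            dists.append(dist)
        return dists


model = TinyReorderingLSTM(src_vocab=5, tgt_vocab=5, dim=4, labels=("left", "right"))
dists = model.score([(0, 1), (2, 3)])
```

Each entry of `dists` is a normalized distribution over the orientation labels. Because the hidden and cell states are carried from one word pair to the next, every prediction depends on the whole preceding history.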

Results on Different Orientation Types
First, we test our neural reordering model (NRM) on the baseline that contains a word-based reordering model with LR orientation. The results are shown in Tables 2 and 3.
As can be seen, across the various orientation types (LR, MSD, MSLR), our model gives consistent improvements over the baseline system. The overall BLEU improvements range from 0.42 to 0.79 for the Arabic-English system, and from 0.31 to 0.72 for the Chinese-English system. All NRM results are significantly better than the baselines (at the p < 0.001 level).
We also find that "Left-Right" based orientation methods, such as LR and MSLR, consistently outperform MSD-based orientations. This may be caused by the non-separability problem, which means that MSD-based methods are vulnerable to changes of context and weak in resolving reordering ambiguities. A similar conclusion can be found in Li et al. (2014).

Results on Different Reordering Baselines
We also test our approach on various baselines, which contain either a word-based, phrase-based, or hierarchical phrase-based reordering model. We only show the results for the MSLR orientation, which is superior to the others according to the results in Section 5.2.
Table 4: Results on various baselines for the Arabic-English and Chinese-English systems. "wbe": word-based; "phr": phrase-based; "hier": hierarchical phrase-based reordering model. All NRM results are significantly better than the baselines (at the p < 0.001 level).

In Tables 4 and 5, we can see that even when we add a strong hierarchical phrase-based reordering model to the baseline, our model still brings a maximum gain of 0.59 BLEU, which suggests that our model is applicable and robust in various circumstances. However, we notice that the gains in the Arabic-English system are relatively greater than those in the Chinese-English system. This is probably because hierarchical reordering features tend to work better for Chinese, which leaves our model less room to improve on that baseline.

Conclusions
We present a novel reordering model built with an LSTM-RNN, which is much more sensitive to changes of context and introduces rich context information for reordering prediction. Furthermore, the proposed model is purely lexicalized and straightforward, which makes it easy to implement. Experimental results on 1000-best rescoring show that our neural reordering feature is robust and gives consistent improvements over various baseline systems.
In the future, we plan to extend our word-based LSTM reordering model to a phrase-based reordering model, in order to resolve more ambiguities and improve reordering accuracy. Furthermore, we are also going to integrate our neural reordering model into neural machine translation systems.