Learning Word Reorderings for Hierarchical Phrase-based Statistical Machine Translation

Statistical models for reordering source words have been used to enhance hierarchical phrase-based statistical machine translation systems. Existing word reordering models learn the reordering either for any two source words in a sentence or only for two adjacent words. This paper proposes a series of separate sub-models that learn reorderings for word pairs at different distances. Our experiments demonstrate that reordering sub-models for word pairs whose distance is below a specific threshold are useful for improving translation quality. Compared with previous work, our method may exploit helpful word reordering information more effectively and efficiently.


Introduction
The hierarchical phrase-based model (Chiang, 2005) can capture rich translation knowledge with a synchronous context-free grammar. However, selecting proper translation rules during decoding is a challenge, because a huge number of hierarchical rules can be applied to one source sentence. Chiang (2005) used a log-linear model to compute rule weights with features similar to those of Pharaoh (Koehn et al., 2003). Selecting appropriate rules nevertheless requires more effective criteria, and much work has been done on better rule selection. Earlier studies used maximum entropy approaches to integrate rich contextual information for target-side rule selection. Cui et al. (2010) proposed a joint model to select hierarchical rules for both the source and target sides. Hayashi et al. (2010) demonstrated the effectiveness of word reordering information within hierarchical phrase-based SMT by integrating the word reordering model of Tromble and Eisner (2009) into the decoder as a feature, which estimates the probability that any two source words in a sentence are reordered during translation. Feng et al. (2013) proposed a word reordering model that learns reorderings only for adjacent words, which greatly reduced the computational cost compared with the model of Tromble and Eisner (2009) and still achieved a significant reordering improvement over the baseline system.
In this paper, we incorporate word reordering information into hierarchical phrase-based SMT by training a series of separate reordering sub-models for word pairs at different distances. We demonstrate that translation performance improves consistently as more sub-models for longer-distance reorderings are integrated, but that the improvement levels off quickly. In other words, sub-models for reordering distances longer than a given threshold do not improve translation quality significantly. Compared with previous models (Tromble and Eisner, 2009; Feng et al., 2013), our method makes full use of helpful word reordering information while avoiding unnecessary computation for long-distance reorderings. In addition, our reordering model is learned by a feed-forward neural network (FNN) for better performance and uses an efficient caching strategy to further reduce the time cost.
Phrase reordering models have also been integrated into hierarchical phrase-based SMT. Phrase reordering models were originally developed for phrase-based SMT (Koehn et al., 2005; Zens and Ney, 2006; Ni et al., 2009) and could not be used in the hierarchical phrase-based model directly. Nguyen and Vogel (2013) and Cao et al. (2014) proposed to integrate phrase-based reordering features into hierarchical phrase-based SMT. However, their work was limited to learning the reordering of adjacent phrases. For short phrases, and in the extreme case where the phrase length is one, their model only learned reorderings for adjacent word pairs, as in the work of Feng et al. (2013), whereas our model can be applied to word pairs at longer distances.

Our Approach
Let $e_1^m = e_1, \ldots, e_m$ be a target translation of $f_1^l = f_1, \ldots, f_l$, and let $A$ be the word alignments between $e_1^m$ and $f_1^l$. Our model estimates the reordering probability of the source sentence as follows:

$$\Pr(f_1^l, e_1^m, A) = \prod_{n=1}^{N} \prod_{i=1}^{l-n} \Pr(f_1^l, e_1^m, A, i, j), \quad j = i + n$$

where $\Pr(f_1^l, e_1^m, A, i, j)$ is the reordering probability of the word pair $\langle f_i, f_j \rangle$ during translation, and $N$ is the maximum distance for source word reordering. $N$ is determined empirically, on the supposition that estimating reorderings longer than $N$ no longer improves translation performance.
Previous word reordering models (Tromble and Eisner, 2009; Feng et al., 2013) classify a source word pair as either reversed or not reversed. When a source word is aligned to several non-adjacent target words, it can be hard to determine whether a word pair is reversed or not. These models solved the problem by using only one of the multiple alignments and ignoring the others. In contrast, our model handles all alignments, as shown below.
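Suppose that $f_i$ is aligned to $\pi_i$ ($\pi_i \geq 0$) target words. When $\pi_i > 0$, $\{a_{ik} \mid 1 \leq k \leq \pi_i\}$ stands for the positions of the target words aligned to $f_i$. If $\pi_i = 0$ or $\pi_j = 0$, $\Pr(f_1^l, e_1^m, A, i, j) = 1$; otherwise,

$$\Pr(f_1^l, e_1^m, A, i, j) = \prod_{u=1}^{\pi_i} \prod_{v=1}^{\pi_j} \Pr(o_{ijuv} \mid f_{i-3}, \ldots, f_{j+3}, e_{a_{iu}}, e_{a_{jv}})$$

where $o_{ijuv} = 1$ if $a_{iu} > a_{jv}$ (the pair is reversed) and $o_{ijuv} = 0$ otherwise.

We train a series of sub-models $M_1, \ldots, M_N$ to learn reorderings for word pairs at different distances. That is, for the word pair $\langle f_i, f_j \rangle$ with distance $j - i = n$, its reordering probability $\Pr(o_{ijuv} \mid f_{i-3}, \ldots, f_{j+3}, e_{a_{iu}}, e_{a_{jv}})$ is estimated by $M_n$. Different sub-models are trained and integrated into the translation system separately.

Algorithm 1 (fragment): $\langle f_{i-3}, \ldots, f_{j+3}, e_{a_{iu}}, e_{a_{jv}}, 1 \rangle$ is a positive instance for $M_{j-i}$.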
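The way pair-level probabilities combine into a sentence-level probability can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes that the probability of a word pair is the product over all of its alignment pairs and that the sentence probability is the product over all pairs with distance at most N. The names `pair_probability`, `sentence_probability` and `order_prob` are hypothetical; `order_prob` stands in for a trained sub-model.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a trained sub-model: given source positions i, j,
# target positions a_iu, a_jv and the orientation o (1 = reversed), it returns
# Pr(o | context). A real sub-model would look at the words, not the positions.
OrderProb = Callable[[int, int, int, int, int], float]


def pair_probability(i: int, j: int,
                     alignments: Dict[int, List[int]],
                     order_prob: OrderProb) -> float:
    """Reordering probability of the source pair (f_i, f_j), using all alignments."""
    a_i = alignments.get(i, [])
    a_j = alignments.get(j, [])
    if not a_i or not a_j:
        # An unaligned word contributes probability 1, as stated in the text.
        return 1.0
    prob = 1.0
    for a_iu in a_i:
        for a_jv in a_j:
            o = 1 if a_iu > a_jv else 0  # assumed: 1 means the pair is reversed
            prob *= order_prob(i, j, a_iu, a_jv, o)
    return prob


def sentence_probability(length: int,
                         alignments: Dict[int, List[int]],
                         order_prob: OrderProb,
                         max_distance: int) -> float:
    """Product over all source pairs whose distance is between 1 and max_distance."""
    prob = 1.0
    for n in range(1, max_distance + 1):
        for i in range(1, length - n + 1):
            prob *= pair_probability(i, i + n, alignments, order_prob)
    return prob
```

Positions are 1-based to match the notation of the text. In a decoder one would normally work with log-probabilities instead of raw products to avoid underflow.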
Each sub-model $M_n$ is implemented by an FNN with the same structure as the neural language model of Vaswani et al. (2013). The input to $M_n$ is a sequence of $n + 9$ words: $f_{i-3}, \ldots, f_{j+3}, e_{a_{iu}}, e_{a_{jv}}$. The input layer projects each word into a high-dimensional vector using a matrix of input word embeddings. Two hidden layers combine all input data. The output layer has two neurons that give $\Pr(o_{ijuv} = 1 \mid f_{i-3}, \ldots, f_{j+3}, e_{a_{iu}}, e_{a_{jv}})$ and $\Pr(o_{ijuv} = 0 \mid f_{i-3}, \ldots, f_{j+3}, e_{a_{iu}}, e_{a_{jv}})$.

Figure 1: A Chinese-English sentence pair. The English side reads "That guy who wears glasses is James".
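The forward pass of such a sub-model can be sketched in plain Python. This is an illustrative sketch only: the class name `ReorderingFNN`, the layer sizes, the ReLU activation, the absence of bias terms and the random initialisation are all assumptions, and the output order [Pr(o = 0), Pr(o = 1)] is a convention chosen here. The actual models in the paper were trained with NPLM.

```python
import math
import random
from typing import Iterable, List


def make_matrix(rows: int, cols: int, rng: random.Random) -> List[List[float]]:
    """Small random weight matrix (untrained; for illustration only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix: List[List[float]], vector: List[float]) -> List[float]:
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector: List[float]) -> List[float]:
    return [max(0.0, x) for x in vector]


def softmax(vector: List[float]) -> List[float]:
    top = max(vector)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in vector]
    total = sum(exps)
    return [e / total for e in exps]


class ReorderingFNN:
    """Hypothetical sub-model M_n: embeddings, two hidden layers, two outputs."""

    def __init__(self, vocab: Iterable[str], n: int,
                 embed_dim: int = 8, hidden_dim: int = 16, seed: int = 0):
        rng = random.Random(seed)
        self.input_len = n + 9  # f_{i-3} .. f_{j+3} plus two target words
        words = sorted(set(vocab) | {"<unk>"})
        self.embeddings = {
            w: [rng.uniform(-0.1, 0.1) for _ in range(embed_dim)] for w in words
        }
        self.w1 = make_matrix(hidden_dim, self.input_len * embed_dim, rng)
        self.w2 = make_matrix(hidden_dim, hidden_dim, rng)
        self.w_out = make_matrix(2, hidden_dim, rng)

    def forward(self, words: List[str]) -> List[float]:
        """Return [Pr(o = 0 | input), Pr(o = 1 | input)]."""
        if len(words) != self.input_len:
            raise ValueError(f"expected {self.input_len} words, got {len(words)}")
        x: List[float] = []
        for w in words:  # concatenate the embeddings of all input words
            x.extend(self.embeddings.get(w, self.embeddings["<unk>"]))
        h1 = relu(matvec(self.w1, x))
        h2 = relu(matvec(self.w2, h1))
        return softmax(matvec(self.w_out, h2))
```

Because the input length is fixed at n + 9, each distance n needs its own network, which is why the sub-models are separate.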
The backpropagation algorithm is used to train these reordering sub-models. The training instances for each sub-model are extracted from the word-aligned parallel corpus according to Algorithm 1. For example, the word pair "戴(wears) 男生(guy)" in Figure 1 will be extracted as a positive instance for $M_3$. The input of this instance is as follows: "<s> <s> 那个 戴 眼镜 的 男生 是 詹姆士 </s> wears guy", where <s> and </s> represent the beginning and end of a sentence. If a word never occurs or occurs only once in the training corpus, we replace it with a special symbol <unk>.
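The instance extraction described above can be sketched as follows. Algorithm 1 itself is not reproduced in this text, so this is an assumed reading of it: the function name `extract_instances`, the emission of negative instances with label 0, and the word alignment used in the usage note are hypothetical. Replacement of rare words by <unk> is omitted for brevity.

```python
from typing import Dict, List, Tuple

Instance = Tuple[int, List[str], int]  # (distance n, n + 9 input words, label)


def extract_instances(source: List[str], target: List[str],
                      alignments: Dict[int, List[int]],
                      max_distance: int) -> List[Instance]:
    """Extract reordering instances for sub-models M_1 .. M_max_distance.

    alignments maps a 1-based source position to the 1-based target
    positions it is aligned to. Label 1 marks a reversed pair.
    """
    length = len(source)

    def src(k: int) -> str:
        # Positions outside the sentence are padded with boundary symbols.
        if k < 1:
            return "<s>"
        if k > length:
            return "</s>"
        return source[k - 1]

    instances: List[Instance] = []
    for n in range(1, max_distance + 1):
        for i in range(1, length - n + 1):
            j = i + n
            window = [src(k) for k in range(i - 3, j + 4)]  # f_{i-3} .. f_{j+3}
            for a_iu in alignments.get(i, []):
                for a_jv in alignments.get(j, []):
                    label = 1 if a_iu > a_jv else 0
                    words = window + [target[a_iu - 1], target[a_jv - 1]]
                    instances.append((n, words, label))
    return instances
```

Applied to the sentence pair of Figure 1 with 戴 aligned to "wears" and 男生 aligned to "guy", the pair at source positions 2 and 5 yields the twelve-word input quoted in the text with label 1, since "wears" follows "guy" on the English side.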

Integration into the Decoder
In the hierarchical phrase-based model, a translation rule $r$ has the form $X \rightarrow \langle \gamma, \alpha, \sim \rangle$, where $X$ is a nonterminal, $\gamma$ and $\alpha$ are respectively source and target strings of terminals and nonterminals, and $\sim$ is the alignment between nonterminals and terminals in $\gamma$ and $\alpha$.
Each rule has several features, and the feature weights are tuned by the minimum error rate training (MERT) algorithm (Och, 2003). To integrate our model into the hierarchical phrase-based translation system, a new feature $\mathrm{score}_n(r)$ is added to each rule $r$ for each $M_n$. The score of this feature is calculated during decoding. Note that these scores are calculated separately for the different sub-models $M_n$, and the sub-model weights are tuned separately.
Suppose that $r$ is applied to the input sentence $f_1^l$. For example, if the rule "X1 X2 男生 → X1 guy X2" is applied to the input sentence in Figure 1, then the source word pairs concerned are $\langle 1, 2 \rangle$, $\langle 1, 3 \rangle$, $\langle 1, 4 \rangle$, $\langle 1, 5 \rangle$, $\langle 2, 5 \rangle$, $\langle 3, 5 \rangle$ and $\langle 4, 5 \rangle$.

One concern in using target features is computational efficiency, because reordering probabilities have to be calculated during decoding. We therefore cache probabilities to reduce the expensive neural network computation in our experiments.
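The caching idea can be sketched as a simple memoisation layer in front of the neural network. This is a minimal sketch under stated assumptions: the names `CachedProbability`, `pairs_at_distance` and `rule_score` are hypothetical, and the text does not spell out how the feature score is computed, so summing log-probabilities over the word pairs at distance n is an assumption made here for illustration.

```python
import math
from typing import Callable, Dict, Hashable, List, Tuple

Pair = Tuple[int, int]


class CachedProbability:
    """Memoises an expensive probability function, keyed by its input."""

    def __init__(self, expensive_fn: Callable[[Hashable], float]):
        self._fn = expensive_fn
        self._cache: Dict[Hashable, float] = {}
        self.misses = 0  # number of real model evaluations

    def __call__(self, key: Hashable) -> float:
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._fn(key)
        return self._cache[key]


def pairs_at_distance(pairs: List[Pair], n: int) -> List[Pair]:
    """Select the source word pairs (i, j) that sub-model M_n is responsible for."""
    return [(i, j) for i, j in pairs if j - i == n]


def rule_score(pairs: List[Pair], n: int,
               pair_prob: Callable[[Pair], float]) -> float:
    """Assumed form of the feature: sum of log-probabilities at distance n."""
    return sum(math.log(pair_prob(pair)) for pair in pairs_at_distance(pairs, n))
```

In a real decoder the cache key would be the tuple of n + 9 input words rather than a pair of positions, so that identical contexts met under different hypotheses share one network evaluation. For the seven word pairs of the example rule, the first sub-model would handle (1, 2) and (4, 5).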

Experiments
We evaluated the proposed approach on Chinese-to-English (CE) and Japanese-to-English (JE) translation tasks. The official datasets for the patent machine translation task at NTCIR-9 (Goto et al., 2011) were used. The detailed statistics for the training, development and test sets are given in Table 1. In NTCIR-9, both the development and test sets were provided for the CE task, while only the test set was provided for the JE task. Therefore, we used the sentences from the NTCIR-8 JE test set as the development set for the JE task. Word segmentation was done by BaseSeg (Zhao et al., 2006; Zhao and Kit, 2008; Zhao and Kit, 2011; Zhao et al., 2013) for Chinese and Mecab for Japanese.
To learn neural reordering models, the training and development sets were put together to obtain symmetric word alignments using GIZA++ (Och and Ney, 2003) and the grow-diag-final-and heuristic (Koehn et al., 2003). The reordering instances extracted from the aligned training and development sets were used as the training and validation data, respectively, for learning neural reordering models. Neural reordering models were trained with the NPLM toolkit (Vaswani et al., 2013). For the CE task, training instances extracted from all 1M sentence pairs were used to train the neural reordering models. For the JE task, training instances were taken from 1M sentence pairs that were randomly selected from all 3.14M sentence pairs.
We also implemented the model of Hayashi et al. (2010) for comparison. The training instances for their model were extracted from the same sentence pairs as ours.

[Table caption fragment: significance tests (Koehn, 2004) w.r.t. BLEU scores. The symbol ≫ represents a significant difference at the p < 0.01 level; > represents a significant difference at the p < 0.05 level; − means not significantly different at p = 0.05.]

For each translation task, a recent version of the Moses hierarchical phrase-based decoder (Koehn et al., 2007) with its training scripts was used as the baseline system Base. We used the default parameters for Moses. A 5-gram language model was trained on the target side of the training corpus by the IRST LM Toolkit with improved Kneser-Ney smoothing.
We integrated our reordering models into Base. Table 2 gives detailed translation results. "Hayashi model" represents the method of Hayashi et al. (2010). "$M_1^j$ ($j = 1, 2, 3, 4$)" means that Base was augmented with the reordering scores calculated from the series of sub-models $M_1$ to $M_j$.
As shown in Table 2, integrating only $M_1$, which predicts the reordering of two adjacent source words, already gave BLEU improvements of 1.8% and 1.2% over the baseline on CE and JE, respectively. As more sub-models for longer-distance reordering were integrated, translation performance improved consistently, though the improvement leveled off quickly. For the CE and JE tasks, $M_n$ with $n \geq 3$ and $n \geq 4$, respectively, gave no further performance improvement at any significance level.
Why did the improvement level off quickly? In other words, why do long-distance reordering models have much less leverage over translation performance than short-distance ones? First, the prediction accuracy decreases as the reordering distance increases. Table 3a gives classification accuracies on the validation data for each sub-model. Accuracy decreases because the input size of a sub-model grows with the reordering distance; that is, long-distance reordering needs to consider more complicated context. Second, we attribute the decreasing influence of the longer-distance reordering models to the redundancy of the predictions among different reordering models. For example, in Figure 1, when the word pairs "男生(guy) 是(is)" and "是(is) 詹姆士(James)" are both predicted to be not reversed, the reordering for "男生(guy) 詹姆士(James)" can be logically determined to be not reversed without any further reordering model prediction. In other words, a long-distance word reordering can sometimes be determined by a series of shorter word reordering pairs. Still, some predictions for longer reorderings are useful. For example, the reordering of "戴(wears) 男生(guy)" cannot be determined when "戴(wears) 眼镜(glasses)" is predicted to be not reversed and "眼镜(glasses) 男生(guy)" is predicted to be reversed. This is why translation performance improves as more sub-models are integrated.
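The redundancy argument amounts to a transitivity check on target-side order, which can be sketched as follows. The function name `infer_long_pair` is hypothetical, and the sketch assumes that each word has a single target position, so that "reversed" and "not reversed" describe a strict order.

```python
from typing import Optional


def infer_long_pair(first_reversed: bool, second_reversed: bool) -> Optional[bool]:
    """Infer the orientation of (a, c) from predictions for (a, b) and (b, c).

    Returns True (reversed) or False (not reversed) when the two shorter
    predictions agree and therefore imply the longer one, and None when
    they disagree and the longer pair remains undetermined.
    """
    if first_reversed == second_reversed:
        # a < b < c on the target side implies a < c; a > b > c implies a > c.
        return first_reversed
    return None
```

With the examples from the text, `infer_long_pair(False, False)` returns `False` for "男生(guy) 詹姆士(James)", while `infer_long_pair(False, True)` returns `None` for "戴(wears) 男生(guy)", which is the case where a longer-distance sub-model still adds information.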
As shown in Table 2, with 4 sub-models integrated, our model improved the baseline system significantly and also clearly outperformed the Hayashi model. This is easy to understand, since our model was trained by a feed-forward neural network in a high-dimensional space and incorporated rich context information, while the Hayashi model used the averaged perceptron algorithm and simple features. Table 3b shows the prediction accuracies of the Hayashi model. Note that the Hayashi model predicts reorderings for all word pairs, but only prediction accuracies for word pairs with distance 4 or less are shown. Compared with Table 3a, the prediction accuracy of our model is much higher than that of the Hayashi model. In practice, an FNN is not suitable for the Hayashi model, since the computation cost of the Hayashi model is already quite high. According to our experiments, using an FNN to reorder all word pairs could cost nearly one minute to translate one sentence, while integrating 4 sub-models cost only 10 seconds (footnote 4). Compared with the Hayashi model, our model not only speeds up decoding but also reduces the training time. Training the Hayashi model is much slower, since word pairs at all distances are used as training data. By using separate sub-models, we can train the sub-models one by one and stop when translation performance no longer improves. Efficiency aside, however, one unified model should in theory perform better than separate sub-models, since separate sub-models do not share training instances and the unified model will suffer less from data sparsity. We therefore ran some extra experiments and trained a single neural network with the same structure as $M_4$ to learn reorderings for all word pairs with distance 4 or less, instead of using 4 separate neural networks. A special word null was used, since word pairs with distances 1, 2 and 3 do not have enough inputs for $M_4$.
The significance test results showed no significant difference in translation performance between one unified model and multiple sub-models. This is because the training corpus for our model is quite large, so the separate training sets are sufficient for each sub-model to learn the reorderings well. Besides, using neural networks to learn these sub-models in a continuous space can relieve the data sparsity problem to some extent.
Note that integrating only $M_4$ into Base did improve the translation quality of Base in our preliminary experiments. However, $M_4$ cannot predict reorderings for word pairs with distance less than 4, so $M_1^3$ is still needed for predicting reorderings of word pairs with distances 1, 2 and 3. Once $M_1^3$ has been integrated, $M_4$ is no longer needed because of the redundancy of the predictions among different reordering models.

Footnote 4: Note that caching was used in all our experiments to reduce the expensive neural network computation cost, and it turned out to be very useful. Without caching, integrating 4 sub-models could cost nearly 7 minutes to translate a sentence.

Conclusion
In this paper, we propose to enhance hierarchical phrase-based SMT by training a series of separate sub-models to learn reorderings for word pairs with distances below a specific threshold, based on the experimental finding that longer-distance reordering models contribute little to translation quality. Compared with the work of Hayashi et al. (2010), our model is much more efficient and keeps all helpful word reordering information. In addition, our reordering model is learned by a feed-forward neural network and incorporates rich context information for better performance. On both Chinese-to-English and Japanese-to-English translation tasks, the proposed model significantly outperforms the previous model.

References
Hai Zhao, Masao Utiyama, Eiichiro Sumita, and Bao-Liang Lu. 2013. An empirical study on word segmentation for Chinese machine translation. In Computational Linguistics and Intelligent Text Processing, pages 248-263.