Recursive Neural Network Based Preordering for English-to-Japanese Machine Translation

The word order difference between source and target languages significantly influences translation quality. Preordering can effectively address this problem, but previous preordering methods require manual feature design, which makes language-dependent design costly. In this paper, we propose a preordering method based on recursive neural networks that learn features from raw inputs. Experiments show that the proposed method is comparable to the state-of-the-art method, yet requires no manual feature design.


Introduction
The word order difference between source and target languages significantly influences translation quality in statistical machine translation (SMT) (Tillmann, 2004; Hayashi et al., 2013; Nakagawa, 2015). Models that adjust the order of translated phrases during decoding have been proposed to solve this problem (Tillmann, 2004; Koehn et al., 2005; Nagata et al., 2006). However, such reordering models do not perform well for long-distance reordering, and their computational costs are high. To address these problems, preordering (Xia and McCord, 2004; Wang et al., 2007; Xu et al., 2009; Isozaki et al., 2010b; Gojun and Fraser, 2012; Nakagawa, 2015) and postordering (Goto et al., 2012, 2013; Hayashi et al., 2013) models have been proposed. Preordering reorders source sentences before translation, while postordering reorders sentences that were translated without considering word order. In particular, preordering effectively improves translation quality because it solves long-distance reordering and computational complexity issues (Jehl et al., 2014; Nakagawa, 2015).
Rule-based preordering methods either manually create reordering rules (Wang et al., 2007; Xu et al., 2009; Isozaki et al., 2010b; Gojun and Fraser, 2012) or extract reordering rules from a corpus (Xia and McCord, 2004; Genzel, 2010). On the other hand, studies such as (Neubig et al., 2012; Lerner and Petrov, 2013; Hoshino et al., 2015; Nakagawa, 2015) apply machine learning to the preordering problem. Hoshino et al. (2015) proposed a method that learns whether the child nodes should be swapped at each node of a syntax tree. Neubig et al. (2012) and Nakagawa (2015) proposed methods that construct a binary tree and its reordering simultaneously from a source sentence. These methods require manual feature design for every language pair, which makes language-dependent design costly. To overcome this challenge, methods based on feed-forward neural networks that do not require manual feature design have been proposed (de Gispert et al., 2015; Botha et al., 2017). However, these methods decide whether to reorder child nodes without considering the sub-trees, which contain important information for reordering.
As a preordering method that is free of manual feature design and makes use of the information in sub-trees, we propose a preordering method with a recursive neural network (RvNN). The RvNN computes reordering decisions in a bottom-up manner (from the leaf nodes to the root) on a source syntax tree; thus, preordering is performed considering the entire sub-trees. Specifically, the RvNN learns whether to reorder the child nodes at each node of a syntax tree using vector representations of sub-trees and syntactic categories. We evaluate the proposed method on English-to-Japanese translation using both phrase-based SMT (PBSMT) and neural MT (NMT). The results confirm that the proposed method achieves translation quality comparable to the state-of-the-art preordering method (Nakagawa, 2015), which requires manual feature design.

Preordering with a Recursive Neural Network
We explain our design of the RvNN for preordering after describing how to obtain gold-standard labels for preordering.

Gold-Standard Labels for Preordering
We created training data for preordering by labeling whether each node of the source-side syntax tree has reordered child nodes with respect to the target-side sentence. The label is determined based on Kendall's τ (Kendall, 1938), as in (Nakagawa, 2015), which is calculated by Equation (1).
where y is a vector of the target word indexes that are aligned with the source words. The value of Kendall's τ lies in [−1, 1]. When it is 1, the sequence y is in completely ascending order, i.e., the target sentence has the same word order as the source in terms of word alignment. At each node, if Kendall's τ increases by reordering the child nodes, an "Inverted" label is assigned; otherwise, a "Straight" label, meaning the child nodes need not be reordered, is assigned. When a source word of a child node has no alignment, a "Straight" label is assigned.
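The labeling procedure above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the authors' code: `kendall_tau` uses the standard tie-free formulation τ = (concordant − discordant) / (n(n−1)/2), and `node_label` compares τ before and after swapping the children's aligned target indexes.

```python
def kendall_tau(y):
    """Kendall's tau of a sequence of target-side word indexes (cf. Eq. (1))."""
    n = len(y)
    if n < 2:
        return 1.0
    # Count concordant pairs; with no ties, discordant = total - concordant.
    concordant = sum(
        1 for i in range(n) for j in range(i + 1, n) if y[i] < y[j]
    )
    return 4.0 * concordant / (n * (n - 1)) - 1.0

def node_label(left_align, right_align):
    """Assign "Straight" or "Inverted" to a node from its children's alignments.

    left_align / right_align: target word indexes aligned to the source words
    covered by the left / right child. If a child has no alignment, the node
    is labeled "Straight", as described in the text.
    """
    if not left_align or not right_align:
        return "Straight"
    straight = kendall_tau(left_align + right_align)
    inverted = kendall_tau(right_align + left_align)
    return "Inverted" if inverted > straight else "Straight"

# Example: left child aligns to target words 3,4 and right child to 1,2,
# so swapping the children yields ascending target order.
print(node_label([3, 4], [1, 2]))  # Inverted
```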

Preordering Model
The RvNN is constructed over a binary syntax tree and predicts, at each node, the label defined in Section 2.1. It decides whether to reorder the child nodes by considering the sub-tree; the vector of a sub-tree is calculated in a bottom-up manner from the leaf nodes. Figure 1 shows an example of preordering of the English sentence "My parents live in London." At the VP node corresponding to "live in London," the vector of the node is calculated by Equation (2) from its child nodes corresponding to "live" and "in London," where f is a rectifier, W ∈ R^{2λ×λ} is a weight matrix, p_l and p_r are the vector representations of the left and right child nodes, respectively, and [·; ·] denotes the concatenation of two vectors. W_s ∈ R^{λ×2} is the weight matrix of the output layer, and b ∈ R^λ and b_s ∈ R^2 are the biases. The weight vector s ∈ R^2 calculated by Equation (3) is fed into a softmax function to obtain the probabilities of the "Straight" and "Inverted" labels.
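The equations themselves were lost in extraction; from the stated dimensions they can plausibly be reconstructed as follows (the transposes follow from W ∈ R^{2λ×λ} and W_s ∈ R^{λ×2}; this is a reconstruction, not the paper's exact typography):

```latex
\mathbf{p} = f\!\left(W^{\top}\,[\mathbf{p}_l;\mathbf{p}_r] + \mathbf{b}\right) \tag{2}
```
```latex
\mathbf{s} = W_s^{\top}\,\mathbf{p} + \mathbf{b}_s \tag{3}
```

Here s is subsequently passed through a softmax to yield the label probabilities.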
At a leaf node, a word embedding calculated by Equations (4) and (5) is fed into Equation (2).
where x ∈ R^N is a one-hot vector of an input word with vocabulary size N, W_E ∈ R^{N×λ} is an embedding matrix, and b_l ∈ R^λ is the bias. The loss function is the cross entropy defined by Equation (6).
where θ denotes the model parameters, n ranges over the nodes of a syntax tree T, K is the mini-batch size, and l_n^k is the label of the n-th node in the k-th syntax tree in the mini-batch.
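Equations (4)–(6) were also lost in extraction. One plausible reconstruction consistent with the surrounding definitions (the split of the embedding computation into two equations is an assumption):

```latex
\tilde{\mathbf{e}} = W_E^{\top}\,\mathbf{x} \tag{4}
```
```latex
\mathbf{e} = f\!\left(\tilde{\mathbf{e}} + \mathbf{b}_l\right) \tag{5}
```
```latex
L(\theta) = -\frac{1}{K}\sum_{k=1}^{K}\sum_{n \in T_k} \log p\!\left(l_n^{k} \mid \theta\right) \tag{6}
```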
For the model using POS tags and syntactic categories, we use Equation (7) instead of Equation (2).
where e_t is the vector of a POS tag or syntactic category, W_t ∈ R^{3λ×λ} is a weight matrix, and b_t ∈ R^λ is the bias. e_t is calculated in the same manner as Equations (4) and (5), but the input is a one-hot vector of the POS tag or syntactic category at each node. λ is tuned on the development set; its effect is investigated in Section 3.2.
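Under the dimensions stated above, Equation (7) reads p = f(W_t^⊤ [p_l; p_r; e_t] + b_t). The bottom-up computation can be sketched in NumPy as follows. This is an illustrative toy, not the authors' implementation: the sizes, the random initialization, and the tag-embedding matrix `W_T` are all assumptions for demonstration.

```python
import numpy as np

LAM = 8      # hidden size (lambda); toy value, tuned on dev data in the paper
VOCAB = 10   # word vocabulary size N (toy value)
TAGS = 5     # number of POS tags / syntactic categories (toy value)

rng = np.random.default_rng(0)
W_E = rng.normal(0, 0.1, (VOCAB, LAM))    # word embedding matrix, Eqs. (4)-(5)
b_l = np.zeros(LAM)
W_T = rng.normal(0, 0.1, (TAGS, LAM))     # tag embedding matrix (assumed analogue of W_E)
W_t = rng.normal(0, 0.1, (3 * LAM, LAM))  # composition weights, Eq. (7)
b_t = np.zeros(LAM)
W_s = rng.normal(0, 0.1, (LAM, 2))        # output layer, Eq. (3)
b_s = np.zeros(2)

def relu(v):
    return np.maximum(v, 0.0)

def leaf(word_id):
    # Leaf vector from the word embedding (one plausible reading of Eqs. (4)-(5)).
    return relu(W_E[word_id] + b_l)

def compose(p_l, p_r, tag_id):
    # Eq. (7): combine child vectors with the node's category embedding.
    e_t = relu(W_T[tag_id] + b_l)
    return relu(np.concatenate([p_l, p_r, e_t]) @ W_t + b_t)

def label_probs(p):
    # Eq. (3) followed by softmax: P("Straight"), P("Inverted").
    s = p @ W_s + b_s
    z = np.exp(s - s.max())
    return z / z.sum()

# "live (in London)": leaves -> PP node -> VP node, bottom-up.
live, in_, london = leaf(2), leaf(3), leaf(4)
pp = compose(in_, london, tag_id=3)
vp = compose(live, pp, tag_id=4)
probs = label_probs(vp)
print(probs)  # two label probabilities summing to 1
```

At training time the cross-entropy loss of Equation (6) would be computed from these probabilities against the gold labels of Section 2.1.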

Settings
We conducted English-to-Japanese translation experiments using the ASPEC corpus (Nakazawa et al., 2016). This corpus provides 3M sentence pairs as training data, 1,790 sentence pairs as development data, and 1,812 sentence pairs as test data. We used Stanford CoreNLP for tokenization and POS tagging, Enju for parsing of English, and MeCab for tokenization of Japanese.
For word alignment, we used MGIZA. Source-to-target and target-to-source word alignments were calculated using IBM model 1 and a hidden Markov model, and they were combined with the intersection heuristic following (Nakagawa, 2015). We implemented our RvNN preordering model with Chainer. The ASPEC corpus was created using the sentence alignment method proposed in (Utiyama and Isahara, 2007) and is sorted by alignment confidence score. In this paper, we used 100k sentences sampled from the top 500k sentences as training data for preordering. The vocabulary size N was set to 50k. We used Adam (Kingma and Ba, 2015) with weight decay and gradient clipping for optimization. The mini-batch size K was set to 500.
We compared our model with the state-of-the-art preordering method proposed in (Nakagawa, 2015), hereafter referred to as BTG. We used its publicly available implementation and trained it on the same 100k sentences as our model.
We used the 1.8M source and target sentences as training data for MT, excluding sentence pairs longer than 50 words or with a source-to-target length ratio exceeding 9. For SMT, we used Moses. We trained a 5-gram language model on the target side of the training corpus with KenLM. Tuning was performed by minimum error rate training (Och, 2003). We repeated tuning and testing of each model 3 times and report the average scores. For NMT, we used the attention-based encoder-decoder model of (Luong et al., 2015) with a 2-layer LSTM implemented in OpenNMT (http://opennmt.net/). The sizes of the vocabulary, word embedding, and hidden layer were set to 50k, 500, and 500, respectively. The batch size was set to 64, and the number of epochs was set to 13. Translation quality was evaluated with BLEU (Papineni et al., 2002) and RIBES (Isozaki et al., 2010a), using bootstrap resampling (Koehn, 2004) for the significance test.

Figure 2 shows the learning curve of our preordering model with λ = 200. (The learning curve behaves similarly for other λ values.) Both the training and development losses decreased up to 2 epochs, but the development loss started to increase after 3 epochs. Therefore, we trained for up to 5 epochs and chose the model with the lowest development loss. The source sentences in the translation evaluation were preordered with this model.

Results
Figure 4: Example of a syntax tree with a parse error for the sentence "Avogadro's hypothesis (1811) contributed to the development …" (the phrase "(1811)" was divided into two phrases by mistake). Our preordering result was affected by such parse errors. (Nodes with a horizontal line mean "Inverted".)

Next, we investigated the effect of λ. Table 1 shows the BLEU scores with different λ values, as well as the BLEU score without preordering. In this experiment, PBSMT was trained on a 500k subset of the training data, and the distortion limit was set to 6. Our RvNNs consistently outperformed the plain PBSMT without preordering. The BLEU score improved as λ increased when only word embeddings were considered. In addition, RvNNs incorporating POS tags and syntactic categories achieved even higher BLEU scores, which shows the effectiveness of POS tags and syntactic categories for reordering. For these models, setting λ larger than 200 did not further improve translation quality. Based on these results, we further evaluated the RvNN with POS tags and syntactic categories where λ = 200.

Table 2 shows the BLEU and RIBES scores on the test set for PBSMT and NMT trained on the entire training data of 1.8M sentence pairs. The distortion limit of the SMT systems trained on sentences preordered by RvNN and BTG was set to 0, while that without preordering was set to 6. Compared to the plain PBSMT without preordering, both BLEU and RIBES increased significantly with preordering by RvNN and BTG. These scores were comparable (statistically insignificant at p < 0.05) between RvNN and BTG, indicating that the proposed method achieves translation quality comparable to BTG. In contrast to the case of PBSMT, NMT without preordering achieved a significantly higher BLEU score than the NMT models with preordering by RvNN and BTG. This is the same phenomenon as in the Chinese-to-Japanese translation experiment reported in (Sudoh and Nagata, 2016).
We assume that one reason is the isolation between the preordering and NMT models, which are trained with independent objective functions. In the future, we will investigate this problem and consider a model that unifies preordering and translation. Figure 3 shows the distribution of Kendall's τ in the original training data as well as the distributions after preordering by RvNN and BTG; the shift toward higher Kendall's τ indicates that the proposed method learns preordering properly. Furthermore, the ratio of high Kendall's τ achieved by RvNN is larger than that of BTG, implying that preordering by RvNN is better than that by BTG.
We also manually investigated the preordering and translation results and found that our model improved both. Table 3 shows a successful preordering and translation example on PBSMT. The word order is notably different between the source and reference sentences; after preordering, the word order of the source matched that of the reference. Because RvNN depends on parsing, sentences with parse errors tended to fail in preordering. For example, the phrase "(1811)" in Figure 4 was divided into two phrases by mistake, and consequently preordering failed. Table 4 shows preordering and translation examples for the sentence in Figure 4. Even so, compared to the translation without preordering, the translation after preordering delivered the correct meaning.

Conclusion
In this paper, we proposed a preordering method for MT that requires no manual feature design. The experiments confirmed that the proposed method achieves translation quality comparable to the state-of-the-art preordering method, which requires manual feature design. As future work, we plan to develop a model that jointly parses and preorders a source sentence. In addition, we plan to integrate preordering into the NMT model.