A Continuous Space Rule Selection Model for Syntax-based Statistical Machine Translation

One of the major challenges for statistical machine translation (SMT) is to choose the appropriate translation rules based on the sentence context. This paper proposes a continuous space rule selection (CSRS) model for syntax-based SMT to perform this context-dependent rule selection. In contrast to existing maximum entropy based rule selection (MERS) models, which use discrete representations of words as features, the CSRS model is learned by a feed-forward neural network and uses real-valued vector representations of words, allowing for better generalization. In addition, we propose a method to train the rule selection models only on minimal rules, which are more frequent and have richer training data compared to non-minimal rules. We tested our model on different translation tasks and the CSRS model outperformed a baseline without rule selection and the previous MERS model by up to 2.2 and 1.1 BLEU points, respectively.


Introduction
In syntax-based statistical machine translation (SMT), especially tree-to-string (Liu et al., 2006; Graehl and Knight, 2004) and forest-to-string (Mi et al., 2008) SMT, a source tree or forest is used as input and translated by a series of tree-based translation rules into a target sentence. A tree-based translation rule can perform reordering and translation jointly by projecting a source subtree into a target string, which can contain both terminals and nonterminals.
One of the difficulties in applying this model is the ambiguity existing in translation rules: a source subtree can have different target translations extracted from the parallel corpus as shown in Figure 1. Selecting correct rules during decoding is a major challenge for SMT in general, and syntax-based models are no exception.
There have been several methods proposed to resolve this ambiguity. The simplest method, used in the first models of tree-to-string translation (Liu et al., 2006), estimated the probability of a translation rule by relative frequencies. For example, in Figure 1, the rule that occurs more frequently in the training data will have a higher score. Later, a maximum entropy based rule selection (MERS, Section 2) model was proposed for syntax-based SMT, which used contextual information for rule selection, such as words surrounding a rule and words covered by nonterminals in a rule. For example, to choose the correct rule from the two rules in Figure 1 for decoding a particular input sentence, if the source phrase covered by "x1" is "a thief" and this child phrase has been seen in the training data, then the MERS model can use this information to determine that the first rule should be applied. However, if the source phrase covered by "x1" is a slightly different phrase, such as "a gunman", it will be hard for the MERS model to select the correct rule, because it treats "thief" and "gunman" as two different and unrelated words.
In this paper, we propose a continuous space rule selection (CSRS, Section 3) model, which is learned by a feed-forward neural network and replaces the discrete representations of words used in the MERS model with real-valued vector representations of words for better generalization. For example, the CSRS model can use the similarity of the word representations for "gunman" and "thief" to infer that "a gunman" is more similar to "a thief" than to "a cold".
In addition, we propose a new method, applicable to both the MERS and CSRS models, to train rule selection models only on minimal rules. These minimal rules are more frequent and have richer training data compared to non-minimal rules, making it possible to further relieve the data sparsity problem.
In experiments (Section 4), we validate the proposed CSRS model and the minimal rule training method on English-to-German, English-to-French, English-to-Chinese and English-to-Japanese translation tasks.
Tree-to-String SMT and MERS

Tree-to-String SMT

In tree-to-string SMT (Liu et al., 2006), a parse tree for the source sentence F is transformed into a target sentence E using translation rules R. Each tree-based translation rule r ∈ R translates a source subtree t̃ into a target string ẽ, which can contain both terminals and nonterminals. During decoding, the translation system examines different derivations for each source sentence and outputs the one with the highest probability. For a translation E of a source sentence F with derivation R, the translation probability is calculated as

Pr(E|F) ∝ exp(∑_k λ_k h_k(E, F, R)),

where the h_k are features used in the translation system and the λ_k are feature weights. The features used in Liu et al. (2006)'s model comprise a language model and simple features based on relative frequencies, which do not consider context information.
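As a concrete illustration, the weighted feature combination can be sketched in a few lines of Python; the feature names, values and weights below are invented for illustration and are not taken from the paper.

```python
import math

def loglinear_score(features, weights):
    """Unnormalized log-linear score: sum over k of lambda_k * h_k."""
    return sum(weights[k] * h for k, h in features.items())

# Hypothetical log-probability features for one candidate derivation.
features = {"lm": math.log(0.01), "rule_freq": math.log(0.2)}
weights = {"lm": 0.5, "rule_freq": 1.0}
score = loglinear_score(features, weights)
```

During decoding, the derivation with the highest such score would be output.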
One of the most important features used in this model is based on the log conditional probability of the target string given the input source subtree, log Pr(ẽ | t̃). This allows the model to determine which target strings are more likely to be used in translation. However, as the correct translation of a rule may depend on context that is not directly included in the rule, this simple context-independent estimate is inherently inaccurate.

Maximum Entropy Based Rule Selection
To perform context-dependent rule selection, the MERS model was proposed for syntax-based SMT. It builds a maximum entropy classifier for each ambiguous source subtree t̃, introducing contextual information C and estimating the conditional probability with a log-linear model,

Pr(ẽ | t̃, C) = exp(∑_k λ_k h_k(ẽ, t̃, C)) / ∑_{ẽ'} exp(∑_k λ_k h_k(ẽ', t̃, C)).

The target strings ẽ are treated as different classes for the classifier. Supposing that r covers source span [f_ϕ, f_ϑ] and target span [e_γ, e_σ], the MERS model uses five kinds of source-side features:

1. Lexical features: words around a rule (e.g. f_{ϕ−1}) and words covered by nonterminals in a rule (e.g. f_{ϕ0}).
2. Part-of-speech features: part-of-speech (POS) of context words that are used as lexical features.
3. Span features: span lengths of source phrases covered by nonterminals in r.
4. Parent features: the parent node of t̃ in the parse tree of the source sentence.
5. Sibling features: the siblings of the root of t̃.
Note that the MERS model does not use features of the source subtree t̃, because the source subtree t̃ is fixed for each classifier.
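The normalized probability computed by such a classifier can be sketched as follows; the candidate target strings and their linear scores (the sums ∑_k λ_k h_k) are hypothetical, and only the softmax form follows the model described above.

```python
import math

def mers_probability(scores):
    """Softmax over candidate target strings of one ambiguous subtree:
    Pr(e~ | t~, C) = exp(score_e) / sum over e' of exp(score_e')."""
    z = sum(math.exp(s) for s in scores.values())
    return {e: math.exp(s) / z for e, s in scores.items()}

# Hypothetical linear scores for two target strings of one source subtree.
probs = mers_probability({"caught the thief": 1.2, "arrested the thief": 0.3})
```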
The MERS model was integrated into the translation system as two additional features. Supposing that the derivation R contains M rules r_1, ..., r_M with ambiguous source subtrees, these two MERS features are

h_1 = ∑_{m=1}^{M} log Pr(ẽ_m | t̃_m, C_m)    and    h_2 = M,

where t̃_m and ẽ_m are the source subtree and the target string contained in r_m, and C_m is the context of r_m. h_1 is the MERS probability feature, and h_2 is a penalty feature counting the number of predictions made by the MERS model.
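The two decoder features can be computed from the per-rule MERS probabilities as sketched below, assuming h_1 is the sum of per-rule log probabilities and h_2 the prediction count; the probability values are invented for illustration.

```python
import math

def mers_features(rule_probs):
    """Given MERS probabilities Pr(e~_m | t~_m, C_m) for the M ambiguous
    rules in one derivation, return the two decoder features:
    h1 = sum of log probabilities, h2 = M (prediction-count penalty)."""
    h1 = sum(math.log(p) for p in rule_probs)
    h2 = len(rule_probs)
    return h1, h2

# Hypothetical probabilities for three ambiguous rules in a derivation.
h1, h2 = mers_features([0.7, 0.9, 0.4])
```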

Modeling
The proposed CSRS model differs from the MERS model in three ways.
1. Instead of learning a separate classifier for each source subtree t̃, it learns a single classifier for all rules.
2. Instead of hand-crafted features, it uses a feed-forward neural network to induce features from context words.
3. Instead of one-hot representations, it uses distributed representations to exploit similarities between words.
First, with regard to training, our CSRS model follows Zhang et al. (2015) in approximating the posterior probability by a binary classifier,

Pr(ẽ | t̃, C) ≈ Pr(v = 1 | ẽ, t̃, C),

where v ∈ {0, 1} is an indicator of whether t̃ is translated into ẽ. This is in contrast to the MERS model, which treated rule selection as a multi-class classification task. If we instead attempted to estimate output probabilities for all different ẽ, the cost of estimating the normalization coefficient would be prohibitive, as the number of unique output-side word strings ẽ is large.
There are a number of remedies for this, including noise contrastive estimation (Vaswani et al., 2013), but the binary approximation method has been reported to have better performance (Zhang et al., 2015). To learn this model, we use a feed-forward neural network with a structure similar to neural network language models (Vaswani et al., 2013). The input of the neural rule selection model is a vector representation for t̃, another vector representation for ẽ, and a set of ξ vector representations for both source-side and target-side context words of r. In our model, the context set C(r) is assembled differently depending on the number of nonterminals included in the rule: C_out(r, n) denotes the n context words (n-grams) around r, and C_in(r, n, X_k) denotes the n boundary words (n-grams) covered by nonterminal X_k in r. The context words used for a translation rule r with K nonterminals are as follows.
We can see that rules with different numbers of nonterminals K use different context words. For example, if r does not contain nonterminals, then C_in is not used. Moreover, we use more context words surrounding the rule (C_out(r, 6)) for rules with K = 0 than for rules that contain nonterminals (C_out(r, 4) for K = 1 and C_out(r, 2) for K > 1). This is based on the intuition that rules with K = 0 can only use the context words surrounding the rule as information for rule selection, hence this information is more important than for other rules. Figure 2 gives an example of the context words when applying a rule r to the example sentence "The blue rug on the floor of your apartment is really cute", marking the area covered by r and the area covered by x0 in r.
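A possible implementation of this context selection scheme is sketched below; the exact placement of the outside window (split evenly to the left and right of the rule) and the choice of first/last token as boundary words are assumptions of this sketch, not details specified by the model.

```python
def context_words(sentence, rule_span, nonterminal_spans):
    """Select source-side context for a rule, following the scheme in the
    text: a 6-word outside window for K = 0, 4 for K = 1, 2 for K > 1,
    plus boundary words covered by each nonterminal. Spans are
    (start, end) token indices with end exclusive."""
    k = len(nonterminal_spans)
    n_out = 6 if k == 0 else (4 if k == 1 else 2)
    s, e = rule_span
    half = n_out // 2
    # Outside context: words immediately left and right of the rule span.
    c_out = sentence[max(0, s - half):s] + sentence[e:e + half]
    # Inside context: first and last word covered by each nonterminal.
    c_in = [[sentence[a], sentence[b - 1]] for a, b in nonterminal_spans]
    return c_out, c_in

sent = "the blue rug on the floor is really cute".split()
c_out, c_in = context_words(sent, (1, 4), [(2, 4)])  # K = 1 rule
```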
Note that we use target-side context because source-side context alone is not enough for selecting correct rules. Since it is not uncommon for one source sentence to have several different correct translations, a translation rule used in one correct derivation may be incorrect for other derivations. In these cases, target-side context is useful for selecting appropriate translation rules. The vector representations for t̃, ẽ and C are obtained by using a projection matrix to project each one-hot input into a real-valued embedding vector. This projection is another key advantage over the MERS model: because the CSRS model learns one unified model for all rules, it can share all training data to learn better vector representations of words and rules, and the similarities between vectors can be used to generalize in cases such as the "thief/gunman" example in the introduction.
After calculating the projections, two hidden layers are used to combine all inputs. Finally, the neural network has two outputs, Pr(v = 1 | ẽ, t̃, C) and Pr(v = 0 | ẽ, t̃, C).
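A toy forward pass with this structure can be sketched as follows. This is pure Python with tanh hidden layers; the paper does not specify the activation function, and the dimensions here are far smaller than the 50-dimensional embeddings and 100 hidden nodes used in the experiments.

```python
import math
import random

random.seed(0)

def rand_matrix(rows, cols):
    """Small random weight matrix (toy initialization)."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def ffnn_forward(input_ids, params):
    """Binary rule selection network: project one-hot inputs to embeddings,
    apply two tanh hidden layers, then a 2-way softmax over v in {0, 1}."""
    emb, w1, w2, w_out = params
    # Concatenate embeddings of t~, e~, and the context words.
    x = [v for i in input_ids for v in emb[i]]
    h1 = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    h2 = [math.tanh(sum(w * hi for w, hi in zip(row, h1))) for row in w2]
    logits = [sum(w * hi for w, hi in zip(row, h2)) for row in w_out]
    z = sum(math.exp(l) for l in logits)
    return [math.exp(l) / z for l in logits]  # [Pr(v=0|..), Pr(v=1|..)]

vocab_size, dim, n_ctx = 10, 4, 3  # toy sizes
emb = rand_matrix(vocab_size, dim)
params = (emb,
          rand_matrix(8, (2 + n_ctx) * dim),  # hidden layer 1
          rand_matrix(8, 8),                  # hidden layer 2
          rand_matrix(2, 8))                  # output layer
probs = ffnn_forward([1, 2, 3, 4, 5], params)  # ids for t~, e~, 3 context words
```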
To train the CSRS model, we need both positive and negative training examples. Positive examples (ẽ, t̃, C, 1) can be extracted directly from the parallel corpus. For each positive example, we generate one negative example (ẽ', t̃, C, 0), where ẽ' is randomly generated according to the translation distribution (Zhang et al., 2015),

Pr(ẽ' | t̃) = Count(ẽ', t̃) / ∑_ẽ Count(ẽ, t̃),

where Count(ẽ, t̃) is how many times t̃ is translated into ẽ in the parallel corpus. During translation, following the MERS model, the CSRS model only calculates probabilities for rules with ambiguous source subtrees. These predictions are converted into two CSRS features for the translation system, similar to the two MERS features in Section 2: one is the product of the probabilities calculated by the CSRS model, and the other is a penalty feature counting how many rules with ambiguous source subtrees are contained in one translation.
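This negative sampling step can be sketched as below; excluding the positive target string itself from the candidates is an assumption of the sketch, and the counts are invented for illustration.

```python
import random
from collections import Counter

def sample_negative(counts, positive_e, rng=random):
    """Draw a negative target string e~' for one positive example, with
    probability proportional to Count(e~', t~) in the parallel corpus.
    Excluding the positive target itself is an assumption of this sketch."""
    candidates = [e for e in counts if e != positive_e]
    weights = [counts[e] for e in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Hypothetical translation counts for one source subtree t~.
counts = Counter({"caught the thief": 8, "arrested the thief": 2, "got the thief": 1})
random.seed(7)
neg = sample_negative(counts, "caught the thief")
```

Resampling negatives every epoch, as described in Section 4, simply means calling this for each positive example anew.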

Usage of Minimal Rules
Despite the fact that the CSRS model can share information among instances using distributed word representations, it still poses an extremely sparse learning problem. Specifically, the numbers of unique subtrees t̃ and strings ẽ are extremely large, and many may only appear a few times in the corpus. To reduce these problems of sparsity, we propose another improvement to the model, specifically the use of minimal rules. Minimal rules (Galley et al., 2004) are translation rules that cannot be split into two smaller rules. For example, in Figure 3, Rule2 is not a minimal rule, since it can be split into Rule1 and Rule3. In the same way, Rule4 and Rule6 are not minimal, while Rule1, Rule3 and Rule5 are minimal.
Minimal rules are more frequent than non-minimal rules and have richer training data. Hence, we can expect that a rule selection model trained on minimal rules will suffer less from data sparsity problems. Moreover, without non-minimal rules, the rule selection model needs less memory and can be trained faster.
To take advantage of this fact, we train another version of the CSRS model (CSRS-MINI) over only minimal rules. The probability of a non-minimal rule is then calculated as the product of the probabilities of the minimal rules contained therein.
Note that for both the standard CSRS and CSRS-MINI models, we use the same baseline translation system, which can use non-minimal translation rules. The CSRS-MINI model breaks the translation rules used in a translation down into minimal rules and multiplies all of their probabilities to calculate the necessary features.
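This decomposition can be sketched as follows, working in log space for numerical stability; the minimal-rule probabilities are invented for illustration.

```python
import math

def nonminimal_log_prob(minimal_probs):
    """CSRS-MINI score for a non-minimal rule: the product of the
    probabilities of the minimal rules it decomposes into, computed
    here as a sum of logs."""
    return sum(math.log(p) for p in minimal_probs)

# Hypothetical: a non-minimal rule decomposed into three minimal rules.
log_p = nonminimal_log_prob([0.6, 0.9, 0.8])
```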

Setting
We evaluated the proposed approach on English-to-German (ED), English-to-French (EF), English-to-Chinese (EC) and English-to-Japanese (EJ) translation tasks. For the ED and EF tasks, the translation systems were trained on the Europarl v7 parallel corpus and tested on the WMT 2015 translation task. (The WMT tasks provided other training corpora; we used only the Europarl corpus, because training a large-scale system on the whole data set requires large amounts of time and computational resources.) The test sets of the WMT 2014 translation task were used as development sets in our experiments. For the EC and EJ tasks, we used the datasets provided for the patent machine translation task at NTCIR-9 (Goto et al., 2011). (Note that NTCIR-9 only contained a Chinese-to-English translation task. Because we wanted to test the proposed approach with a similarly accurate parsing model across our tasks, we used English as the source language in our experiments. In NTCIR-9, the development and test sets were both provided for the CE task, while only the test set was provided for the EJ task; therefore, we used the sentences from the NTCIR-8 EJ and JE test sets as the development set.) The detailed statistics for the training, development and test sets are given in Table 1. Word segmentation was done by BaseSeg (Zhao et al., 2006) for Chinese and MeCab for Japanese.

For each translation task, we used Travatar (Neubig, 2013) to train a forest-to-string translation system. GIZA++ (Och and Ney, 2003) was used for word alignment. A 5-gram language model was trained on the target side of the training corpus using the IRST-LM Toolkit with modified Kneser-Ney smoothing. Rule extraction was performed using the GHKM algorithm (Galley et al., 2006), and the maximum numbers of nonterminals and terminals contained in one rule were set to 2 and 10 respectively. Note that when extracting minimal rules, we lift this limit. The decoding algorithm is the bottom-up forest-to-string decoding algorithm of Mi et al. (2008). For English parsing, we used Egret, which is able to output packed forests for decoding. We trained the CSRS models (CSRS and CSRS-MINI) on translation rules extracted from the training set. Translation rules extracted from the development set were used as validation data during model training to avoid over-fitting. For each training epoch, we resampled the negative examples for each positive example to make use of different negative examples. The embedding dimension was set to 50 and the number of hidden nodes to 100. The initial learning rate was set to 0.1 and was halved each time the validation likelihood decreased.
The number of epochs was set to 20. A model was saved after each epoch, and the model with the highest validation likelihood was used in the translation system.
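The training schedule described above can be sketched as follows; the per-epoch validation log-likelihood values are invented for illustration.

```python
def train_schedule(val_likelihoods, init_lr=0.1):
    """Replay the schedule from the text: halve the learning rate whenever
    validation likelihood decreases, and keep the epoch with the highest
    validation likelihood. Returns (best_epoch, per-epoch learning rates)."""
    lr, best_epoch, best_ll = init_lr, 0, float("-inf")
    lrs = []
    for epoch, ll in enumerate(val_likelihoods):
        if epoch > 0 and ll < val_likelihoods[epoch - 1]:
            lr /= 2.0
        lrs.append(lr)
        if ll > best_ll:
            best_ll, best_epoch = ll, epoch
    return best_epoch, lrs

# Hypothetical validation log-likelihoods for five epochs.
best, lrs = train_schedule([-3.0, -2.5, -2.7, -2.4, -2.6])
```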
We implemented the MERS model for comparison; training instances for the model were extracted from the training set. Following the original work, the number of iterations was set to 100 and the Gaussian prior to 1. We also compared the original MERS model with the MERS model trained only on minimal rules (MERS-MINI) to test the benefit of using minimal rules for model training.
The MERS and CSRS models were both used to calculate features for reranking the unique 1,000-best outputs of the baseline system. Tuning was performed to maximize BLEU score using minimum error rate training (Och, 2003).

Results

Table 2 shows the translation results and Table 3 shows significance test results using bootstrap resampling (Koehn, 2004): "Base" stands for the baseline system without any rule selection model; "MERS", "CSRS", "MERS-MINI" and "CSRS-MINI" mean the outputs of the baseline system were reranked using features from the MERS, CSRS, MERS-MINI and CSRS-MINI models respectively. Generally, the CSRS model outperformed the MERS model and the CSRS-MINI model outperformed the MERS-MINI model across the translation tasks. In addition, using minimal rules for model training benefitted both the MERS and CSRS models.

Table 4 shows translation examples from the EC task that demonstrate why our approach improves accuracy, for the source sentence "typical dynamic response rate of an optical gap sensor as described above is approximately 2 khz , or 0.5 milliseconds ." Among all translations, T_CSRS-MINI is basically the same as the reference, with only a few paraphrases that do not alter the meaning of the sentence. In contrast, T_Base, T_MERS, T_CSRS and T_MERS-MINI all contain apparent mistakes. For example, the source phrase "optical gap sensor" (marked by gray shadows in Table 4) is wrongly translated in T_Base, T_MERS, T_CSRS and T_MERS-MINI due to incorrect reorderings. Table 5 shows the rules used to translate the source word "optical" in the different translations (shadowed rules, here R_3, are ambiguous): R_1 is used in T_MERS and T_CSRS; R_2 is used in T_MERS-MINI; R_3 is used in T_CSRS-MINI. Although the source word "optical" is translated to the correct translation "光学(optical)" in all translations, R_1, R_2 and R_3 cause different reorderings of the source phrase "optical gap sensor": R_3 reorders this source phrase correctly, while R_1 and R_2 cause wrong reorderings.
We can see that R_1 is unambiguous, so the MERS and CSRS models will give probability 1 to R_1, which could make the MERS and CSRS models prefer T_MERS and T_CSRS. This is a typical translation error caused by sparse rules, since the source subtree in R_1 does not have other translations in the training corpus.
To compare the MERS-MINI and CSRS-MINI models, Table 6 shows the minimal rules (R_2a, R_2b, R_3a and R_3b) contained in R_2 and R_3, and Table 7 shows the probabilities of these minimal rules calculated by the MERS-MINI and CSRS-MINI models respectively. We can see that the CSRS-MINI model gave higher scores to the correct translation rules R_3a and R_3b than the MERS-MINI model, while the MERS-MINI model gave a higher score to the incorrect rule R_2b than the CSRS-MINI model.
Note that R_2b and R_3b are the same rule, but the target-side context in T_MERS-MINI and T_CSRS-MINI is different. The CSRS-MINI model gives R_2b and R_3b different scores because it uses target-side context, whereas the MERS-MINI model only uses source-side features and gives R_2b and R_3b the same score. The minimal rules in Table 6 are:

R_2a: PP ( IN ( "of" ) NP ( NP ( DT ( "an" ) NP' ( x0:JJ x1:NP' ) ) x2:SBAR ) ) → x1 "的" x0 x2 "的"
R_2b: JJ ( "optical" ) → "光学(optical)"
R_3a: NP' ( x0:JJ x1:NP' ) → x0 x1
R_3b: JJ ( "optical" ) → "光学(optical)"

The fact that the CSRS-MINI model gave a higher score to R_3b than to R_2b means that it predicted the target string in R_2b and R_3b to be a good translation in the context of T_CSRS-MINI.

Analysis
To analyze the influence of different features, we trained the MERS model using source-side and target-side n-gram lexical features similar to those of the CSRS model. When using this feature set, the performance of the MERS model dropped significantly. This indicates that the syntactic, POS and span features used in the original MERS model are important for that model, since these features can generalize better. Purely lexical features are less effective due to sparsity problems: one maximum entropy classifier is trained for each ambiguous source subtree, and the training data for each classifier is quite limited. In contrast, the CSRS model is trained in a continuous space and does not split the training data, which relieves the sparsity problem of lexical features. As a result, the CSRS model achieved better performance than the MERS model using only lexical features. We also tried using pre-trained word embedding features for the MERS model, but this did not improve its performance, which indicates that the log-linear model cannot benefit from distributed representations as well as the neural network model can. We also tried reranking with both the CSRS and MERS models added as features, but this did not achieve further improvement compared to using only the CSRS model. This indicates that although the two models use different types of features, the information contained in these features is similar. For example, the POS features used in the MERS model and the distributed representations used in the CSRS model both serve better generalization.
In addition, using both the CSRS and CSRS-MINI models did not improve over using only the CSRS-MINI model in our experiments. There are two main differences between the CSRS and CSRS-MINI models. First, minimal rules are more frequent and have more training data than non-minimal rules, which is why the CSRS-MINI model is more robust than the CSRS model. Second, non-minimal rules contain more information than minimal rules. For example, in Figure 3, Rule4 contains more information than Rule1, which could be an advantage for rule selection. However, the information contained in Rule4 is captured as context features for Rule1, so this is no longer an advantage as long as rich enough context features are used. This could be the reason why using both the CSRS and CSRS-MINI models cannot further improve translation quality compared to using only the CSRS-MINI model.

Related Work
The rule selection problem for syntax-based SMT has received much attention. A lexicalized rule selection model has been proposed to perform context-sensitive rule selection for hierarchical phrase-based translation. Cui et al. (2010) introduced a joint rule selection model for hierarchical phrase-based translation, which also approximated the rule selection problem by a binary classification problem, as in our approach. However, these two models adopted linear classifiers similar to those used in the MERS model, which suffer more from the data sparsity problem compared to the CSRS model.
There is also existing work that exploited neural networks to learn translation probabilities for translation rules used in the phrase-based translation model; namely, these methods estimated translation probabilities for phrase pairs extracted from the parallel corpus. Schwenk (2012) proposed a continuous space translation model, which calculated the translation probability for each word in the target phrase and then multiplied the probabilities together as the translation probability of the phrase pair. Gao et al. (2014) and Zhang et al. (2014) proposed methods to learn continuous space phrase representations and use the similarity between the source and target phrases as translation probabilities for phrase pairs. All three of these methods can only be used for the phrase-based translation model, not for syntax-based translation models.
There is also work that used minimal rules for modeling. Vaswani et al. (2011) proposed a rule Markov model using minimal rules for both training and decoding, achieving a slimmer model and a faster decoder with performance comparable to using non-minimal rules. Durrani et al. (2013) proposed a method for phrase-based SMT that models with minimal translation units and decodes with phrases to improve translation performance. Neither of these methods uses distributed representations, as our model does, for better generalization.
In addition, neural machine translation (NMT) has shown promising results recently (Sutskever et al., 2014; Bahdanau et al., 2014; Luong et al., 2015a; Jean et al., 2015; Luong et al., 2015b). NMT uses a recurrent neural network to encode the whole source sentence and then produce the target words one by one. These models can be trained on parallel corpora and do not need word alignments to be learned in advance. There are also neural translation models that are trained on word-aligned parallel corpora (Devlin et al., 2014; Meng et al., 2015; Zhang et al., 2015; Setiawan et al., 2015), which use the alignment information to decide which parts of the source sentence are more important for predicting one particular target word. All of these models are trained on plain source and target sentences without considering any syntactic information, while our neural model learns rule selection for tree-based translation rules and makes use of the tree structure of natural language for better translation. There is also a new syntactic NMT model (Eriguchi et al., 2016), which extends the original sequence-to-sequence NMT model with source-side phrase structure. Although this model takes source-side syntax into consideration, it still produces target words one by one as a sequence. In contrast, the tree-based translation rules used in our model can take advantage of the hierarchical structures of both the source and target languages.

Conclusion
In this paper, we propose a CSRS model for syntax-based SMT, which is learned by a feed-forward neural network on a continuous space. Compared with the previous MERS model that used discrete representations of words as features, the CSRS model uses real-valued vector representations of words and can exploit similarity information between words for better generalization. In addition, we propose to use only minimal rules for rule selection to further relieve the data sparsity problem, since minimal rules are more frequent and have richer training data. In our experiments, the CSRS model outperformed the previous MERS model, and the usage of minimal rules benefitted both the CSRS and MERS models on different translation tasks.
For future work, we will explore more sophisticated features for the CSRS model, such as syntactic dependency relationships and head words, since only simple lexical features are used in the current incarnation.