Rule Selection with Soft Syntactic Features for String-to-Tree Statistical Machine Translation

In syntax-based machine translation, rule selection is the task of choosing the correct target side of a translation rule among rules with the same source side. We define a discriminative rule selection model for systems that have syntactic annotation on the target language side (string-to-tree). This is a new and clean way to integrate soft source syntactic constraints into string-to-tree systems as features of the rule selection model. We release our implementation as part of Moses.


Introduction
Syntax-based machine translation is well known for its ability to handle non-local reordering. Syntax-based models either use linguistic annotation on the source language side (Huang, 2006; Liu et al., 2006), on the target language side (Galley et al., 2004; Galley et al., 2006) or are syntactic in a structural sense only (Chiang, 2005). Recent shared tasks have shown that systems integrating information on the target language side, also called string-to-tree systems, achieve the best performance on several language pairs (Bojar et al., 2014). At the same time, soft syntactic features significantly improve the translation quality of hierarchical systems (Hiero), as shown by Marton et al. (2012), Chiang (2010), Liu et al. (2011) and Cui et al. (2010). Improving the performance of string-to-tree systems through the integration of soft syntactic constraints on the source language side is therefore an interesting task.
So far, all approaches on this topic add soft syntactic constraints to the rules of string-to-tree (Zhang et al., 2011; Huck et al., 2014) or string-to-dependency (Huang et al., 2013) systems and define heuristics to determine to what extent these constraints match the syntactic structure of the source sentence. We propose a novel way to integrate soft syntactic constraints into a string-to-tree system: we define a discriminative rule selection model for string-to-tree machine translation. We consider rule selection as a multi-class classification problem where the task is to select the correct target side of a rule given its source side as well as contextual information about the source sentence and the considered rule. So far, such models have been applied to systems without syntactic annotation on the target language side. He et al. (2010) and Cui et al. (2010), among others, apply such rule selection models to hierarchical machine translation; they have also been applied to tree-to-string systems and, by Zhai et al. (2013), to systems based on predicate argument structures. The same type of model can be used across these approaches without target annotations, but when target side syntactic annotations are taken into account, the task of rule selection has to be reformulated (see Section 2). This work is the first attempt to define a rule selection model for a string-to-tree system. We make our implementation publicly available as part of Moses. We show in Section 2 that string-to-tree rule selection is different from the hierarchical case addressed by previous work and define our rule selection model. In Section 3 we present the training procedure before providing a proof-of-concept evaluation in Section 4.
2 Rule selection for string-to-tree SMT

2.1 String-to-tree machine translation
We present string-to-tree machine translation as implemented in Moses, the framework that we use. String-to-tree rules have the form X/A → ⟨α, γ, ∼⟩. On the source language side, all non-terminals have the unique label X, while on the target language side non-terminals are annotated with syntactic labels n_t ∈ N_t. The left-hand side X/A consists of source and target non-terminals. In the right-hand side (rhs), α is a string of source terminal symbols and the non-terminal X. The string γ consists of target terminals and non-terminals n_t ∈ N_t. The alignment ∼ is a one-to-one correspondence between source and target non-terminal symbols. String-to-tree rules are extracted from pairs of strings and trees as exemplified in Figure 1. Rules r1 and r2 are example rules extracted from this data.
[Figure text: Ces cellules présentent plusieurs caractéristiques spécifiques]
(r1) X/NP → ⟨X1 caractéristiques X2, JJ1 JJ2 characteristics⟩
(r2) X/NP → ⟨X1 caractéristiques X2, NNS1 characteristic JJ2⟩
During decoding, CYK+ chart parsing (Chappelier et al., 1998) with cube pruning and language model scoring is performed on an input sentence such as F below. Each time a rule is applied to the input sentence, candidate target trees are built. Figure 2 shows the partial translations built after the segments Diverses and importantes have been decoded. Given these partial translations, rule r1 can be applied in a further decoding step.

String-to-tree rule selection
Rule selection is the problem of selecting the rule with the correct target side among rules with the same source side. For hierarchical machine translation (Hiero), the rule selection problem consists of choosing, between r3 and r4, the rule that correctly applies to F (r3 in our example).
Rule selection models disambiguate between these rules using context information about the source sentence and the shape of the rules.
In string-to-tree machine translation, the rule selection problem is different. Because the decoding process is guided by target side syntactic annotation, partial trees built during decoding must be considered when new rules are applied. For instance, when a rule is selected to translate sentence F given the partial translations in Figure 2, the non-terminals in the target side of this rule must match the constituents selected so far. Consequently, rules r1 and r2 (Section 2.1) do not compete during rule selection. Competing rules for r1 would be r5 and r6 below.
For consistency with decoding, we redefine the rule selection problem for the string-to-tree case. In this setup, it is the task of disambiguating rules with the same source side and the same aligned target non-terminals. As a consequence, our rule selection model (presented next) is not only normalized over the source rhs of the rules but also takes target non-terminals into account. The default rule scoring procedure for string-to-tree rules implemented in Moses uses the same normalization as we do. However, Williams and Koehn (2012) propose to normalize string-to-tree rules over the source rhs only.
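The redefined competition criterion can be illustrated with a minimal Python sketch that groups rules by source side and aligned target non-terminals. The function names, the tuple layout and the rule labeled h1 are hypothetical and introduced only for this illustration; the sketch also assumes that indexed non-terminals look like an uppercase label followed by a digit, which would not cover every tag set.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

Rule = Tuple[str, str, str]  # (rule id, source side, target side)
GroupKey = Tuple[str, Tuple[str, ...]]  # (source side, aligned target non-terminals)


def aligned_nonterminals(target: str) -> Tuple[str, ...]:
    """Return the indexed target non-terminals, ordered by their alignment index."""
    indexed = [tok for tok in target.split() if re.fullmatch(r"[A-Z]+\d+", tok)]
    return tuple(sorted(indexed, key=lambda tok: int(re.search(r"\d+$", tok).group())))


def competing_groups(rules: List[Rule]) -> Dict[GroupKey, List[str]]:
    """Group rule ids that share the source side and the aligned target non-terminals."""
    groups: Dict[GroupKey, List[str]] = defaultdict(list)
    for rule_id, source, target in rules:
        groups[(source, aligned_nonterminals(target))].append(rule_id)
    return dict(groups)


SOURCE = "X1 caractéristiques X2"
RULES = [
    ("r1", SOURCE, "JJ1 JJ2 characteristics"),
    ("r2", SOURCE, "NNS1 characteristic JJ2"),
    ("h1", SOURCE, "JJ1 JJ2 features"),  # hypothetical competitor, not from the paper
]
GROUPS = competing_groups(RULES)
```

Under these assumptions r1 and h1 end up in the same group, while r2 forms a group of its own, mirroring the observation that r1 and r2 do not compete.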

Rule selection model
We denote string-to-tree rules with X/A → ⟨α, γ, ∼⟩, as in Section 2.1. By Ñ_t, we denote target non-terminals with their alignment to source non-terminals.³ C(f, α) is context information in the source sentence f and the source side α. R(α, γ) represents features on string-to-tree rules. The rule selection model estimates P(γ | C(f, α), R(α, γ), α, Ñ_t) and is normalized over the set G of candidate target sides γ for a given α and Ñ_t. The function GTO : α → G generates, given the source side α and target non-terminals Ñ_t, the set G of all corresponding target sides γ. The estimated distribution can be written as: [equation missing]. In the same fashion as Cui et al. (2010) do for the hierarchical case, we define a global rule selection model instead of a model that is local to the source side of each rule.
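As a hedged illustration of the normalization described above, one form the distribution could take is sketched below. It assumes a log-linear parameterization; the feature functions h_i and weights λ_i are assumptions introduced here and are not notation taken from the text.

```latex
P\big(\gamma \mid C(f,\alpha), R(\alpha,\gamma), \alpha, \tilde{N}_t\big)
  = \frac{\exp\Big(\sum_i \lambda_i \, h_i\big(C(f,\alpha), R(\alpha,\gamma)\big)\Big)}
         {\sum_{\gamma' \in GTO(\alpha)} \exp\Big(\sum_i \lambda_i \, h_i\big(C(f,\alpha), R(\alpha,\gamma')\big)\Big)}
```

The sum in the denominator ranges over the candidate set G = GTO(α), that is, over target sides that share both the source side α and the aligned target non-terminals Ñ_t.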
To illustrate the feature templates C(f, α) and R(α, γ) of our rule selection model, we suppose that rule r1 has been extracted from the French sentence in Figure 3. The syntactic features are:
- Does α match a constituent: no match
- Type of matched constituent: None
- Lowest parent of unmatched constituent: NP
- Span width covered by α: 3
The rule internal features are:
- Source side α: X1 caractéristiques X2 (one feature)
- Target side γ: JJ1 JJ2 characteristics
- Aligned terminals in α and γ: caractéristiques↔characteristics
- Aligned non-terminals in α and γ: X1↔JJ1 X2↔JJ2
- Best baseline translation probability: Most Frequent
Our rule selection model is integrated in the Moses string-to-tree system as an additional feature of the log-linear model.
³ For rules r1, r5 and r6, Ñ_t would be JJ1 and JJ2.
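The four syntactic feature templates listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the released implementation: the function name, the span encoding and the toy constituent list are hypothetical, and "lowest parent" is interpreted here as the smallest constituent covering the span when there is no exact match.

```python
from typing import Dict, List, Optional, Tuple

# A constituent is (label, start, end) with an inclusive token span.
Constituent = Tuple[str, int, int]


def source_syntax_features(
    constituents: List[Constituent], start: int, end: int
) -> Dict[str, object]:
    """Compute soft syntactic features for the source span [start, end] covered by alpha."""
    # Exact match: a constituent whose span equals the span of alpha.
    matched: Optional[str] = next(
        (label for label, s, e in constituents if (s, e) == (start, end)), None
    )
    # Without an exact match, take the smallest constituent that covers the span.
    covering = [(e - s, label) for label, s, e in constituents if s <= start and end <= e]
    lowest_parent = min(covering)[1] if covering and matched is None else None
    return {
        "match": matched is not None,
        "matched_type": matched,
        "lowest_parent": lowest_parent,
        "span_width": end - start + 1,
    }


# Hypothetical constituent spans over a 7-token sentence (not the parse of Figure 3).
TOY_PARSE = [("S", 0, 6), ("NP", 0, 1), ("VP", 2, 6), ("NP", 3, 6), ("AP", 5, 6)]
FEATURES = source_syntax_features(TOY_PARSE, 3, 5)
```

With this toy parse, the span from token 3 to token 5 matches no constituent, its lowest covering constituent is an NP and its width is 3, which reproduces the feature values of the example above.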

Model Training
We create training examples using the rule extraction procedure of Williams and Koehn (2012). We begin by generating a rule table using this procedure. Then, each time a rule r : X/A → ⟨α, γ, ∼⟩ can be extracted from the training data, we generate a new training example. The target side γ of the extracted rule is a positive instance and gets a cost of 0. To generate negative samples, we collect all rules r_2, ..., r_n that have the same source language side as r as well as the same aligned target non-terminals Ñ_t. Each of these rules is a negative example and gets a cost of 1. As an example, suppose that rule r1 introduced in Section 2.1 has been extracted from the training example in Figure 1. The target side "JJ1 JJ2 characteristics" is the correct class and gets a cost of 0. The target sides of all other rules having the same source side and aligned target non-terminals, such as rules r5 and r6, are incorrect classes.
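The cost assignment described above can be sketched as follows. The rule table layout and the function name are hypothetical, and the two alternative target sides are placeholders standing in for competing rules; they are not taken from the paper. How the resulting pairs are serialized for the classifier is left open.

```python
from typing import Dict, List, Tuple

# Key of a competing set: (source side, aligned target non-terminals).
Key = Tuple[str, Tuple[str, ...]]


def make_example(
    rule_table: Dict[Key, List[str]], key: Key, observed_target: str
) -> List[Tuple[str, int]]:
    """Return (target side, cost) pairs: cost 0 for the extracted target, 1 for all others."""
    return [(target, 0 if target == observed_target else 1) for target in rule_table[key]]


KEY: Key = ("X1 caractéristiques X2", ("JJ1", "JJ2"))
RULE_TABLE: Dict[Key, List[str]] = {
    KEY: [
        "JJ1 JJ2 characteristics",  # target side of r1
        "JJ1 JJ2 features",         # hypothetical competitor
        "JJ1 JJ2 properties",       # hypothetical competitor
    ]
}
EXAMPLE = make_example(RULE_TABLE, KEY, "JJ1 JJ2 characteristics")
```

Each extraction event yields one such multi-class example, so a target side that is observed often contributes many cost-0 instances.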
For model training, we use the cost-sensitive one-against-all reduction (Beygelzimer et al., 2005) of Vowpal Wabbit (VW). We avoid overfitting to the training data by employing early stopping once classifier accuracy decreases on a held-out dataset.
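The early stopping criterion can be sketched generically as below. This is not VW's own mechanism: the two callbacks and the function name are hypothetical stand-ins for one training pass and for a held-out accuracy measurement.

```python
from typing import Callable


def train_with_early_stopping(
    train_pass: Callable[[], None],
    heldout_accuracy: Callable[[], float],
    max_passes: int,
) -> int:
    """Run training passes until held-out accuracy decreases; return the passes kept."""
    previous = float("-inf")
    passes_kept = 0
    for _ in range(max_passes):
        train_pass()
        accuracy = heldout_accuracy()
        if accuracy < previous:
            # Accuracy dropped: stop and discard this pass. A real setup would
            # restore the model saved after the previous pass.
            break
        previous = accuracy
        passes_kept += 1
    return passes_kept


# Simulated held-out accuracies for successive passes (illustrative values only).
_accuracies = iter([0.60, 0.70, 0.65, 0.80])
PASSES = train_with_early_stopping(lambda: None, lambda: next(_accuracies), max_passes=10)
```

In the simulated run, training stops at the third pass because accuracy falls from 0.70 to 0.65, so two passes are kept.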

Experimental Setup
Our baseline system is a syntax-based system with linguistic annotation on the target language side (string-to-tree). We use the version implemented in the Moses open source toolkit (Hoang et al., 2009; Williams and Koehn, 2012) with standard parameters. Rule extraction is performed as in Galley et al. (2004) with rule composition (Galley et al., 2006; DeNeefe et al., 2007). Non-lexical unary rules are removed (Chung et al., 2011) and scope-3 pruning (Hopkins and Langmead, 2010) is performed. Rule scoring is done using relative frequencies normalized over the source rhs and aligned non-terminals in the target rhs. The contrastive system is the same string-to-tree system but augmented with our rule selection model as a feature of the log-linear model.
Table 2: String-to-tree system evaluation results.
We evaluate the baseline and our global model on three domains: (1) news, (2) medical, and (3) science. The training data for news is taken from Europarl-v4. Development and test sets are from the news translation task of WMT 2009 (Callison-Burch et al., 2009). For medical we use the biomedical data from EMEA (Tiedemann, 2009). Since this is a parallel corpus only, we first removed duplicate sentences and then constructed development and test sets by randomly selecting sentence pairs. As training data for science we use the scientific abstracts data provided by Carpuat et al. (2013). Table 1 gives an overview of the corpus sizes.
The Berkeley parser (Petrov et al., 2006) is used to parse the English side of each parallel corpus (for string-to-tree rule extraction) as well as the French source side (for feature extraction). We trained a 5-gram language model on the English side of each training corpus using the SRI Language Modeling Toolkit (Stolcke, 2002). We trained the model in the standard way and generated word alignments using GIZA++. After training, we reduced the number of translation rules by keeping, for each source side, only the 30 best rules according to the direct rule translation probability. Our rule selection model was trained with VW. All systems were tuned using batch MIRA (Cherry and Foster, 2012). We measured the overall translation quality with 4-gram BLEU (Papineni et al., 2002), which was computed on tokenized and lowercased data for all systems. Statistical significance is computed with the pairwise bootstrap resampling technique of Koehn (2004).
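The n-best filtering of the rule table can be sketched in a few lines of Python. This is a minimal illustration, not the Moses tooling: the triple layout, the function name and the probabilities are hypothetical.

```python
import heapq
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

ScoredRule = Tuple[str, str, float]  # (source side, target side, direct translation probability)


def prune_n_best(rules: Iterable[ScoredRule], n: int = 30) -> Dict[str, List[str]]:
    """Keep, for each source side, the n target sides with the highest direct probability."""
    by_source: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for source, target, prob in rules:
        by_source[source].append((prob, target))
    return {
        source: [target for _, target in heapq.nlargest(n, candidates)]
        for source, candidates in by_source.items()
    }


# Toy rule table with made-up probabilities, pruned to the 2 best rules per source side.
TOY_RULES = [("a", "x", 0.5), ("a", "y", 0.3), ("a", "z", 0.2), ("b", "w", 1.0)]
PRUNED = prune_n_best(TOY_RULES, n=2)
```

Note that the sketch prunes per source side only, as the sentence above states, so it does not distinguish between different aligned target non-terminals.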

Results
Table 2 displays the BLEU scores for our experiments. On science and news, small improvements are achieved while for medical a small decrease is observed. None of these differences is statistically significant.
An analysis of the system outputs for each domain suggested that the improvements are small because string-to-tree systems offer too little ambiguity between competing rules during decoding. To support this conjecture, we first analyzed rule diversity by looking at the negative samples collected during training example acquisition. In a second step, we compared the results of the string-to-tree systems in Table 2 with a system where the translation rules are much more ambiguous. To this end, we applied our approach to a hierarchical system along the same lines as Cui et al. (2010). Finally, we further tested the ability of our system to disambiguate between competing rules by training a model on the concatenation of all domains.

Analysis of Rule Diversity
The number of competing rules during decoding can be estimated by looking at the negative samples collected for each training example. This analysis showed that the diversity of rules containing non-terminal symbols is limited. We present rules q1 to q3 (taken from science) to illustrate the poor diversity observed in our training examples.
(q1) X/PP → ⟨à X1 X2 éventail X3, to DT1 JJ2 variety PP3⟩
(q2) X/PP → ⟨à X1 X2 éventail X3, to DT1 JJ2 range PP3⟩
(q3) X/PP → ⟨à X1 X2 éventail X3, to DT1 JJ2 array PP3⟩
Rules q1 to q3 are the only rules with source side à X1 X2 éventail X3. This number is very low given that the source side contains three non-terminal symbols, two of which are adjacent. Moreover, the difference between these rules is limited to the lexical translation of éventail. This lack of diversity is due to the constraint that competing string-to-tree rules must have the same aligned non-terminal symbols, which is taken into account when collecting negative samples. In other words, the ambiguity between translation rules in a string-to-tree system is heavily restricted by the target side syntax. The observed lack of diversity could be reduced by allowing rules with the same source rhs to have different aligned target non-terminals. From this perspective, rule scoring should be done by normalizing over the source rhs only, as in Williams and Koehn (2012). The rule selection model in Section 2.3 should then be redefined and normalized over all rules with the same source rhs. Another way to improve rule diversity would be to remove target non-terminals and use preference grammars as in Huck et al. (2014).

Comparison with Hierarchical Rule Selection
We applied our approach in a hierarchical phrase-based setting (Hiero). To this end, we trained three Hiero baseline systems and three Hiero systems augmented with our rule selection model on the data given in Section 4.1. The results of these experiments are shown in Table 3. Our augmented system largely outperforms the baselines. Interestingly, hierarchical rule selection significantly helps on the medical and science domains but still yields results that are significantly lower than those of the string-to-tree systems. This indicates that systems with target side syntax disambiguate better than hierarchical models with improved rule selection. Overall, we find the results of both types of systems promising and we will consider how to introduce more diversity into the rules of string-to-tree systems.

Concatenation of Training Data
In order to further evaluate the ability of our model to disambiguate string-to-tree rules, we trained a system using the concatenated training data of all three domains as presented in Section 4.1. This global model was then used to tune and decode using the development and test data of each domain. The results in Table 4 show that even on concatenated data our rule selection model does not improve over the baseline.
Table 4: String-to-tree system evaluation results with concatenated training data.

Conclusion and future work
We presented the first attempt to define a rule selection model with syntactic features for string-to-tree machine translation. We have shown that in order to be applied to the string-to-tree case, the rule selection problem must be redefined. An extensive evaluation on French-English translation tasks for different domains has shown that rule selection does not significantly improve string-to-tree systems in our setting. An analysis of rule diversity and an empirical comparison with hierarchical rule selection indicate that the improvements are small because the ambiguity between string-to-tree rules is too low for a rule selection model to help. In future work, we will use different techniques to improve the diversity of the string-to-tree rules considered during decoding in our system.