A Framework for Discriminative Rule Selection in Hierarchical Moses

Training discriminative rule selection models is usually expensive because of the very large size of the hierarchical grammar. Previous approaches reduced the training costs either by (i) using models that are local to the source side of the rules or (ii) heavily pruning out negative samples. Moreover, all previous evaluations were performed on small scale translation tasks, containing at most 250,000 sentence pairs. We propose two contributions to discriminative rule selection. First, we test previous approaches on two French-English translation tasks in domains for which only limited resources are available and show that they fail to improve translation quality. To improve on such tasks, we propose a rule selection model that is (i) global with rich label-dependent features and (ii) trained with all available negative samples. Our global model yields significant improvements, up to 1 BLEU point, over previously proposed rule selection models. Second, we successfully scale rule selection models to large translation tasks but have so far failed to produce significant improvements in BLEU on these tasks.


Introduction
Hierarchical phrase-based machine translation (Chiang, 2005) performs non-local reordering in a formally syntax-based way. It allows flexible rule extraction and application by using a grammar without linguistic annotation. As a consequence, many hierarchical rules can be used to translate a given input segment even though only a subset of these yields a correct translation. For instance, rules r1 to r3 can be applied to translate the French sentence F1 below, although only r1 yields the correct translation E.
F1 (literal gloss) A study on the (interest) X1 practical (of our approach) X2 .
E A study on the practical (interest) X1 (of our approach) X2 .
The rule scoring heuristics defined by Chiang (2005) do not handle rule selection in a satisfactory way, and many authors have proposed solutions. Models that use the syntactic structure of the source and target sentence have been proposed by Marton and Resnik (2008), Marton et al. (2012), Chiang et al. (2009), Chiang (2010) and Liu et al. (2011). These approaches exclusively take into account syntactic structure and do not model rule selection (see Section 6 for a detailed discussion). Following the work on phrase-sense disambiguation by Carpuat and Wu (2007), other authors improve rule selection by defining features on the structure of hierarchical rules and combining these with information about the source sentence (Chan et al., 2007; He et al., 2010; Cui et al., 2010). In these approaches, rule selection is the task of selecting the target side of a rule given its source side as well as contextual information about the source sentence. This task is modeled as a multiclass classification problem.
Because of the very large size of hierarchical grammars, the training procedure for discriminative rule selection models is typically very expensive: multiclass classification is performed over millions of classes (one for each possible target side of a hierarchical rule). To overcome this problem, previous approaches reduced the training costs by either (i) using models that are local to the source side of hierarchical rules or (ii) heavily pruning out negative samples from the training data. Chan et al. (2007) and He et al. (2010) train one (local) classifier for each source side or pattern of hierarchical rules instead of defining a (global) model over all rules. Cui et al. (2010) train global models but, in addition to rule table pruning, they heavily prune out negative instances. Finally, in all previous approaches, a small number of fixed features is used for training and prediction.
While previous approaches have been shown to work on a small English-Chinese news translation task, we show (in Section 4) that on French-English tasks in domains for which only a limited amount of training data is available (which we call low resource tasks), they fail to improve over a hierarchical baseline. This failure is caused by the fact that the models proposed so far do not take advantage of all information available in the training data. Local models prevent feature sharing between rules with different source sides or patterns (see Section 2.3), while aggressive pruning removes important information from the training data (see Section 3.2). On low resource translation tasks, this loss hurts translation quality. Moreover, the small set of features used in previous work does not provide a representation of the training data that is as powerful as it could be for classification (see Section 2.2).
We improve on previous work in two ways. First, we define a global rule selection model with a rich set of feature combinations. Our global model enables feature sharing, while the large number of features we use offers a complete representation of the available training data. We train our model with all acquired training examples. The exhaustive training of a feature-rich global model allows us to take full advantage of the training data. We show on two low resource French-English translation tasks that local and pruned models often fail to improve over a hierarchical baseline, while our global model with exhaustive training yields significant improvements on scientific and medical texts (see Section 4). In a second contribution, we successfully scale rule selection models to large scale translation tasks but fail to produce significant improvements in BLEU over a hierarchical baseline on these tasks.
Because our approach needs to scale to a large number of training examples, we need a classifier that is fast and supports online streaming. We use the high-speed classifier Vowpal Wabbit (VW), which we fully integrate into the syntax component (Hoang et al., 2009) of the Moses machine translation toolkit (Koehn et al., 2007). To allow researchers to replicate our results and improve on our work, we make our implementation publicly available as part of Moses.

Global Rule Selection Model
The goal of rule selection is to choose the correct target side of a hierarchical rule, given a source side as well as other sources of information such as the shape of the rule or its context of application in the source sentence. The latter includes lexical features (e.g. the words surrounding the source span of an applied rule) or syntactic features (e.g. the position of an applied rule in the source parse tree). The rule selection task can be modeled as a multiclass classification problem where each target side corresponding to a source side gets a label.
Contrary to Chan et al. (2007) and He et al. (2010), we solve the classification problem by building a single global discriminative model instead of using one maximum entropy classifier for each source side or pattern. We solve the rule selection problem through multiclass classification, while Cui et al. (2010) approximate the problem by using a binary classifier.

Model Definition
We denote SCFG rules by X → ⟨α, γ⟩, where α is a source and γ a target language string (Chiang, 2005). By C(f, α) we denote information about the source sentence f and the source side α. R(α, γ) represents features on hierarchical rules. Our discriminative model estimates P(γ | α, C(f, α), R(α, γ)) and is normalized over the set G of candidate target sides γ for a given α. The function GTO : α → G generates, given the source side, the set G of all corresponding target sides γ. The estimated distribution is the log-linear model

P(γ | α, C(f, α), R(α, γ)) = exp(Σ_i λ_i h_i(γ, α, C(f, α), R(α, γ))) / Σ_{γ' ∈ GTO(α)} exp(Σ_i λ_i h_i(γ', α, C(f, α), R(α, γ')))

where the h_i are the feature functions built from the templates in Section 2.2 and the λ_i their weights.

In the same fashion as for local models, our global model predicts the target side of a rule given its source side and contextual features, meaning that it still disambiguates between rules with the same source side using rich context information. However, because the global model trains a single classifier over all rules, it captures information that is shared among rules with different source sides (see Section 2.3 for more details).
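The normalized scoring can be sketched as a softmax over the candidate target sides of a source side. The feature representation and weights below are illustrative placeholders, not our actual templates:

```python
import math

def p_target_side(alpha, candidates, features, weights):
    """For a source side alpha, score each candidate target side gamma
    by a linear model over its features and normalize with a softmax
    over all candidates (a sketch of the log-linear model above)."""
    scores = {g: sum(weights.get(f, 0.0) for f in features(alpha, g))
              for g in candidates}
    z = sum(math.exp(s) for s in scores.values())
    return {g: math.exp(s) / z for g, s in scores.items()}

# Toy example: two candidate target sides for one source side; the
# feature extractor and the single weight are hypothetical.
feats = lambda a, g: [f"src={a}", f"src={a}^tgt={g}"]
w = {"src=pratique X1 X2^tgt=practical X1 X2": 1.0}
probs = p_target_side("pratique X1 X2",
                      ["practical X1 X2", "practice"], feats, w)
```

Normalization runs only over GTO(α), i.e. the target sides observed with this source side, which keeps the per-example cost small even though the grammar is huge.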

Feature Templates
We now present the feature templates R(α, γ) and C(f, α) in the equation presented in Section 2.1. While in isolation the features composing the templates are similar to the features used in previous work (He et al., 2010; Cui et al., 2010), we create powerful representations by dividing our feature set into fixed and label-dependent features and taking the cross product of these.
We begin by presenting the features in our templates. To this aim, suppose that rule r4 has been extracted from sentence F2. The 1-best parse tree of F2 is given in Figure 1.
Shape features on rules are given in Figure 2. Lexical features are given in Figure 3, where the term "factored form" denotes the surface form, POS tag and lemma of a word. Syntactic features are given in Figure 4. In order to create powerful representations, we combine the features above into more complex templates. To this aim, we distribute our features into two categories:
1. A set of fixed features S on the source sentence context and source side of the rule.
2. A set of features T which vary with the target side of the rule, which we call label-dependent.
The set S includes the lexical and syntactic features in Figures 3 and 4 as well as shape features on the source side α (first 2 rows of Figure 2). The set T contains all shape features involving the target side of the rules (last 5 rows of Figure 2). Our feature space consists of all source and target features S and T as well as the cross product S × T. The features resulting from the cross product S × T capture many aspects of rule selection that are lost when the features are considered in isolation. For instance, the cross product of (i) the lexical features (Figure 3) and source word shape features (Figure 2, row 2) with (ii) the target word shape features (Figure 2, row 4) creates typical templates of a discriminative word lexicon. In the same fashion, the cross product of (i) the syntactic features (Figure 4) with (ii) the target alignment shape feature (Figure 2, row 6) creates the templates of a reordering model using syntactic features.
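The construction of the combined feature space can be sketched in a few lines; the feature names below are illustrative, not our exact templates:

```python
from itertools import product

def cross_features(S, T):
    """Cross product S x T of fixed source-context features and
    label-dependent target features, conjoined into single features."""
    return [f"{s}^{t}" for s, t in product(S, T)]

S = ["left_pos=P", "src_word=pratique"]   # fixed source-side features
T = ["tgt_word=practical", "align=mono"]  # label-dependent target features

# Full feature space: S, T, and all conjunctions S x T.
combined = S + T + cross_features(S, T)
```

Each conjoined feature such as "src_word=pratique^tgt_word=practical" fires only for one source/target pairing, which is what gives the cross product its discriminative power.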

Feature Sharing
An advantage of global models over local ones is that they allow feature sharing between rules with different source sides. Through sharing, features that do not depend on the source side of rules but are nevertheless often seen across all rules can be captured. As an illustration, assume that rules r5 and r6 have been extracted from sentence F3 below. The 1-best parse of F3 is given in Figure 5.

(r5) X → ⟨modèles X1 de bas X2 , X1 X2 models⟩
(r6) X → ⟨modèles X1 de X2 , X1 X2 models⟩
F3 Un article sur les modèles (statistiques) X1 de (bas niveau) X2 .
   A paper on the models (statistical) X1 of (low-level) X2 .

Although r4, r5 and r6 have completely different source sides, they share many contextual features, such as: (i) the POS tags of the first and second words to the left of the segment where the rules are applied (which are P and D); (ii) the syntactic structure of this segment (it is not a complete constituent and has an NP as its lowest parent); (iii) the rule span width (which is 5).
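The mechanism behind feature sharing can be sketched as follows: in a global model there is a single weight vector, so contextual features that do not mention the source side (the hypothetical names below are illustrative) accumulate evidence from rules with different source sides:

```python
from collections import defaultdict

# A single global weight vector shared by all rules.
weights = defaultdict(float)

def update(features, delta):
    """Apply one (schematic) training update to every active feature."""
    for f in features:
        weights[f] += delta

# r5 and r6 have different source sides but share contextual features.
r5_feats = ["src=modèles X1 de bas X2", "left_pos=P,D", "span=5"]
r6_feats = ["src=modèles X1 de X2", "left_pos=P,D", "span=5"]
update(r5_feats, 1.0)
update(r6_feats, 1.0)
# The shared context features received both updates; each source-side
# feature received only one. A local model, with one classifier per
# source side, could never pool these two updates.
```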

Exhaustive Model Training
Training examples for our classifier are generated each time a hierarchical rule can be extracted from the parallel corpus (see Section 3.1). This procedure leads to a very large number of training examples. In contrast to Cui et al. (2010), we do not prune out negative samples and use all available data to train our model.

Training procedure
We create training examples using the rule extraction procedure of Chiang (2005). We first extract a rule table in the standard way. Then, each time a rule a1 : X → ⟨α, γ⟩ can be extracted from the parallel corpus, we create a new training example. γ is the correct class and receives a cost of 0. We create incorrect classes using the rules a2, . . . , an in the rule table that have the same source side as a1 but different target sides. As an example, suppose that rule r1 introduced in Section 1 has been extracted from sentence F1. The target side "practical X1 X2" is the correct class and gets a cost of 0. The target sides of all other rules with the same source side, such as r2 and r3, are incorrect classes.
This process leads to a very large number of training examples, and for each of these we generally have multiple incorrect classes. The total number of training examples for our French-English data sets is displayed in Table 1. To train at this scale, we use the cost-sensitive learning reductions (Beygelzimer et al., 2005) implemented in VW; specifically, the training algorithm we use is the label-dependent version of Cost-Sensitive One-Against-All. Two features of VW which are useful for our work are feature hashing and quadratic feature expansion. The quadratic expansion allows us to take the cross product of the simple source and target features without having to actually write this expansion to disk, which would be prohibitive. Feature hashing (Weinberger et al., 2009) is also important for scaling the classifier to the enormous number of features created by the cross-product expansion.
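One training example in VW's label-dependent cost-sensitive text format can be sketched as below: a "shared" line carrying the source-context features, followed by one line per candidate target side with its cost. The feature names and namespaces (|s, |t) are illustrative choices, not our exact templates:

```python
def vw_ldf_example(shared_feats, candidates):
    """Emit one cost-sensitive label-dependent example in VW's multiline
    text format: a 'shared' line with source-context features, then one
    line per candidate target side with its cost (0 for the extracted
    target side, 1 otherwise); a blank line terminates the example."""
    lines = ["shared |s " + " ".join(shared_feats)]
    for idx, (feats, cost) in enumerate(candidates, start=1):
        lines.append(f"{idx}:{cost} |t " + " ".join(feats))
    return "\n".join(lines) + "\n\n"

ex = vw_ldf_example(
    ["src=pratique_X1_X2", "left_pos=P"],
    [(["tgt=practical_X1_X2"], 0.0),   # correct class, cost 0
     (["tgt=practice"], 1.0),          # negative sample
     (["tgt=process"], 1.0)])          # negative sample
```

With namespaces s and t, VW's quadratic expansion (e.g. `-q st`) then builds the S × T conjunctions on the fly, so the cross product never needs to be written to disk.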
We avoid overfitting the training data by employing early stopping once classifier accuracy decreases on a held-out dataset. Our model is integrated into the hierarchical framework as an additional feature of the log-linear model.
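The early stopping criterion can be sketched as the loop below; `train_pass` and `heldout_accuracy` are placeholders standing in for the actual VW training and evaluation calls:

```python
def train_with_early_stopping(train_pass, heldout_accuracy, max_passes=20):
    """Run training passes and stop as soon as held-out accuracy no
    longer improves, returning the best pass and its accuracy."""
    best_acc, best_pass = float("-inf"), 0
    for p in range(1, max_passes + 1):
        train_pass(p)                 # one pass over the training data
        acc = heldout_accuracy()      # evaluate on the held-out set
        if acc <= best_acc:
            break                     # accuracy stopped improving
        best_acc, best_pass = acc, p
    return best_pass, best_acc

# Toy run: accuracy rises for two passes, then drops.
accs = iter([0.70, 0.74, 0.73])
best = train_with_early_stopping(lambda p: None, lambda: next(accs), 5)
```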

Training without Pruning of Negative Examples
By not pruning negative samples, we keep important information for model training. As an illustration, consider the example presented above (Section 3.1), where rule r1 is a positive instance and r2 and r3 are negative samples. The negative instances indicate that in the context of sentence F1, the internal features of r2 and r3 are not correct. Without them, for instance, a piece of information that could be paraphrased as I is lost.
I In the syntactic and lexical context of F1, the terminal "pratique" should be translated neither into "practice" nor into "process".

Consider sentence F4, which has a context similar to that of F1 in terms of the lexical and syntactic features described in Section 2.2. To illustrate the syntactic features common to F1 and F4, we give the 1-best parse trees of these sentences in Figures 6 and 7.
In pruning-based approaches, if r2 and r3 appear infrequently in the training data, they are pruned out and information I is lost. If, at decoding time, candidate rules that share features with r2 and r3 are bad candidates to translate F1 and F4, then their application is not blocked by a discriminative model based on I. For instance, if rules r7 and r8 have high scores in the hierarchical model but are bad candidates in the context of sentences F1 and F4, then a pruned model fails to block their application. In other words, the discriminative model does not learn that rules containing the lexical items "practice" and "process" on the target language side are bad candidates to translate F1 and F4. As a consequence, the application of r7 and r8 to F4 generates the erroneous translations E*1 and E*2 below.
E*1 The advantages of the of robotics aspects practice
E*2 The advantages of the aspects of robotics process

Experiments on small domains
In a first set of experiments, we evaluate our approach on two low resource French-English translation tasks: (i) a set of scientific articles and (ii) a set of biomedical texts. As these data sets cover small domains, they allow us to investigate the usefulness of our approach in this context. The goal of our experiments is to verify three hypotheses:
h1 Our approach beats a hierarchical baseline.
h2 Our global model outperforms its local variants.
h3 Our exhaustive training procedure beats systems trained with pruned data.

Experimental Setup
Our scientific data consists of the scientific abstracts provided by Carpuat et al. (2013). Language models are trained with SRILM (Stolcke, 2002); the training data for the language model is the English side of the training corpus for each task.
We train the translation model in the standard way, using GIZA++ for word alignment. After training, we reduce the number of translation rules using significance testing (Johnson et al., 2007). For feature extraction, we parse the French part of our training data using the Berkeley parser (Petrov et al., 2006) and lemmatize and POS tag it using Morfette (Chrupała et al., 2008). We train the rule selection model using VW. All systems are tuned using batch MIRA (Cherry and Foster, 2012). We measure overall translation quality using 4-gram BLEU (Papineni et al., 2002), computed on tokenized and lowercased data for all systems. Statistical significance is computed with the pairwise bootstrap resampling technique of Koehn (2004).

Compared Systems
We investigate systems including a discriminative model in the three setups given in the table below. For each setup, we train a global model using a single classifier. For instance, for the setup LexGlob we train a classifier with the lexical and rule shape features presented in Section 2.2 together with their cross product.

Name          Description
LexGlob       Rule shape and lexical features
SyntGlob      Rule shape and syntactic features
LexSyntGlob   Rule shape, lexical and syntactic features

In order to verify our first hypothesis (h1), we show that our approach yields significant improvements over the hierarchical model of Chiang (2005). The results of this experiment are given in Table 2.
To verify our second hypothesis (h2), we show that global rule selection models significantly improve over their local variants. For this second evaluation, we train local models with the same feature templates. Local models with rule shape and lexical features correspond to the models of Chan et al. (2007) and He et al. (2010). We further test the performance of local rule selection models by also including syntactic features and a combination of those with the lexical features. We report the results in Table 3, where the local systems are denoted by LexLoc, SyntLoc and LexSyntLoc.
For our third hypothesis (h3), we show that pruning hurts translation quality. To this aim, we take our best performing global model, which uses syntactic and rule shape features, and perform heavy pruning of negative examples in the data used for classifier training. To reproduce the context-based target model of Cui et al. (2010), we pruned as many negative examples as necessary to obtain approximately the same proportion of positive and negative examples as they report: we removed negative instances created from rules with target side frequency < 5000. In the next section, we denote this system by SyntPrun and compare it to the hierarchical baseline as well as to our global model in Table 4.
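The pruning baseline can be sketched as a simple frequency filter over the generated training instances; the data structures and the example counts are illustrative:

```python
def prune_negatives(examples, target_counts, threshold=5000):
    """Sketch of the pruning baseline (after Cui et al., 2010): drop
    negative instances (cost 1) whose target side occurs fewer than
    `threshold` times in the training data; positives are always kept."""
    return [(tgt, cost) for tgt, cost in examples
            if cost == 0.0 or target_counts.get(tgt, 0) >= threshold]

examples = [("practical X1 X2", 0.0),  # positive instance
            ("practice", 1.0),         # rare negative
            ("process", 1.0)]          # frequent negative
counts = {"practical X1 X2": 12000, "practice": 40, "process": 6000}
kept = prune_negatives(examples, counts)
# The rare negative "practice" is pruned, so the information that it is
# a bad candidate in this context never reaches the classifier.
```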

Results
The outcome of our experiments confirms hypotheses h1 and h3 on all data sets, and h2 on the medical data only.
The results of our first evaluation (Table 2) show that on all data sets our global rule selection model outperforms the hierarchical baseline (h 1 ).
The results of our second evaluation (i.e. local vs. global models in Table 3) show that h2 holds on the medical domain only. On scientific data, global rule selection models in all setups perform slightly better than their local versions, but the difference is not statistically significant. Note that all rule selection models except LexLoc outperform the hierarchical baseline. The best performing system is a global model with syntactic features (SyntGlob). On medical texts, global models outperform their local variants for all feature templates. In each setup, the improvement of the global models over the local ones is statistically significant. SyntGlob achieves the best performance and yields significant improvements over the baseline. The good performance of SyntGlob on scientific and especially medical data can be explained by the fact that syntactic features are less sparse than lexical features and hence generalize better. This is especially important within a global model that allows feature sharing between source sides of rules. Even a combination of lexical and syntactic features underperforms syntactic features on their own because of the sparse lexical features.
The results of our third evaluation are displayed in Table 4. These show that on all data sets our global model without pruning outperforms the same model with pruned training data (h3). They also show that the pruned model fails to outperform the hierarchical baseline. In the result tables, we use * to mark global systems that yield statistically significant (at confidence p < 0.05) improvements over their local variants; results in bold are statistically significant improvements over the hierarchical baseline. Note that this outcome is consistent with the results reported by Cui et al. (2010): their context-based target model yields very low improvements when used in isolation.

Large scale Experiments
In a second set of experiments, we evaluate the usefulness of our approach on two large scale translation tasks: (i) a French-to-English news translation task trained on 1,500,000 parallel sentences and (ii) an English-to-Romanian news translation task trained on 600,000 parallel sentences. Our goal is to verify whether, on large scale translation tasks, our global rule selection model outperforms a hierarchical baseline (hypothesis h1 above). The results, given in Table 5, show that on large scale tasks, rule selection models with syntactic features yield small improvements over the hierarchical baseline. However, none of these improvements is statistically significant. Hence hypothesis h1 does not hold on the large scale tasks.
Related Work
Marton and Resnik (2008) and Marton et al. (2012) improve hierarchical machine translation by augmenting the translation model with fine-grained syntactic features of the source sentence. These features reward rules that match syntactic constituents and punish non-matching rules. Chiang et al. (2009) integrate these features into a translation model containing a large number of other features such as discount or insertion features. Chiang (2010) extends the approach of Marton and Resnik (2008) by also including syntactic information about the target sentence that is built during decoding, while Liu et al. (2011) define a discriminative model over source-side constituent labels instead of rewarding matching constituents. The training data for their model is based on source sentence derivations. In contrast to this work, we define a rule selection model, i.e. a discriminative model over the target side of hierarchical rules. The training data for our model is based on the hierarchical rule extraction procedure: we acquire training instances by labeling candidate rules extracted from the same sentence pairs. Closer to our work is a line of research that defines discriminative rule selection models with lexical features similar to the ones we presented in Section 2.2, building on Chan et al. (2007), who integrate a word sense disambiguation system into a hierarchical system; Chan et al. (2007), however, focus on hierarchical rules containing only terminal symbols and having length 2. These approaches train rule selection models that are local to the source side of hierarchical rules. He et al. (2010) generalize this work by defining a model that is local to source patterns instead of the source side of each rule. We extend these approaches by defining a global model that generalizes over all rules instead of rules with the same source side or source pattern. We also extend the feature set by defining models with syntactic features.
Cui et al. (2010) propose a joint rule selection model over the source and target sides of hierarchical rules. Our work is similar to their Context-Based Target Model (CBTM) but integrates much more information by not reducing the rule selection problem to a binary classification problem and by not pruning the set of negative examples. We show empirically that the exhaustive training of our model significantly improves over their CBTM.
Finally, several authors train local rule selection models for different types of syntax- and semantics-based systems, e.g. local discriminative rule selection models for tree-to-string machine translation, while Zhai et al. (2013) propose a discriminative model to disambiguate predicate argument structures (PAS). In contrast, our rule selection model uses syntactic features on hierarchical rules and is a global model.
All of the mentioned models are trained using the maximum entropy approach (Berger et al., 1996), which does not seem to scale well, as reported by Cui et al. (2010). By using a high-speed streaming classifier, we are able to train a global model performing true multiclass classification without pruning of training examples.

Conclusion and Future Work
We have presented two contributions to previous work on rule selection. First, we improved translation quality on low resource translation tasks by defining a global discriminative rule selection model trained on all available training examples. In a second contribution, we successfully scaled our global rule selection model to large scale translation tasks and presented the first evaluation of discriminative rule selection on such tasks. However, we failed so far to produce significant improvements in BLEU over a hierarchical baseline on large scale French-to-English and English-to-Romanian translation tasks. To allow researchers to replicate our results and improve on our work, we make our implementation publicly available as part of Moses.