A Comparison of Update Strategies for Large-Scale Maximum Expected BLEU Training

This work presents a flexible and efficient discriminative training approach for statistical machine translation. We propose to use the RPROP algorithm for optimizing a maximum expected BLEU objective and experimentally compare it to several other update schemes. It proves to be more efficient and effective than the previously proposed growth transformation technique and also yields better results than stochastic gradient descent and AdaGrad. We also report strong empirical results on two large-scale tasks, namely BOLT Chinese → English and WMT German → English, where our final systems outperform results reported by Setiawan and Zhou (2013) and on matrix.statmt.org. On the WMT task, discriminative training is performed on the full training data of 4M sentence pairs, which is unsurpassed in the literature.


Introduction
The main advantage of learning parameters in a discriminative fashion is the possibility to directly optimize towards a quality or error measure on the task that is being performed. This stands in contrast to the generative approach, where parameters are chosen to maximize likelihood under a generative story, which often bears little correspondence with the actual application of the model.
In statistical machine translation (SMT), extending the generative noisy-channel formulation (Brown et al., 1993) as a discriminative, log-linear combination of multiple models (Och, 2003) has become the state of the art. However, most of the component models are still estimated by heuristics or generative training. In this paper, a flexible, efficient and easy to implement discriminative training scheme for SMT is presented. It can be applied to any kind and any number of features. We use the RPROP algorithm to optimize a maximum expected BLEU objective. n-best lists approximate the infeasibly large space of translation hypotheses. They are generated with the application of leave-one-out to make them more representative with respect to unseen data.
We make the following main contributions: 1. We propose to apply the RPROP algorithm for maximum expected BLEU training and perform an experimental comparison with growth transformation (GT) (He and Deng, 2012; Setiawan and Zhou, 2013), stochastic gradient descent (Auli et al., 2014) and AdaGrad (Green et al., 2013). RPROP yields superior performance, reaching a total improvement of 1.2 BLEU points over our IWSLT German→English baseline using 5.22M features.

Related Work
Discriminative training is one of the most active research areas in SMT and it can be integrated into the pipeline at various stages. Och (2003) proposed to apply minimum error rate training (MERT) to optimize the different feature weights in the log-linear model combination on a small development data set. This is still considered to be the state of the art, but is only capable of optimizing a handful of features. More recently, MIRA (Watanabe et al., 2007; Chiang et al., 2008) and PRO (Hopkins and May, 2011) have been presented as optimization procedures that can replace MERT and scale to thousands of parameters.
In a different line of work, Liang et al. (2006) describe a fully discriminative training pipeline, where more than one million features are tuned on the training data using a perceptron-style update algorithm. The Direct Translation Model 2 introduced by Ittycheriah and Roukos (2007) is similar in that it also trains millions of features on the training data. However, the weights are estimated based on a maximum entropy model and the underlying translation paradigm differs from the standard phrase-based model. Gao and He (2013) use gradient ascent to train Markov random field models for phrase translation. These models are interpreted as undirected phrase compatibility scores rather than translation probabilities. Thus, as in our work, they are not subject to a sum-to-one constraint. Simianer et al. (2012) propose a distributed setup for large-scale discriminative training with joint feature selection. The training corpus is divided into several shards, on which features are updated via perceptron-style gradient descent. The authors present results showing that training on large data sets improves results over just using a small development corpus. Another approach based on the AdaGrad method that scales to large numbers of sparse features is proposed in (Green et al., 2013; Green et al., 2014). Different from our work, the authors use either the tuning sets or a small subsample of the training data (15k sentences) for discriminative training.
A notably different idea is pursued by Yu et al. (2013), who present a large-scale training procedure that explicitly minimizes search errors. This is achieved by force-decoding the training data and updating at the point where the correct derivation drops off the beam.
In (Blunsom et al., 2008), conditional random fields (CRFs) are trained within a hierarchical phrase-based translation framework. The hierarchical phrase-based paradigm is used to model the search space in model estimation and search, leaving the hypothesis weighting to CRF features. They constrain search by a beam width for gradient estimation and update the model with the help of L-BFGS. In a similar way, Lavergne et al. (2011) use the n-gram based approach (Casacuberta and Vidal, 2004; Mariño et al., 2006) to model the reordering, phrase alignment, and the language model. A CRF is applied to estimate the phrase weights. Model updates are carried out by the RPROP algorithm (Riedmiller and Braun, 1993). However, both approaches only improve over constrained baselines.
Our work is inspired by (He and Deng, 2012; Setiawan and Zhou, 2013), where the authors propose to train the standard phrasal and lexical channel models with the growth transformation (GT) algorithm. They use n-best lists on the training data and optimize a maximum expected BLEU objective, which provides a clear training criterion that is missing, e.g., in MIRA estimation. Auli et al. (2014) report good results by applying the same objective function to reordering features, which are trained with stochastic gradient descent (SGD).
Our work differs in several key aspects: (i) We propose to apply the RPROP algorithm, which yields superior results to GT, SGD and AdaGrad in our experimental comparison. (ii) For the first time, we apply maximum expected BLEU training on a data set as large as four million sentence pairs. (iii) We apply a leave-one-out heuristic (Wuebker et al., 2010) to make better use of the training data. (iv) We apply phrasal, lexical, reordering and triplet features. (v) Finally, we do not run MERT after each training iteration, which is expensive for large translation systems.

Statistical Translation System
Our work can be applied to any statistical machine translation paradigm, and we will present results on a standard phrase-based translation system (Koehn et al., 2003) and a hierarchical phrase-based translation system (Chiang, 2005). The translation process is implemented as a weighted log-linear combination of several models h_{m,Θ}(E, F), where E = e_1, ..., e_I denotes the translation hypothesis, F = f_1, ..., f_J the source sentence, m a model index, and Θ the model parameters. These models include the phrase translation and lexical smoothing scores in both directions, language model (LM) score, distortion penalty, word penalty and phrase penalty (Och and Ney, 2004). Given a source sentence F, the models h_{m,Θ}(E, F) and the corresponding log-linear feature weights λ_m, the translation decoder searches for the best scoring translation Ê:

\hat{E} = \arg\max_E \sum_m \lambda_m h_{m,\Theta}(E, F)    (1)

where the λ_m are the model weighting parameters. In practice, the Viterbi approximation is applied and, for simplicity, in the following we assume the particular derivation for a translation hypothesis to be included in the variable E. The log-linear feature weights are optimized with minimum error rate training (MERT) (Och, 2003).
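The decision rule of Equation 1 can be illustrated with a toy n-best list. The model names and score values below are invented for illustration; in practice the decoder scores full derivations.

```python
def loglinear_score(features, lambdas):
    """Weighted sum of model scores: sum_m lambda_m * h_m(E, F)."""
    return sum(lambdas[m] * h for m, h in features.items())

def best_translation(nbest, lambdas):
    """Viterbi decision rule of Equation 1: pick the highest-scoring hypothesis."""
    return max(nbest, key=lambda hyp: loglinear_score(hyp["features"], lambdas))

# Toy n-best list with two models (tm = translation model log-score,
# lm = language model log-score); names and values are invented.
nbest = [
    {"text": "good translation", "features": {"tm": -2.0, "lm": -1.0}},
    {"text": "bad translation",  "features": {"tm": -1.5, "lm": -4.0}},
]
lambdas = {"tm": 1.0, "lm": 0.5}
print(best_translation(nbest, lambdas)["text"])  # prints: good translation
```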

Previously Proposed Algorithms
The Growth Transformation (GT) or Extended Baum-Welch algorithm was proposed by He and Deng (2012) for maximum expected BLEU training of the standard phrasal and lexical channel models. It is an algorithm to iteratively optimize polynomials of random variables that are subject to sum-to-one constraints and is therefore suitable for training probability distributions. The disadvantage is that each parameter update requires a renormalization step, which artificially blows up the number of features that need to be changed and has a significant impact on time and memory efficiency. The update formulas are derived in (He and Deng, 2012).

Stochastic Gradient Descent (SGD) is a well-known and frequently applied training scheme, which is used for maximum expected BLEU training of reordering models by Auli et al. (2014). It performs the following update:

\vartheta^{(t+1)} = \vartheta^{(t)} + \eta \cdot \frac{\partial O(\Theta^{(t)})}{\partial \vartheta}

Here, the disadvantage is its high sensitivity to the fixed learning rate η. However, as it does not subject the features to sum-to-one constraints, it is considerably more time and memory efficient than GT.
As an improvement to SGD, AdaGrad (Duchi et al., 2011) is designed for large, sparse feature sets and makes use of an adaptive learning rate. It was proposed for MT training by Green et al. (2013). Although its main area of application is online algorithms, it is also applicable in our offline setting and is more robust than SGD due to the adaptive learning rate. Following (Green et al., 2013), we apply the approximation with a diagonal outer product matrix, which is computationally cheap. This results in the update equations

G^{(t)} = G^{(t-1)} + \left( \frac{\partial O(\Theta^{(t)})}{\partial \vartheta} \right)^{2}, \qquad \vartheta^{(t+1)} = \vartheta^{(t)} + \frac{\eta}{\sqrt{G^{(t)}}} \cdot \frac{\partial O(\Theta^{(t)})}{\partial \vartheta}
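A minimal sketch of the diagonal AdaGrad update for sparse parameters. The dict-based representation and the concrete values of eta and eps are illustrative assumptions; the sign is "+" because the expected BLEU objective is maximized.

```python
def adagrad_update(theta, grad_sq_sum, grad, eta=0.1, eps=1e-8):
    """One diagonal AdaGrad step over sparse, dict-based parameters.
    Gradient *ascent*, since the expected BLEU objective is maximized.
    eta and eps are illustrative values, not taken from the paper."""
    for f, g in grad.items():
        # Accumulate the squared gradient (the diagonal of the outer product).
        grad_sq_sum[f] = grad_sq_sum.get(f, 0.0) + g * g
        # Per-feature adaptive learning rate eta / sqrt(G).
        theta[f] = theta.get(f, 0.0) + eta * g / (grad_sq_sum[f] ** 0.5 + eps)
    return theta, grad_sq_sum
```

Note that only features with a non-zero gradient are touched, which keeps the update cheap for millions of mostly inactive features.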

RPROP
The resilient backpropagation algorithm (RPROP) proposed by Riedmiller and Braun (1993) is a gradient-based optimization algorithm that empirically learns the step size, using only the sign of the gradient and not its magnitude. This makes it highly robust and avoids the need for a learning rate. If the gradient switches algebraic sign compared to the previous iteration, the last step is reverted and the step size is reduced. If the sign remains the same, the step size is increased. Formally, given a set of parameters Θ and an objective function O(Θ), in iteration t each parameter ϑ ∈ Θ is updated according to

\vartheta^{(t+1)} = \vartheta^{(t)} + \mathrm{sign}\left( \frac{\partial O(\Theta^{(t)})}{\partial \vartheta} \right) \cdot \Delta\vartheta^{(t)}

where ∂O(Θ)/∂ϑ denotes the derivative of the objective function. The step size Δϑ^{(t)} > 0 grows or shrinks depending on the sign of the gradient:

\Delta\vartheta^{(t)} =
\begin{cases}
\eta^{+} \cdot \Delta\vartheta^{(t-1)}, & \text{if } \frac{\partial O(\Theta^{(t)})}{\partial \vartheta} \cdot \frac{\partial O(\Theta^{(t-1)})}{\partial \vartheta} > 0 \\
\eta^{-} \cdot \Delta\vartheta^{(t-1)}, & \text{if } \frac{\partial O(\Theta^{(t)})}{\partial \vartheta} \cdot \frac{\partial O(\Theta^{(t-1)})}{\partial \vartheta} < 0 \\
\Delta\vartheta^{(t-1)}, & \text{else}
\end{cases}

The strength parameters 0 < η⁻ < 1 ≤ η⁺ usually have little impact and are fixed to η⁻ = 0.5 and η⁺ = 1.2 throughout this work. The RPROP algorithm is simple and easy to implement. It has proven effective for a number of tasks, e.g. in (Wiesler et al., 2013; Heigold et al., 2011; Lavergne et al., 2011). Different from growth transformation (cf. Sec. 4.1), it does not assume a probability distribution and performs its updates without a sum-to-one constraint. Compared to SGD and AdaGrad, RPROP's practical advantage is the absence of a learning rate that needs to be tuned. Further, we see its theoretical advantage in the empirically learned step size. In the first iterations, RPROP's updates are considerably smaller than with the other strategies, resulting in a more careful exploration of the search space. In later iterations, the update steps for good features keep growing, and we observe an exponential increase of the objective function. In contrast, GT, SGD, and AdaGrad determine the size of their update step based on the slope of the gradient, which we believe to be misleading given the complex topology of the feature space in MT.
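The update above can be sketched as follows for sparse, dict-based parameters. The revert-and-skip behavior on a sign flip follows the description above (the RPROP+ variant); the initial step size and the step bounds are illustrative assumptions, not values from the paper.

```python
import math

def rprop_update(theta, step, prev_grad, grad,
                 eta_minus=0.5, eta_plus=1.2,
                 step_init=0.01, step_min=1e-6, step_max=50.0):
    """One RPROP iteration. Only the sign of the gradient is used; per-feature
    step sizes adapt multiplicatively. step_init, step_min and step_max are
    illustrative assumptions."""
    for f, g in grad.items():
        d = step.get(f, step_init)
        pg = prev_grad.get(f, 0.0)
        if pg * g > 0:                     # same sign: grow the step size
            d = min(d * eta_plus, step_max)
        elif pg * g < 0:                   # sign flip: revert last step, shrink
            theta[f] = theta.get(f, 0.0) - math.copysign(d, pg)
            d = max(d * eta_minus, step_min)
            g = 0.0                        # skip the update this iteration
        if g != 0.0:
            theta[f] = theta.get(f, 0.0) + math.copysign(d, g)
        step[f] = d
        prev_grad[f] = g
    return theta, step, prev_grad
```

Because only the gradient's sign enters the update, badly scaled features cannot dominate the step, which is the robustness argument made above.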

Maximum Expected BLEU
Following (He and Deng, 2012), we want to optimize a maximum expected BLEU objective. We denote the universe of possible sentences in the source language as \mathcal{F} and in the target language as \mathcal{E}. The expected BLEU score under parameter set Θ with respect to the joint probability distribution p_Θ(·,·) is defined as

\langle \beta \rangle_\Theta = \sum_{F \in \mathcal{F}} \sum_{E \in \mathcal{E}} p_\Theta(E, F) \, \beta(E)

Here, β(E) is the BLEU score for target sentence E (assuming the reference translation to be part of the mapping β) and we use the notation ⟨·⟩ to denote the expectation. Enumerating all possible source and target sentences F, E is infeasible. Therefore, we estimate the empirical expectation on a corpus C ⊂ \mathcal{E} × \mathcal{F}. We denote the source sentences in C as C_F and the size of the corpus as N = |C|. The joint probability p_Θ(E, F) is decomposed with the help of the Bayes theorem, resulting in:

p_\Theta(E, F) = p_\Theta(E \mid F) \cdot p(F)

For p(F) = N_F / N we assume the empirical distribution within the training corpus, where N_F is the count of sentence F. The summation over all E ∈ \mathcal{E} is sampled with a subset E_Θ(F) of the most likely hypotheses with respect to the parameterized probability p_Θ(E, F), which in practice is an n-best list generated by the decoder. Iterating over the corpus, this yields:

\langle \beta \rangle_\Theta \approx \frac{1}{N} \sum_{F \in C_F} \sum_{E \in E_\Theta(F)} p_\Theta(E \mid F) \, \beta(E)

We use the same unclipped sentence-level BLEU-4 score with smoothed 3-gram and 4-gram precisions as in (He and Deng, 2012), which we denote as β(E) = BLEU(E, E*_n) with respect to the reference translation E*_n.
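A sketch of the sentence-level metric β(E). Matches are unclipped, i.e. not limited by the reference count, and the 3- and 4-gram precisions are smoothed; the concrete add-one smoothing used here is an assumption, since the exact smoothing of He and Deng (2012) is not reproduced in this section.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu4(hyp, ref):
    """Unclipped sentence-level BLEU-4 with brevity penalty. The add-one
    smoothing of the 3-/4-gram precisions is an illustrative assumption."""
    hyp, ref = hyp.split(), ref.split()
    log_prec = 0.0
    for n in range(1, 5):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        match = sum(c for g, c in h.items() if g in r)  # unclipped matches
        total = max(sum(h.values()), 1)
        if n >= 3:                                      # smooth higher orders
            match, total = match + 1, total + 1
        if match == 0:
            return 0.0
        log_prec += math.log(match / total) / 4.0
    bp = min(1.0, math.exp(1.0 - len(ref) / max(len(hyp), 1)))
    return bp * math.exp(log_prec)
```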
The normalized posterior translation probability p_Θ(E|F) from source sentence F to target sentence E approximates a maximum entropy model normalized on the sentence level:

p_\Theta(E \mid F) = \frac{ \exp\left( \sum_m \lambda_m h_{m,\Theta}(E, F) \right) }{ \sum_{E' \in E_\Theta(F)} \exp\left( \sum_m \lambda_m h_{m,\Theta}(E', F) \right) }

The denominator of this probability does not depend on the output sentence. Thus, the argmax of this posterior is equal to the argmax of the translation score in Equation 1.
Maximum entropy models tend to generalize poorly, which can be circumvented by regularization. He and Deng (2012) use Kullback-Leibler regularization, which requires normalized models h_{m,Θ}(E, F). We employ the more general L2 regularization, and the objective function is defined as

O(\Theta) = \langle \beta \rangle_\Theta - \tau \sum_{\vartheta \in \Theta} \vartheta^2

including the hyperparameter τ, which controls the degree of regularization. The derivative of the objective function, which is needed for the gradient-based training methods, directly follows:

\frac{\partial O(\Theta)}{\partial \vartheta} = \frac{\partial \langle \beta \rangle_\Theta}{\partial \vartheta} - 2 \tau \vartheta

With \frac{\partial h_{m,\Theta}(E,F)}{\partial \vartheta} = \#_\vartheta(E, F), the number of times feature ϑ fires in the derivation for translation hypothesis E given source sentence F, the derivative of p_Θ(E|F) is defined as (for ease of notation, λ_m denotes the weight of the model containing ϑ):

\frac{\partial p_\Theta(E \mid F)}{\partial \vartheta} = \lambda_m \, p_\Theta(E \mid F) \left( \#_\vartheta(E, F) - \sum_{E' \in E_\Theta(F)} p_\Theta(E' \mid F) \, \#_\vartheta(E', F) \right)

And the derivative of the expected BLEU is

\frac{\partial \langle \beta \rangle_\Theta}{\partial \vartheta} = \frac{1}{N} \sum_{n=1}^{N} \lambda_m \sum_{E \in E_\Theta(F_n)} \beta(E) \, p_\Theta(E \mid F_n) \left( \#_\vartheta(E, F_n) - \sum_{E' \in E_\Theta(F_n)} p_\Theta(E' \mid F_n) \, \#_\vartheta(E', F_n) \right)

This can be more compactly expressed by local expectations ⟨·⟩_n of the BLEU score and the feature count #_ϑ:

\frac{\partial \langle \beta \rangle_\Theta}{\partial \vartheta} = \frac{1}{N} \sum_{n=1}^{N} \lambda_m \left( \langle \beta \cdot \#_\vartheta \rangle_n - \langle \beta \rangle_n \, \langle \#_\vartheta \rangle_n \right)

In our implementation, #_ϑ is moved to the front of the equation to obtain common factors that can be used by all parameter updates.

Leave-one-out
Although He and Deng (2012) claim that it is not necessary, we apply a leave-one-out heuristic similar to (Wuebker et al., 2010) when generating the n-best lists on the training data. The authors have shown this to effectively counteract over-fitting effects, and we argue that it helps to bring out the full potential of our discriminative training procedure. When we decode the training data of our translation model, very long and rare phrases can be used to translate the sentence. The translation probabilities for these phrases, which are often singletons, are generally overestimated by the heuristic count model. When they are too dominant in the n-best lists, they effectively render the training data useless, as they are unlikely to generalize to unseen data. The idea of leave-one-out is that for decoding each sentence, the global counts of the relative frequency estimates are reduced by the local counts extracted from the current sentence pair. This way, the above-mentioned rare phrases are penalized and the decoder is encouraged to use more general phrases taken from the remainder of the training data. Singleton phrases are given a fixed penalty. In this work, we apply leave-one-out with all update strategies.
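The count subtraction can be sketched as follows, assuming a relative-frequency phrase model given as count ratios. The function name and the concrete singleton penalty value are illustrative assumptions.

```python
def leave_one_out_prob(global_count, global_marginal,
                       local_count, local_marginal,
                       singleton_penalty=1e-4):
    """Relative-frequency phrase probability with the current sentence pair's
    local counts subtracted from the global counts. A phrase whose counts
    vanish after subtraction was only extracted from the current sentence
    (a singleton) and receives a fixed penalty instead; the concrete penalty
    value is an illustrative assumption."""
    c = global_count - local_count
    m = global_marginal - local_marginal
    if c <= 0 or m <= 0:
        return singleton_penalty
    return c / m
```

For example, a phrase pair seen 5 times globally (source marginal 10), with 2 of those occurrences (marginal 2) coming from the current sentence pair, scores 3/8 = 0.375 instead of the overestimated 5/10 = 0.5, while a singleton drops to the fixed penalty.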

Features
Maximum expected BLEU training facilitates training of arbitrary features. In this work we apply four types of features. (a) A discriminative phrase table, i.e. one feature for each phrase pair. (b) Lexical features, i.e. one feature for each source-target word pair that appears within the same phrase. (c) Source and target triplet features (Hasan et al., 2008), i.e. triples of one source and two target words or one target and two source words appearing within a single phrase pair. (d) The hierarchical lexicalized reordering model (Galley and Manning, 2008), i.e. one feature for each combination of phrase pair, orientation (monotone (M), swap (S) or discontinuous (D)) and orientation direction (forward or backward). GT is only applied with feature set (a), where we re-estimate the two phrasal channel models as was done in (He and Deng, 2012). With the other update algorithms we follow the approach taken in (Auli et al., 2014) and condense each feature type into a small number of models for the log-linear combination, which is afterwards tuned with MERT. (a) and (b) result in a single additional model, (c) in two models (source and target triplets) and (d) in six models ({forward,backward}×{M,S,D}).

Efficient Implementation
The expected BLEU ⟨β⟩_Θ is efficiently computed in one iteration over the full n-best list. As can be seen from the derivative of the expected BLEU above, ∂⟨β⟩_Θ/∂ϑ is additive with respect to each firing instance of feature ϑ in the n-best list. The additive factor only depends on the current sentence pair. Therefore, for each sentence of the training data we iterate through its n-best list once to compute the expectation of the sentence-level BLEU score ⟨β⟩_n, and then a second time to update the current derivative for each time the feature fires. The only thing that needs to be kept in memory is a list of the current derivatives for each parameter ϑ.
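The two-pass computation for a single sentence can be sketched as below, in the covariance form ⟨β · #_ϑ⟩_n − ⟨β⟩_n⟨#_ϑ⟩_n. The n-best entry layout ('score', 'bleu', 'counts') and the single model weight lambda_m are assumptions made for this sketch, not the paper's actual data structures.

```python
import math

def sentence_gradient(nbest, lambda_m=1.0):
    """Contribution of one sentence's n-best list to the derivative of the
    expected BLEU. Each hypothesis is a dict with 'score' (log-linear model
    score), 'bleu' (sentence-level BLEU) and 'counts' (feature -> firing
    count); this layout is an illustrative assumption."""
    # Sentence-level posterior p(E|F): a softmax over the n-best list.
    mx = max(h["score"] for h in nbest)
    Z = sum(math.exp(h["score"] - mx) for h in nbest)
    post = [math.exp(h["score"] - mx) / Z for h in nbest]

    # Pass 1: local expectations <beta>_n and <#_f>_n.
    exp_bleu = sum(p * h["bleu"] for p, h in zip(post, nbest))
    exp_cnt = {}
    for p, h in zip(post, nbest):
        for f, c in h["counts"].items():
            exp_cnt[f] = exp_cnt.get(f, 0.0) + p * c

    # Pass 2: <beta * #_f>_n, accumulated only where the feature fires.
    exp_bleu_cnt = {}
    for p, h in zip(post, nbest):
        for f, c in h["counts"].items():
            exp_bleu_cnt[f] = exp_bleu_cnt.get(f, 0.0) + p * h["bleu"] * c

    grad = {f: lambda_m * (exp_bleu_cnt.get(f, 0.0) - exp_bleu * exp_cnt[f])
            for f in exp_cnt}
    return exp_bleu, grad
```

Only features that actually fire in some hypothesis need an entry, which keeps the memory footprint at one float per active parameter, as described above.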
Figure 1: The complete training and evaluation procedure.

1. Create the baseline system and run MERT
2. Generate n-best lists on the training corpus
3. Compute the sentence-level BLEU score β(E_n) for each hypothesis E_n in the list
4. Initialize parameters with ϑ = 0 for all ϑ ∈ Θ
5. Iterate:
   a) Compute the derivatives ∂O(Θ)/∂ϑ
   b) Perform the update and output Θ^{(t)}
6. Run MERT on dev with each table Θ^{(t)}
7. Select the best Θ^{(t)} on dev
8. Evaluate on the test sets

Complete Training Algorithm
The complete training and evaluation procedure is shown in Figure 1. We start by building a baseline translation system with MERT-optimized model weights λ. With the baseline system we generate n-best lists on the training data. Now, for each translation hypothesis E_n of the n-best list, we compute the sentence-level BLEU score β(E_n) and initialize the parameter set for training with the count model. Next, we run the training algorithm for a fixed number of iterations and output the updated feature values Θ^{(t)} after each iteration t. Finally, we run MERT with each Θ^{(t)}, select the best table on dev and evaluate on our test sets.

Setup
The experiments are carried out on the IWSLT 2013 German→English shared translation task. For rapid experimentation, the translation model is trained on the in-domain TED portion of the bilingual data, which is also used for maximum expected BLEU training. However, we use a large 4-gram LM with modified Kneser-Ney smoothing (Kneser and Ney, 1995; Chen and Goodman, 1998), trained with the SRILM toolkit (Stolcke, 2002) on data including the LDC English Gigaword corpora. The selection is based on cross-entropy difference (Moore and Lewis, 2010). This makes for a total of 1.7 billion running words for LM training. The baseline further contains a hierarchical reordering model (HRM) (Galley and Manning, 2008) and a 7-gram word class language model. On IWSLT, all results are averages over three independent MERT runs, and we evaluate statistical significance with MultEval (Clark et al., 2011). To confirm our findings, additional experiments are run on two large-scale tasks over strong baselines including recurrent neural language models. On the DARPA BOLT Chinese→English task we use our internal evaluation system as a baseline. It is a powerful hierarchical phrase-based SMT engine with 19 dense features, including an LSTM recurrent neural language model (Sundermeyer et al., 2012) and a hierarchical reordering model (Huck et al., 2013). The 5-gram backoff LM is in total trained on 2.9 billion running words. We use the same data for tuning and testing as Setiawan and Zhou (2013), namely 1275 (tune) and 1239 (named dev by Setiawan and Zhou (2013)) sentences of web data taken from LDC2010E30, the NIST MT06 evaluation set and an additional single-reference test set from the discussion forum (df) domain containing 1124 sentence pairs. Maximum expected BLEU training is performed on the discussion forum portion of the training data, consisting of 67.8K sentence pairs.
On the German→English task of the 9th Workshop on Statistical Machine Translation, both the translation model and maximum expected BLEU training are performed on all available bilingual data. Our baseline is a phrase-based translation engine with a 4-gram backoff LM trained on 2.5 billion words with lmplz (Heafield et al., 2013), a recurrent neural LM, a 7-gram word class LM and the HRM. Bilingual data statistics for all tasks are given in Table 1. We use the machine translation toolkit Jane (Vilar et al., 2010; Wuebker et al., 2012) and evaluate with case-insensitive BLEU [%] (Papineni et al., 2002) in all experiments.

Table 2 shows the IWSLT results. We first compare the performance of the four update algorithms, for simplicity only on the discriminative phrase table features. Different from previous work, the n-best lists of the training data were generated with leave-one-out, unless otherwise stated. In all cases we tested different values for the regularization parameter τ and, in the case of SGD and AdaGrad, also for the learning rate η. We selected the best configurations based on a validation set (test2011). For AdaGrad we also experimented with FOBOS regularization and feature selection (Duchi and Singer, 2009), but did not observe improved results. As expected, we found that in all cases regularization is not strictly necessary (results are barely affected as long as τ is sufficiently small) and that SGD is much more sensitive to η than AdaGrad. Further, SGD and RPROP need around 25 iterations to reach good results, whereas 5-10 iterations are sufficient for GT and AdaGrad. For a fair comparison, however, we run all algorithms for 40 iterations and select the best one on a selection set, namely iterations 19 (AdaGrad), 23 (GT), 29 (RPROP) and 35 (SGD). Figure 2 shows how the expected BLEU function evolves in training with different update strategies.
Although the value for GT is not directly comparable to the others due to a different regularization term, the respective characteristics are clearly visible. SGD exhibits a linear growth pattern, GT resembles a logarithmic and RPROP an exponential function. After initially overshooting and then retracting as the regularization kicks in, AdaGrad also displays logarithmic characteristics.

Experimental Results
In terms of BLEU, RPROP performs best, followed by AdaGrad, GT and SGD, where the RPROP-AdaGrad and AdaGrad-GT differences are small (0.2% BLEU absolute) but statistically significant at the 95% level. Altogether, RPROP improves over the baseline by 0.9 BLEU points, which is statistically significant at the 99% level. In an additional experiment we verified that leave-one-out has a clear impact on the results: the BLEU difference between RPROP with and without leave-one-out is 0.6% absolute. By adding lexical, triplet and reordering features, we get an additional gain and observe a total improvement of 1.2 BLEU points over the baseline system.

Efficiency comparison. 921K discriminative phrase table features are active in our training data. Due to the renormalization component, this results in a total of 6.08M features that are updated with GT using the same data. Consequently, it is less time and space efficient than the other algorithms. With our implementation, GT needed around 16 hours and 6.7G of memory for 40 iterations, whereas RPROP, AdaGrad and SGD finished after less than 2.5 hours and required 2.1G of memory.
For the BOLT task, we directly compare with the GT-trained system of Setiawan and Zhou (2013), using the same tune set for MERT and reporting results on the same test sets, see Table 3. With RPROP we achieve nearly twice the improvement reported by Setiawan and Zhou on both web and MT06 using feature sets (a)-(c) (reordering features are not applicable to our hierarchical system). Our baseline on web is already much stronger, and RPROP training yields +0.7 BLEU points, as opposed to the +0.44 reported by Setiawan and Zhou. On MT06 our baseline system is slightly worse, but with the larger gain obtained by RPROP our final system outperforms the one reported by Setiawan and Zhou by 0.2 BLEU points. We would like to stress that this is not a domain adaptation effect, as maximum expected BLEU training was performed on discussion forum (df) data. On the df test set, on the other hand, we probably can observe domain adaptation via RPROP training. The improvement here is 0.7% BLEU absolute with a single reference, as opposed to four references on web and MT06. We also report results training the same feature sets with SGD and AdaGrad, confirming the results we observed on IWSLT. Here, SGD yields only minor improvements. AdaGrad performs better, but is still 0.1-0.4 BLEU points worse than RPROP. Running GT is infeasible in our hierarchical phrase-based setup.

Table 4 shows the results on the WMT task. This is our largest setting, where maximum expected BLEU training is performed on the full training data with more than 4M sentence pairs which, to the best of our knowledge, is unsurpassed in the literature. Altogether, training took more than one month, about three quarters of which was spent generating n-best lists by decoding the training data. The triplet features did not finish in time, so we applied the feature sets (a), (b) and (d), 45M features in total. With a renormalization step as in GT, this number would grow to 309M.
On newstest2013, our baseline already outperforms the best single system reported on matrix.statmt.org by 0.2 BLEU points. The discriminatively trained features yield an additional improvement of 0.6% BLEU absolute on this high-end system.

Conclusion
We have experimentally compared several update strategies for maximum expected BLEU training. The RPROP algorithm proposed in this work shows superior performance compared to AdaGrad, growth transformation (GT) and stochastic gradient descent. In terms of time and memory efficiency, GT is clearly inferior to the other algorithms due to renormalization. Applying phrasal, lexical, triplet and reordering features, the baseline is improved by 1.2% BLEU absolute on the IWSLT German→English task. On two large-scale tasks we achieve clearly superior performance compared to results reported in the literature. On BOLT Chinese→English our discriminative training yields nearly twice the improvement reported by Setiawan and Zhou (2013), resulting in a superior final system. On WMT German→English, we outperform the best single system reported on matrix.statmt.org by 0.8% BLEU absolute. Here, we perform maximum expected BLEU training on more than 4M sentence pairs, which is the largest number reported in the literature to date.