Lexicalized Reordering for Left-to-Right Hierarchical Phrase-based Translation

Phrase-based and hierarchical phrase-based (Hiero) translation models differ radically in the way reordering is modeled. Lexicalized reordering models play an important role in phrase-based MT, and such models have been added to CKY-based decoders for Hiero. Watanabe et al. (2006) proposed a promising decoding algorithm for Hiero (LR-Hiero) that visits input spans in arbitrary order and produces the translation in left-to-right (LR) order, which requires far fewer language model calls and considerably speeds up decoding. We introduce a novel shift-reduce algorithm to LR-Hiero to decode with our lexicalized reordering model (LRM) and show that it improves translation quality for Czech-English, Chinese-English and German-English.


Introduction
Phrase-based machine translation handles reordering between source and target languages by visiting phrases in the source in arbitrary order while generating the target from left to right. A distortion penalty is used to penalize deviation from the monotone translation (no reordering) (Koehn et al., 2003; Och and Ney, 2004). Identical distortion penalties for different types of phrases ignore the fact that certain phrases (with certain words) are more likely to reorder than others. State-of-the-art phrase-based translation systems address this issue by applying a lexicalized reordering model (LRM) (Tillmann, 2004; Koehn et al., 2007; Galley and Manning, 2008; Galley and Manning, 2010), which uses word-aligned data to score phrase-pair reordering. These models distinguish three orientations with respect to the previously translated phrase: monotone (M), swap (S), and discontinuous (D), which are primarily designed to handle local reorderings of neighbouring phrases.
* This work was done while the first author was a Ph.D. student at SFU.
Hierarchical phrase-based translation (Hiero) (Chiang, 2007) uses hierarchical phrases for translation, represented as a lexicalized synchronous context-free grammar (SCFG). Non-terminals in the SCFG rules correspond to gaps in phrases, which are recursively filled by other rules (phrases). The SCFG rules are extracted from word and phrase alignments of a bitext. Hiero uses CKY-style decoding, which parses the source sentence with time complexity $O(n^3)$ and synchronously generates the target sentence (translation). Watanabe et al. (2006) proposed a left-to-right (LR) decoding algorithm for Hiero (LR-Hiero), which follows the Earley algorithm (Earley, 1970) to parse the source sentence and synchronously generate the translation in a left-to-right manner. This algorithm is combined with beam search and has time complexity $O(n^2 b)$, where $n$ is the length of the source sentence and $b$ is the beam size (Huang and Mi, 2010). LR-Hiero constrains the SCFG rules to be prefix-lexicalized on the target side, also known as Greibach Normal Form (GNF). Throughout this paper we abuse the notation for simplicity and use the term GNF grammars for such SCFGs. This leads to a single language model (LM) history for each hypothesis and speeds up decoding significantly, by up to a factor of four (Siahbani et al., 2013).
Figure 1: The process of translating a Chinese sentence (Fig. 2) into English using LR-Hiero. The left side shows the rule used in each step of creating the derivation. The hypotheses column shows 3-tuple partial hypotheses: the translation prefix, $h_t$, the ordered list of yet-to-be-covered spans, $h_s$, and the cost, $h_c$.
The Hiero translation model handles reordering very differently from a phrase-based model, through weighted translation rules (SCFGs) determined by non-terminal mappings. The rule $X \rightarrow \langle$ne $X_1$ pas, do not $X_1\rangle$ indicates that the translation of the phrase between ne and pas comes after the English phrase do not. However, reordering features can also be added to the Hiero log-linear translation model. Siahbani et al. (2013) introduce a new distortion feature to Hiero and LR-Hiero which significantly improves translation quality in LR-Hiero and improves Hiero results to a lesser extent. Nguyen and Vogel (2013) integrate phrase-based distortion and lexicalized reordering features with a CKY-based Hiero decoder, which significantly improves translation quality. In their approach, each partial hypothesis during decoding is mapped into a sequence of phrase pairs, and the distortion and reordering features are then computed as in phrase-based MT. They use an LRM trained for phrase-based MT (Galley and Manning, 2010), which imposes some restrictions on the Hiero rules. Cao et al. (2014) and Huck et al. (2013) propose different approaches to directly train an LRM for Hiero rules. However, these approaches are designed for CKY decoding and cannot be directly used or adapted for LR-Hiero decoding, which uses an Earley-style parsing algorithm. The crucial difference is the nature of bottom-up versus left-to-right decisions for lexicalized reordering and for generating the translation in a left-to-right manner. In this paper, we introduce a novel shift-reduce algorithm to learn a lexicalized reordering model (LRM) for LR-Hiero. We show that augmenting LR-Hiero with an LRM improves translation quality for Czech-English and significantly improves results for Chinese-English and German-English, while performing three times fewer language model queries on average compared to CKY-Hiero.

Lexicalized Reordering for LR-Hiero
The main idea in phrase-based LRM is to divide possible reorderings into three orientations that can be easily determined during decoding and also from word-aligned sentence pairs (parallel corpus). Given a source sentence $f$, a sequence of target-language phrases $e = (\bar{e}_1, \ldots, \bar{e}_n)$ is generated by the decoder. A phrase alignment $a = (a_1, \ldots, a_n)$ defines a source phrase $\bar{f}_{a_i}$ for each target phrase $\bar{e}_i$. For each phrase pair $\langle \bar{f}_{a_i}, \bar{e}_i \rangle$, the orientations are described in terms of the previously translated source phrase $\bar{f}_{a_{i-1}}$:
• Monotone (M): $a_i - a_{i-1} = 1$
• Swap (S): $a_i - a_{i-1} = -1$
• Discontinuous (D): $|a_i - a_{i-1}| \neq 1$
We only define the left-to-right case here; the right-to-left case ($\bar{f}_{a_{i+1}}$) is symmetrical. The probability of an orientation given a phrase pair $\langle \bar{f}, \bar{e} \rangle$ can be estimated using relative frequency:
$$P(o \mid \bar{f}, \bar{e}) = \frac{cnt(o, \bar{f}, \bar{e})}{\sum_{o'} cnt(o', \bar{f}, \bar{e})}$$
where $o \in \{M, S, D\}$ and $cnt$ is computed on word-aligned parallel data by counting phrase pairs and their orientations. Given the sparsity of the orientation types, we use smoothing. As the decoder develops a new hypothesis by translating a source phrase $\bar{f}_{a_i}$, it scores the orientation $o_i$ with respect to $a_{i-1}$. The log probability of the orientation is added as a feature function to the log-linear translation model.
LR-Hiero uses a subset of the Hiero SCFG rules where the target sides are in Greibach Normal Form (GNF): $\langle \gamma, \bar{e}\,\beta \rangle$, where $\gamma$ is a string of non-terminals and source words, $\bar{e}$ is a target phrase and $\beta$ is a possibly empty sequence of non-terminals. We abuse notation slightly and call this a GNF SCFG. In LR-Hiero each hypothesis consists of a translation prefix, $h_t$, an ordered sequence of untranslated spans on the source sentence, $h_s$, and a numeric cost, $h_c$. The initial hypothesis consists of an empty translation ($\langle s \rangle$), a span covering the whole source sentence and cost 0 (Figure 1). To develop a new hypothesis from a current hypothesis, the LR-Hiero decoder applies a GNF rule to the first untranslated span, $h_s[0]$, of the old hypothesis.
The translation prefix of the new hypothesis is generated by appending the target side of the applied rule, $\bar{e}$, to the translation prefix of the old hypothesis, $h_t$. Corresponding to the applied rule, the uncovered spans of the old hypothesis are also updated and assigned to the new hypothesis (Figure 1).
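The hypothesis update described above can be sketched in a few lines of Python. This is a minimal illustration, not the decoder's actual implementation: the names `Hypothesis`, `initial_hypothesis` and `extend` are hypothetical, spans are assumed to be half-open index pairs, the gap spans of a rule are assumed to be supplied in the order of the non-terminals on the target side, and the cost is simplified to a running sum of rule costs.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Span = Tuple[int, int]  # assumption: half-open source span [start, end)


@dataclass(frozen=True)
class Hypothesis:
    prefix: Tuple[str, ...]   # h_t: translation prefix
    spans: Tuple[Span, ...]   # h_s: ordered list of yet-to-be-covered spans
    cost: float               # h_c: accumulated cost (simplified here)


def initial_hypothesis(n: int) -> Hypothesis:
    """Empty translation <s>, one span over the whole sentence, cost 0."""
    return Hypothesis(("<s>",), ((0, n),), 0.0)


def extend(hyp: Hypothesis,
           target_phrase: Sequence[str],
           gap_spans: Sequence[Span],
           rule_cost: float) -> Hypothesis:
    """Apply a GNF rule to the first uncovered span h_s[0].

    gap_spans are the source spans matched by the rule's non-terminals,
    assumed to be listed in target-side order.
    """
    if not hyp.spans:
        raise ValueError("hypothesis is already complete")
    lo, hi = hyp.spans[0]
    for a, b in gap_spans:
        if not (lo <= a < b <= hi):
            raise ValueError("gap span lies outside the span being rewritten")
    return Hypothesis(
        prefix=hyp.prefix + tuple(target_phrase),      # append the rule's target phrase
        spans=tuple(gap_spans) + hyp.spans[1:],         # replace h_s[0] by the rule's gaps
        cost=hyp.cost + rule_cost,
    )
```

A hypothesis is complete once its list of uncovered spans is empty; the translation is then the accumulated prefix.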
Target generation in LR-Hiero is analogous to phrase-based MT. Given an input sentence $f$, the output translation is a sequence of contiguous target-language phrases $e = (\bar{e}_1, \ldots, \bar{e}_n)$ incrementally concatenated during decoding. We can define a phrase alignment $a = (a_1, \ldots, a_n)$ which aligns each target phrase $\bar{e}_i$ to a source phrase $\bar{f}_{a_i}$ corresponding to the source side of the rule $r_i$ used at step $i$. Unlike target phrases, however, source phrases can be discontiguous. Figure 1 illustrates the process of translating a Chinese-English sentence pair with LR-Hiero. A phrase pair can be created for each rule (shown in Figure 2). The final translation is the ordered sequence of the target sides of these phrase pairs. Although target generation is similar to phrase-based MT, the LR-Hiero decoder parses the source sentence using the SCFG rules, and the order in which source spans are translated is determined by the grammar. Moreover, the LR-Hiero decoder uses an Earley-style parsing algorithm and, unlike CKY, does not use translations of smaller spans to build translations of bigger spans bottom-up.

Training
We compute $P(o \mid \bar{f}, \bar{e})$, the probability of an orientation given the phrase pair of a rule, $r.p = \langle \bar{f}, \bar{e} \rangle$, on word-aligned data using relative frequency. We assume that the phrase $\bar{e}$ spans the word range $s \ldots t$ in the target sentence and the phrase $\bar{f}$ spans the range $u \ldots v$ in the source sentence. For a given phrase pair $\langle \bar{f}, \bar{e} \rangle$, we set $o = M$ if there is a phrase pair $\langle \bar{f}', \bar{e}' \rangle$, spanning $s' \ldots t'$ on the target and $u' \ldots v'$ on the source, whose target side $\bar{e}'$ appears just before $\bar{e}$, i.e. $s = t' + 1$, and whose source side $\bar{f}'$ also appears just before $\bar{f}$, i.e. $u = v' + 1$. The orientation is S if there is a phrase pair $\langle \bar{f}', \bar{e}' \rangle$ where $\bar{e}'$ appears just before $\bar{e}$, i.e. $s = t' + 1$, and $\bar{f}'$ appears just after $\bar{f}$, i.e. $v = u' - 1$. Otherwise the orientation is D. We consider phrase pairs of any length to compute orientation. Note that although the phrase pairs extracted from the rules can be discontinuous (on the source side), only continuous source phrases in each sentence pair are used as previously translated phrases when computing orientation. Once orientation counts for rules (phrase pairs obtained from rules) are collected from the bitext, the probability model $P(o \mid \bar{f}, \bar{e})$ is estimated using recursive MAP smoothing as discussed in Cherry (2013).
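The orientation rules above can be illustrated with a short Python sketch. It assumes inclusive word ranges, a hypothetical `PhrasePair` record holding the target range s..t and the source range u..v, and a tie-break that prefers M over S when both kinds of neighbour exist (the text does not specify one). The estimate shown is the plain relative frequency; the recursive MAP smoothing used in the paper is not reproduced here.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class PhrasePair(NamedTuple):
    s: int  # first target word index (inclusive)
    t: int  # last target word index (inclusive)
    u: int  # first source word index (inclusive)
    v: int  # last source word index (inclusive)


def orientation(pair: PhrasePair, others: Iterable[PhrasePair]) -> str:
    """Return 'M', 'S' or 'D' for `pair` given the other extracted phrase pairs."""
    mono = swap = False
    for p in others:
        if p.t + 1 != pair.s:        # p's target side must end right before pair's
            continue
        if p.v + 1 == pair.u:        # p's source side ends right before pair's: monotone
            mono = True
        elif p.u - 1 == pair.v:      # p's source side starts right after pair's: swap
            swap = True
    if mono:                         # assumption: M takes precedence over S
        return "M"
    return "S" if swap else "D"


def relative_frequency(
    observations: Iterable[Tuple[str, str]]
) -> Dict[str, Dict[str, float]]:
    """Unsmoothed P(o | phrase pair) from (phrase_pair_key, orientation) events."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for key, o in observations:
        counts[key][o] += 1
    return {
        key: {o: c[o] / sum(c.values()) for o in "MSD"}
        for key, c in counts.items()
    }
```

In practice the counts for rare phrase pairs are sparse, which is why a smoothed estimate is used instead of the raw ratio.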

Decoding
Phrase-based LRM uses local information to determine the orientation for a new phrase pair, $\langle \bar{f}_{a_i}, \bar{e}_i \rangle$, during decoding (Koehn et al., 2007; Tillmann, 2004). For left-to-right order, $\bar{f}_{a_i}$ is compared to the previously translated phrase $\bar{f}_{a_{i-1}}$. Galley and Manning (2008) introduce the hierarchical phrase reordering model (HRM), which increases the consistency of orientation assignments. In HRM, the emphasis on the previously translated phrase is removed and instead a compact representation of the full translation history, as represented by a shift-reduce stack, is used. Once a source span is translated, it is shifted onto the stack; if the two spans on the top are adjacent, then a reduction merges the two. During decoding, orientations are always determined with respect to the top of this stack, rather than the previously translated phrase.
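The HRM stack mechanism just described can be sketched as follows. This is an illustrative reading of the paragraph, not the implementation of Galley and Manning (2008): spans are assumed to be inclusive index pairs, the stack is assumed to start with a (-1, -1) start-of-sentence sentinel, reductions are assumed to repeat while the top two spans remain adjacent, and the function names are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # assumption: inclusive (first, last) source indices


def orient(top: Span, span: Span) -> str:
    """Orientation of a newly translated span relative to the stack top."""
    if span[0] == top[1] + 1:     # starts right after the top span
        return "M"
    if span[1] == top[0] - 1:     # ends right before the top span
        return "S"
    return "D"


def adjacent(a: Span, b: Span) -> bool:
    return a[1] + 1 == b[0] or b[1] + 1 == a[0]


def shift_reduce(stack: List[Span], span: Span) -> str:
    """Score `span` against the stack top, shift it, then reduce while possible."""
    o = orient(stack[-1], span)
    stack.append(span)                                   # shift
    while len(stack) >= 2 and adjacent(stack[-1], stack[-2]):
        b = stack.pop()
        a = stack.pop()
        stack.append((min(a[0], b[0]), max(a[1], b[1])))  # reduce: merge the two spans
    return o
```

Because several disjoint spans may remain on the stack, the number of reductions after a shift can grow with the length of the translation history.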
Although we reduce rules to phrase pairs to train the reordering model, the LR-Hiero decoder uses SCFG rules for translation, and the order of source phrases (spans) is determined by the non-terminals in the SCFG rules. Therefore we cannot simply rely on the previously translated phrase to compute the orientation and reordering scores. Since LR-Hiero uses lexicalized glue rules (Watanabe et al., 2006), non-terminals can be matched to very long spans of the source sentence. This makes LRM in LR-Hiero comparable to HRM in phrase-based MT. However, we cannot rely on the full translation history as HRM does, since the translation model is an SCFG that already encodes reordering information.
We employ a shift-reduce approach to find a compact representation of the recently translated source spans, which is also represented by a stack, S, for each hypothesis. However, S always contains just one source span (which might be discontiguous), unlike HRM, which maintains all previously translated solid spans (in Figure 4, the dotted lines show the only span in the stack during LR-Hiero decoding). As the decoder applies a rule $r_i$, the corresponding source phrase $r_i.f$ is compared with the span in S to determine the orientation. If they are adjacent or S covers the span $r_i.f$, they are reduced. Otherwise the stack is set to the span of the new rule, $S = r_i.f$. The orientation of $r_i.f$ is computed with respect to S, but if the two are not adjacent (the orientation is neither M nor S), we still need to consider possible local reordering with respect to the previous rule, $r_{i-1}.f$. In Figure 3, rules #5 and #4 are monotone, while both are covered by the current span in S. Since the stack always contains one span, only a constant number of comparisons is needed to update S and compute the orientation, so each update runs in $O(1)$, unlike HRM, which needs to maintain a sequence of contiguous spans in the stack and runs in linear time.
Figure 3 illustrates the application of the shift-reduce approach to compute orientation for the initial decoding steps of the Chinese-English sentence pair shown in Figure 4. We show source words in the rules with their corresponding index in the source sentence. S and $r_i.f$ for the initial hypothesis are set to −1, corresponding to the start-of-sentence symbol, which makes it easy to compute the correct orientation for spans at the beginning of the input (with index 0).
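One possible reading of this single-span procedure is sketched below in Python. Several details are assumptions made for illustration and are not stated in the text: each (possibly discontiguous) source phrase is approximated by its inclusive bounding span, a reduction takes the bounding span of S and the new phrase, and the fallback to the previous rule is applied whenever the orientation with respect to S is neither M nor S. The function names are hypothetical.

```python
from typing import Tuple

Span = Tuple[int, int]    # assumption: inclusive bounding span (first, last)
START: Span = (-1, -1)    # start-of-sentence sentinel, as described in the text


def orient(ref: Span, cur: Span) -> str:
    """Orientation of `cur` relative to `ref`."""
    if cur[0] == ref[1] + 1:      # cur starts right after ref ends
        return "M"
    if cur[1] == ref[0] - 1:      # cur ends right before ref starts
        return "S"
    return "D"


def covers(outer: Span, inner: Span) -> bool:
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def step(stack: Span, prev: Span, cur: Span) -> Tuple[str, Span]:
    """Return (orientation of cur, updated single-span stack).

    stack: the one span kept in S; prev: source span of the previous rule;
    cur: source span of the rule being applied.
    """
    o = orient(stack, cur)
    is_adjacent = o in ("M", "S")
    if not is_adjacent:
        o = orient(prev, cur)     # fall back to local reordering w.r.t. the previous rule
    if is_adjacent or covers(stack, cur):
        stack = (min(stack[0], cur[0]), max(stack[1], cur[1]))  # reduce
    else:
        stack = cur               # S is reset to the span of the new rule
    return o, stack
```

Each call performs a fixed number of comparisons, which matches the constant-time behaviour claimed for the single-span stack.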

Experiments
We evaluate the lexicalized reordering model for LR-Hiero on three language pairs: German-English (De-En), Czech-English (Cs-En) and Chinese-English (Zh-En). Table 1 shows the corpus statistics for all language pairs.
We train a 5-gram LM on the Gigaword corpus using KenLM (Heafield, 2011). The weights in the log-linear model are tuned with MERT (Och, 2003), minimizing BLEU loss on the dev set for each language pair; we then report BLEU scores on the test set. The pop limit for Hiero and LR-Hiero is 500 and the beam size for Moses is 1000. Other extraction and decoder settings, such as maximum phrase length, are identical across settings.
We use three baselines in our experiments:
• Hiero: we use our in-house Python implementation of Hiero, Kriya (Sankaran et al., 2012). Kriya obtains BLEU scores that are statistically comparable to those of Moses (Koehn et al., 2007) for several language pairs (Callison-Burch et al., 2012).
To make the results comparable, we use the standard SMT features for the log-linear model in all translation systems: relative-frequency translation probabilities $p(f \mid e)$ and $p(e \mid f)$, lexical translation probabilities $p_l(f \mid e)$ and $p_l(e \mid f)$, a language model probability, word count, phrase count and distortion. In addition, two distortion features proposed by Siahbani et al. (2013) are added to both Hiero and LR-Hiero. The LRM proposed in this paper uses a GNF grammar and LR decoding; therefore we apply it only to LR-Hiero. The GNF rules are obtained from word- and phrase-aligned bitext using the rule extraction algorithm proposed by Siahbani and Sarkar (2014a).
Table 3 compares the performance of the different translation systems in terms of translation quality (BLEU). For all language pairs the proposed lexicalized reordering model improves the translation quality of LR-Hiero. These observations are comparable to the effect of LRM in phrase-based translation systems. In Cs-En, LRM gets the best results, and it significantly improves the LR-Hiero results for De-En and Zh-En (p-value < 0.05, evaluated by MultEval (Clark et al., 2011)). To compare our approach to Nguyen and Vogel (2013), we adapt their algorithm to LR-Hiero and use the same LRM trained for GNF rules (marked as NVLRM in Table 3). Unsurprisingly, this approach does not improve translation quality in LR-Hiero. It computes the LRM for all candidate translations of each span after obtaining the full translations. In bottom-up decoders this helps to prune hypotheses effectively, whereas in the LR-Hiero decoder a rule is applied before the translations of smaller spans are known, so the computation of the LRM is postponed and becomes less effective during decoding.
Table 2 shows the performance in terms of decoding speed. We use the same wrapper for Hiero and LR-Hiero to query the language model and report the average over a sample of 50 sentences from the test sets.
We can see that LR-Hiero+LRM still makes three times fewer LM calls than Hiero, which leads to faster decoding.

Conclusion
We have proposed a novel lexicalized reordering model (LRM) for LR-Hiero, the left-to-right variant of Hiero, that is distinct from previous LRMs. The LRMs previously proposed for Hiero are applicable only to bottom-up decoders such as CKY, whereas our model is designed for the left-to-right decoding algorithm of LR-Hiero. We showed that our novel shift-reduce algorithm for decoding with the lexicalized reordering model improved the translation quality of LR-Hiero on three different language pairs, with statistically significant gains for De-En and Zh-En.