Extended Translation Models in Phrase-based Decoding

We propose a novel extended translation model (ETM) to counteract two problems in phrase-based translation: the lack of translation context when using single-word phrases and uncaptured dependencies beyond phrase boundaries. The ETM operates on the word level and augments the IBM models by an additional bilingual word pair and a reordering operation. Its implementation in a phrase-based decoder introduces translation and reordering dependencies for single-word phrases and dependencies across phrase boundaries. Moreover, the model incorporates an explicit treatment of multiple and empty alignments. Its integration significantly outperforms competitive systems that include lexical and phrase translation models as well as hierarchical reordering models on four language pairs, by +0.7% BLEU on average. Although simpler and using fewer dependencies, the ETM proves to be on par with 7-gram operation sequence models (Durrani et al., 2013b).


Introduction
The first successful steps in Statistical Machine Translation were taken by applying word-based models in a source-channel approach (Brown et al., 1990; Brown et al., 1993). Within this framework, the language model (LM) is estimated on monolingual n-grams, whereas the translation models IBM-1 to IBM-5 are trained on bilingual data using word alignments. The disadvantage of word-to-word translation is overcome by phrase-based translation (PBT) (Och et al., 1999; Zens et al., 2002; Koehn et al., 2003) and log-linear model combination. The open question is how much actual lexical context is included in decoding. Figure 1 depicts the relative word frequencies plotted against the length of the phrase they were translated with for the IWSLT 2014 German→English and English→French tasks. For English→French, more than 40% of the words are translated using single- or two-word phrases, i.e. with a lexical context of at most one word. For the German→English task, more reorderings occur and lead to less monotone alignments. Here, as many as 60% of all words are translated with a lexical context of at most one word, and over 20% are translated without any lexical context at all. We address this problem by developing two variants of extended translation models (ETM): the direct model (EdTM) for the Source→Target direction and the inverse model (EiTM) for the Target→Source direction. They operate on the word level and augment the IBM models by an additional bilingual word pair and a reordering operation. We introduce them into the log-linear framework of a PBT system. Thus, the decoding of single-word phrases can benefit from lexical and reordering context. Moreover, the ETM makes it possible to capture dependencies across phrase boundaries and long-range source dependencies. It incorporates reordering information for non-monotone and multiple alignments, including unaligned words.
As a first step, we implement the ETM as a count model with interpolated Kneser-Ney smoothing (Chen and Goodman, 1998) using the Viterbi alignment and apply it in phrase-based decoding. Nevertheless, the long-term goal of this approach is to replace the phrases used in decoding by translation units that predict a single target word, but may depend on several source words, previously translated target words and the reordering context.

Previous Work
Various approaches have been taken to compensate for the downsides of the phrase translation model. Mariño et al. (2006) introduce a translation model based on n-grams of bilingual word pairs, i.e. a bilingual language model (BILM), with an n-gram decoder that requires monotone alignments. Niehues et al. (2011) advance this further with BILMs operating on non-monotone alignments within a PBT framework.
However, this differs from our approach: BILMs treat jointly aligned source words as atomic units, ignore source deletions and do not include reordering context.
The Operation Sequence Model (OSM) introduced in (Durrani et al., 2011;Durrani et al., 2013a) includes n-grams of both translation and reordering operations in a consistent framework. It utilizes minimal translation units (MTUs) and is applied in a corresponding OSM decoder. Experiments in (Durrani et al., 2013b) show that a slightly enhanced version of OSM performs best when integrated into the log-linear framework of a phrase-based decoder. Both the BILM (Stewart et al., 2014) and the OSM (Durrani et al., 2014) can be smoothed using word clusters.
In comparison, the ETM is much simpler: since it predicts probabilities of single words, it has a smaller vocabulary. Moreover, it does not make use of reordering gaps, i.e. it utilizes a simpler reordering approach. The OSM uses one joint model for reorderings and translations. In contrast, the ETM incorporates separate models to estimate the probability of words and the probability of reorderings. Furthermore, the OSM has the drawback that it extracts the MTUs sentence by sentence, so one word can appear in several MTUs extracted from different sentence pairs. Since an MTU is treated as an atomic unit, this distributes probability mass over overlapping events. The ETM overcomes this drawback by operating on single words. Guta et al. (2015) propose the conversion of bilingual sentence pairs and word alignments into joint translation and reordering (JTR) sequences. They investigate n-gram models with modified Kneser-Ney smoothing as well as feed-forward and recurrent neural networks trained on JTR sequences. In comparison to the OSM, JTR models have smaller vocabulary sizes, as they operate on words, and incorporate simpler reordering structures. Nevertheless, they are shown to perform slightly better than the OSM when included in the log-linear framework of a phrase-based decoder.
Although our approach is similar, there are the following significant differences: on the one hand, the ETM estimates the probability of single words conditioned on an extended lexical and reordering context, whereas the JTR n-gram model predicts the probability of bilingual word pairs. On the other hand, we do not assume linear sequences of dependencies, but propose an explicit treatment of multiply aligned words. Deng and Byrne (2005) present an HMM approach for word-to-phrase alignments, which performs similarly to IBM-4 on the task of bitext alignment and can also be applied for more powerful phrase induction. Another approach introduces a reordering model based on sequence labeling techniques by converting the reordering problem into a tagging task. Zhang et al. (2013) explore different Markov chain orderings for an n-gram model on MTUs. These are not integrated into decoding, but used in N-best rescoring. Another generative, word-based Markov chain translation model is presented by Feng and Cohn (2013). It exploits a hierarchical Pitman-Yor process for smoothing, but is only applied to induce word alignments. Their follow-up work (Feng et al., 2014) introduces a Markov model on MTUs, similar to the OSM described above.
Finally, there has been recent research on applying neural network models for extended context (Le et al., 2012;Auli et al., 2013;Hu et al., 2014;Devlin et al., 2014;Sundermeyer et al., 2014). All of these papers focus on lexical context and ignore the reordering aspect covered in our work.

Extended Translation Models
Given a source sentence $f_1^J$ and its translation $e_1^I$, the EiTM models the inverse probability $p(f_1^J \mid e_1^I)$ and the EdTM the direct probability $p(e_1^I \mid f_1^J)$. We allow source words to be translated to multiple target words and vice versa. The inverted alignment $b_i$ denotes the sequence of source positions $j$ aligned to target position $i$ for $i = 1, \dots, I$. Its subsequence $b_i^{<j}$ includes all source positions in $b_i$ preceding a given source position $j$, i.e. $b_i^{<j} = \{ j' \in b_i : j' < j \}$. Unaligned target words are aligned to the empty source word $f_0$, unaligned source words to the empty target word $e_0$, and $b_0$ denotes the unaligned source positions. We introduce the fertility $\varphi_i$ of a target word $e_i$, which is the number of source words aligned to $e_i$. By analogy, we use $\varphi_i^{<j}$ to denote the number of source positions in $b_i^{<j}$. Similar to the approach in (Feng and Cohn, 2013), we generalize reorderings to jump classes $\Delta$ between pairs of source positions. In the following, we present the derivations of the EiTM and the EdTM. Although they operate in opposite translation directions, both models incorporate the inverted alignment $b_1^I$.
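The notation above can be made concrete with a small sketch. The function names and the conventions for empty words (position 0 for $f_0$, fertility counting only real source positions) are assumptions made for illustration, not part of the original formulation.

```python
def inverted_alignment(links, num_source, num_target):
    """Return {i: b_i} for i = 0..I from 1-based (j, i) alignment links.

    Assumed conventions: b[0] holds the unaligned source positions, and an
    unaligned target position i gets b[i] == [0] (the empty source word f_0).
    """
    b = {i: [] for i in range(1, num_target + 1)}
    aligned_sources = set()
    for j, i in sorted(links):          # sorted so each b_i is in source order
        b[i].append(j)
        aligned_sources.add(j)
    for i in b:
        if not b[i]:
            b[i] = [0]                  # target word aligned to f_0
    b[0] = [j for j in range(1, num_source + 1) if j not in aligned_sources]
    return b


def preceding(b_i, j):
    """b_i^{<j}: source positions in b_i that precede source position j."""
    return [jp for jp in b_i if 0 < jp < j]


def fertility(b_i):
    """phi_i: number of real (non-empty) source words in b_i (assumption)."""
    return sum(1 for j in b_i if j > 0)


# Toy example: 4 source words, 3 target words, links given as (j, i).
b = inverted_alignment({(1, 1), (2, 1), (4, 2)}, num_source=4, num_target=3)
```

In this toy example, target word 1 is aligned to source positions 1 and 2, target word 3 is an insertion, and source position 3 is a deletion.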

Extended Inverse Translation Model
In order to model the inverse probability $p(f_1^J \mid e_1^I)$, the unknown inverted alignment $b_1^I$ is introduced as a hidden variable and approximated by the Viterbi alignment.
The inverse probability is thereby decomposed into the deletion probability $p(f_{b_0} \mid f_{b_1}^{b_I}, b_1^I, e_1^I)$ and the joint probability $p(f_{b_1}^{b_I}, b_1^I \mid e_1^I)$. The latter is reformulated using the Markov chain rule. In order to restrict the history, we assume the probability of $(f_{b_i}, b_i)$ to depend only on the current target word $e_i$, its last aligned predecessor $e_{i'}$, the corresponding alignment $b_{i'}$ and the source words $f_{b_{i'}}$. The conditional joint probability is then factorized into the lexicon probability of $f_{b_i}$ and the alignment probability of $b_i$. In a nutshell, we have decomposed the inverse probability into three probabilities: deletion, lexicon and alignment. Below, we show how to estimate these probabilities using the EiTM deletion, lexicon and alignment models.
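The chain of steps described above can be sketched as follows. This is a reading of the prose, not a verbatim formula: $i'$ denotes the last aligned predecessor of target position $i$, and the exact conditioning of the two final factors is an assumption.

```latex
\begin{align*}
p(f_1^J \mid e_1^I)
  &= \sum_{b_1^I} p(f_1^J, b_1^I \mid e_1^I)
   \approx \max_{b_1^I} \; p(f_1^J, b_1^I \mid e_1^I) \\
p(f_1^J, b_1^I \mid e_1^I)
  &= p(f_{b_0} \mid f_{b_1}^{b_I}, b_1^I, e_1^I)
     \cdot p(f_{b_1}^{b_I}, b_1^I \mid e_1^I) \\
p(f_{b_1}^{b_I}, b_1^I \mid e_1^I)
  &= \prod_{i=1}^{I} p(f_{b_i}, b_i \mid f_{b_1}^{b_{i-1}}, b_1^{i-1}, e_1^I)
   \approx \prod_{i=1}^{I} p(f_{b_i}, b_i \mid e_i, e_{i'}, b_{i'}, f_{b_{i'}}) \\
p(f_{b_i}, b_i \mid e_i, e_{i'}, b_{i'}, f_{b_{i'}})
  &= \underbrace{p(f_{b_i} \mid e_i, e_{i'}, b_i, b_{i'}, f_{b_{i'}})}_{\text{lexicon}}
     \cdot \underbrace{p(b_i \mid e_i, e_{i'}, b_{i'}, f_{b_{i'}})}_{\text{alignment}}
\end{align*}
```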

EiTM: Deletion Model
Due to its artificiality, $e_0$ has no preceding target word. We condition the deletion of $f_{b_0}$ only on $e_0$ and assume conditional independence between the unaligned source words $f_{b_0}$.
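Under these two assumptions, the deletion probability would factorize over the unaligned source positions. The following is a sketch derived from the prose rather than a verbatim formula:

```latex
\[
p(f_{b_0} \mid f_{b_1}^{b_I}, b_1^I, e_1^I)
  \approx p(f_{b_0} \mid e_0)
  = \prod_{j \in b_0} p(f_j \mid e_0)
\]
```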

EiTM: Lexicon Model
First, we apply the Markov chain rule to obtain the factorized probabilities of single words $f_j$.
Each source word $f_j$ depends on all predecessors $f_{b_i^{<j}}$ aligned to the same target word $e_i$ and on all previously aligned source words $f_{b_{i'}}$. If we modelled the probability conditioned on the sets of source words $f_{b_{i'}}$ and $f_{b_i^{<j}}$, this would lead to sparsity problems due to the arbitrary number of source words contained in the sets.
In order to avoid this, we condition the probability on the individual words contained in $f_{b_{i'}}$ and $f_{b_i^{<j}}$. Without any additional information, we assume all words in $f_{b_{i'}}$ and $f_{b_i^{<j}}$ to be equally important for the prediction of $f_j$. Thus, we average over the probabilities conditioned on (i) all source words $f_{j'}$ aligned to the preceding target word $e_{i'}$ and (ii) all preceding source words $f_{\bar{j}}$ aligned to the current target word $e_i$.
Moreover, we reduce the alignments $(b_i, b_{i'})$ to their corresponding jump classes, which yields the final EiTM lexicon model.
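The averaging step can be illustrated with a minimal sketch. It assumes a uniform average over the union of both context sets, which is one reading of "equally important". The callables `p_prev` and `p_same` are hypothetical stand-ins for the smoothed count models, and the jump class labels are placeholders.

```python
def averaged_lexicon_prob(f_j, e_i, e_prev, prev_ctx, same_ctx, p_prev, p_same):
    """Average p(f_j | ...) over individual context words.

    prev_ctx: (source word, jump class) pairs for words aligned to e_prev.
    same_ctx: (source word, jump class) pairs for preceding words aligned to e_i.
    p_prev, p_same: hypothetical callables returning conditional probabilities.
    """
    # One term per source word aligned to the preceding target word.
    terms = [p_prev(f_j, e_i, e_prev, f_ctx, jump) for f_ctx, jump in prev_ctx]
    # One term per preceding source word aligned to the current target word.
    terms += [p_same(f_j, e_i, f_ctx, jump) for f_ctx, jump in same_ctx]
    if not terms:
        raise ValueError("at least one context word is required")
    # Uniform average: every context word is treated as equally important.
    return sum(terms) / len(terms)
```

Because each term conditions on a single context word, the number of distinct events stays bounded regardless of how many source words are aligned to one target word.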

EiTM: Alignment Model
In principle, we follow the same derivation as for the lexicon model above. The probability of a source position $j \in b_i$ is computed as the average probability of a jump from a previously aligned source position, which either has to be aligned to the target predecessor $i'$ or is a preceding source position aligned to the same target word $e_i$.
Lines (1), (2), (3) and (7) are dependencies included in the EiTM but not in phrase translation models due to the phrase extraction heuristics. The dependency on multiple preceding word pairs is exemplified in (2) and (3). Line (4) depicts the insertion of the target word $e_5$ = up conditioned on the word pair ($e_4$ = us, $f_7$ = uns). Note that in (5) there is no dependency of $e_6$ = is on its predecessor $e_5$ = up and the empty word $f_0$, but on its last aligned predecessor $e_4$ = us and the corresponding source word $f_7$ = uns. Line (6) shows an example of a source word aligned to multiple target words. The deletion probability of the source word $f_3$ = es is presented in (8).

Extended Direct Translation Model
So far, we have introduced the EiTM, which models the inverse translation probability $p(f_1^J \mid e_1^I)$. Our aim is to employ extended translation models to model the direct probability $p(e_1^I \mid f_1^J)$ as well. As a start, the direct probability can be modelled using the EiTM: source and target corpora simply have to be swapped for the training of the EiTM. In doing so, the alignment has to be inverted as well, i.e. one has to use the direct alignment $a_j$, which denotes the sequence of target positions $i$ aligned to source position $j$. As a result, the EiTM models $p(e_{a_0}^{a_J}, a_1^J \mid f_1^J)$ when trained with inverted corpora and alignments.
During the decoding process, the partial hypotheses are generated successively. Thus, for each target word $e_i$ that is hypothesized, all its predecessors have already been translated, i.e. its last aligned predecessor $e_{i'}$, the corresponding alignment $b_{i'}$ and the source words $f_{b_{i'}}$ are known.
Nevertheless, source words do not have to be translated in monotone order. In general, it cannot be guaranteed that the predecessor $f_{j-1}$ of the first word $f_j$ of a source phrase has been translated yet. Therefore, the last aligned predecessor of $f_j$ and its aligned target words are generally unknown.
As a result, when applying the EiTM within phrase-based decoding for modelling the direct probability $p(e_1^I \mid f_1^J)$, dependencies beyond phrase boundaries cannot be captured.
Thus, we additionally develop the EdTM, which models the direct translation probability $p(e_1^I \mid f_1^J)$. In comparison to the EiTM trained with swapped corpora and alignments, the EdTM incorporates dependencies beyond phrase boundaries by keeping the inverted alignment $b_1^I$ instead of using $a_1^J$. Analogously to the EiTM, the hidden alignment $b_1^I$ is approximated by the Viterbi alignment.
We apply the Markov chain rule and assume $(e_i, b_i)$ to depend only on the aligned source words $f_{b_i}$, the previously aligned target word $e_{i'}$, the corresponding alignment $b_{i'}$ and the source words $f_{b_{i'}}$. We then factorize the resulting joint probability to obtain the lexicon probability of $e_i$ and the alignment probability of $b_i$.
The direct probability has thus been decomposed into three probabilities: deletion, lexicon and alignment.

EdTM: Deletion Model
The EdTM deletion model approximates the probability of $e_0$ conditioned on all unaligned source words $f_{b_0}$ and is obtained by averaging over all unaligned source words.
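Read literally, this averaging could be written as below. The normalization by the number of unaligned source positions $|b_0|$ is an assumption drawn from the prose:

```latex
\[
p(e_0 \mid f_{b_0}) \approx \frac{1}{|b_0|} \sum_{j \in b_0} p(e_0 \mid f_j)
\]
```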

EdTM: Lexicon Model
In contrast to the derivation of the EiTM, the Markov chain rule cannot be applied at this point, since we do not model the probability of $f_{b_i}$, but the probability of $e_i$ conditioned on $f_{b_i}$. Thus, we average over all aligned source words $f_{b_i}$ to obtain the EdTM lexicon model.

EdTM: Alignment Model
Applying the same assumptions as for the lexicon model yields the EdTM alignment model.

Count Models and Smoothing
So far, we have introduced the ETM and shown how to include unaligned words and multiple word dependencies. However, there are various ways to train the lexicon and alignment probabilities derived in Subsections 3.1 and 3.2. As a starting point, we apply relative frequencies obtained from bilingual training data, where the Viterbi alignment is estimated using GIZA++ (Och and Ney, 2003). In order to address data sparseness, we apply interpolated Kneser-Ney smoothing as described in (Chen and Goodman, 1998). In comparison to monolingual n-grams used in LMs, we lack any clear order of $e$, $f$, $e'$, $f'$ and $\Delta$, since they include bilingual and reordering information. Similar to the approach taken by Mariño et al. (2006), we model the probability of the bilingual word pair $(e, f)$ given its predecessor $(e', f', \Delta)$, which also includes the jump class. The EdTM lexicon model for dependencies on previously aligned target words is computed as $p(e \mid f, e', f', \Delta) = p(e, f \mid e', f', \Delta) \, / \, p(\cdot, f \mid e', f', \Delta)$, where $p(e, f \mid e', f', \Delta)$ is the bigram distribution of $(e, f)$ given its predecessor $(e', f', \Delta)$ with interpolated Kneser-Ney smoothing. The denominator $p(\cdot, f \mid e', f', \Delta)$ is obtained by marginalizing $p(e, f \mid e', f', \Delta)$ over all target words $e$. We follow the same approach for all other models by analogy.
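The marginalization step can be sketched with plain relative frequencies. This is a simplification: Kneser-Ney smoothing is omitted here, and the event layout `((e, f), (e_prev, f_prev, jump))` as well as the function name are assumptions for illustration.

```python
from collections import Counter


def conditional_from_joint(events):
    """Build p(e | f, history) from joint bigram events.

    events: iterable of ((e, f), history) pairs, where history is a
    (e_prev, f_prev, jump_class) tuple. Unsmoothed relative frequencies.
    """
    joint = Counter(events)                 # N(e, f, history)
    marginal = Counter()                    # N(., f, history), summed over e
    for ((e, f), history), count in joint.items():
        marginal[(f, history)] += count

    def prob(e, f, history):
        denom = marginal[(f, history)]
        # N(history) cancels in p(e, f | h) / p(., f | h).
        return joint[((e, f), history)] / denom if denom else 0.0

    return prob
```

With smoothed models, the same ratio would be taken between the smoothed joint probability and its sum over all target words, rather than between raw counts.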

Integration into Phrase-based Decoding
In this work, we apply a standard phrase-based translation system (Koehn et al., 2003). The decoding process is implemented as a beam search for the best translation given a set of models $h_m(e_1^I, s_1^K, f_1^J)$. The goal of search is to maximize the log-linear feature score $\sum_m \lambda_m h_m(e_1^I, s_1^K, f_1^J)$ (Och and Ney, 2004), where $s_1^K = s_1 \dots s_K$ is the hidden phrase alignment. The feature weights $\lambda_m$ are tuned with minimum error rate training (MERT) (Och, 2003). The models $h_m$ that are part of all baselines presented in this work are phrasal and lexical translation scores in both directions, an n-gram LM, a simple distance-based distortion model, and word and phrase penalties. All phrase pairs that are licensed by the word alignment are extracted from the training corpus and their probabilities estimated as relative frequencies. Moreover, the word alignment each phrase pair has been extracted from is memorized in the phrase table.
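As a minimal illustration of the log-linear combination, the sketch below scores a handful of complete hypotheses and picks the best one. Real decoding searches over partial hypotheses with a beam; the feature names and values here are hypothetical.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature scores h_m with weights lambda_m."""
    return sum(weights[name] * value for name, value in features.items())


def best_hypothesis(candidates, weights):
    """Return the candidate with the highest log-linear score."""
    return max(candidates, key=lambda c: loglinear_score(c["features"], weights))


# Hypothetical feature names; values are log-probabilities or penalties.
weights = {"phrase_tm": 1.0, "lm": 0.5, "etm": 0.8, "word_penalty": -0.2}
candidates = [
    {"text": "hypothesis A",
     "features": {"phrase_tm": -2.0, "lm": -4.0, "etm": -1.0, "word_penalty": 2.0}},
    {"text": "hypothesis B",
     "features": {"phrase_tm": -1.5, "lm": -6.0, "etm": -3.0, "word_penalty": 2.0}},
]
best = best_hypothesis(candidates, weights)
```

Adding the ETM amounts to adding further entries to the feature dictionary, with their weights tuned jointly with the others.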
Our extended translation models are integrated into this framework as additional features h m . They are trained in both directions on a bilingual corpus and the Viterbi alignment, resulting in four additional features. When training in the Target→Source direction, the alignment direction is also swapped. Thus, EiTM and EdTM have the advantage of including context beyond phrase boundaries only when trained in the Source→Target direction.
To include the extended translation models into the phrasal decoder, the source position aligned to the last (not inserted) target word of the previously translated phrase has to be memorized in the search state of a partial hypothesis. Although this slightly affects hypothesis recombination and therefore leads to a larger search space, in practice it does not degrade the search accuracy, as experiments with relaxed pruning parameters have shown.
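The effect on hypothesis recombination can be sketched as follows. The state layout is an assumption for illustration: an actual decoder state contains more fields, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchState:
    coverage: frozenset        # source positions translated so far
    lm_history: tuple          # last target words needed by the n-gram LM
    last_aligned_source: int   # source position aligned to the last
                               # non-inserted target word (needed by the ETM)


def recombine(hypotheses):
    """Keep only the best-scoring hypothesis per search state.

    hypotheses: iterable of (SearchState, score) pairs.
    """
    best = {}
    for state, score in hypotheses:
        if state not in best or score > best[state]:
            best[state] = score
    return best
```

Two hypotheses that agree on coverage and LM history but differ in the memorized source position can no longer be recombined, which is why the search space grows slightly.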

Evaluation
We perform experiments on the large-scale IWSLT 2014 (Cettolo et al., 2014) German→English and English→French tasks and on the large-scale DARPA BOLT Chinese→English and Arabic→English tasks. As mentioned in Section 4, all baseline systems include phrasal and lexical smoothing scores trained in both directions. Word alignments are trained with GIZA++ by sequentially running 5 iterations each for the IBM-1, HMM and IBM-4 alignment models.
The domain of IWSLT consists of lecture-type talks presented at TED conferences, which are also available online. The baseline systems are trained on all provided bilingual data. All systems are optimized on the dev2010 corpus and evaluated on the test2010 corpus. The ETM is trained on the TED portions of the data: 138K sentences for German→English and 185K sentences for English→French.
For German→English, to estimate the 4-gram LM, we additionally make use of parts of the Shuffled News, LDC English Gigaword and $10^9$-French-English corpora, selected by a cross-entropy difference criterion (Moore and Lewis, 2010). In total, 1.7 billion running words are taken for LM training. For English→French, we use a large general-domain 5-gram LM and an in-domain 5-gram LM. Both are estimated with the KenLM toolkit (Heafield et al., 2013) using interpolated Kneser-Ney smoothing. For the general-domain LM, we first select 1/2 of the English Shuffled News, 1/4 of the French Shuffled News as well as both the English and French Gigaword corpora by the same cross-entropy difference criterion. By concatenating this selection with all available remaining monolingual data, we build an unpruned LM.
The BOLT tasks are evaluated on the "discussion forum" domain. For Chinese→English, the baseline is trained on 4.08M general-domain sentence pairs and the 5-gram LM on 2.9 billion running words in total. The ETM is trained on an in-domain subset of 67.8K sentences and the test set contains 1844 sentences. For the Arabic→English BOLT task, we use only the in-domain data for training the baseline and the ETM. The training and test sets contain text drawn from discussion forums in Egyptian Arabic. The evaluation set contains 1510 bilingual sentence pairs.
The baseline systems for all tasks except the Arabic→English BOLT task, where preliminary experiments showed no improvement, contain a 7-gram word cluster language model (Wuebker et al., 2013). For comparison, we also experiment with a hierarchical reordering model (HRM) (Galley and Manning, 2008). Durrani et al. (2013b) have shown the OSM to outperform bilingual LMs on MTUs when integrated into a phrase-based decoder. Therefore, we directly compare against a 7-gram OSM implemented in our phrase-based decoder as an additional feature. The OSM is trained on the same data as the ETM for all tasks. Bilingual data statistics for all tasks are shown in Table 1. For each system setting we evaluate three MERT runs using multeval (Clark et al., 2011). Results are reported in BLEU (Papineni et al., 2001) and TER (Snover et al., 2006). The optimization criterion for all experiments is BLEU.

Model parameters
To measure the complexity of the extended translation models in comparison to the phrase-based translation model, we count the number of parameters to be trained for each. The total number of parameters to be trained for the ETM is slightly more than the 57M parameters for the phrase translation model.

Results
In order to compare the effect of the EiTM and EdTM used in a phrase-based decoder, we have trained the baseline including the HRM as described above on the full German→English bilingual data of the IWSLT task and the extended translation models on the TED data. The results evaluated on test2010 are shown in Table 3. Including the EiTM trained in both the German→English and English→German directions into the phrasal decoder yields an absolute improvement of +0.7 BLEU and -1.0 TER, whereas including the EdTM yields +0.9 BLEU and -1.2 TER. This underlines that the EdTM is more suitable for translation than the EiTM because it predicts the direct probability of a target word, which corresponds to the actual translation direction. Note that both EiTM and EdTM lose the advantage of modelling dependencies beyond phrase boundaries when trained in the inverse direction English→German. Therefore, we have evaluated their joint performance when trained only in the German→English direction, which is similar to the performance of the EdTM trained in both directions. This may be due to the fact that even though the EiTM trained in the German→English direction incorporates dependencies beyond phrase boundaries, the EdTM trained in the English→German direction profits from the better suited direct translation probability. The full ETM, i.e. EiTM and EdTM trained in both directions, yields the best overall performance gain of +1.1 BLEU and -1.1 TER over the baseline. Moreover, we evaluate the performance of the (full) ETM compared to the HRM and a 7-gram OSM, which are all introduced as additional features into the log-linear framework of the baseline phrase-based decoder. The results are presented in Table 4. The ETM performs similarly to the HRM for the Chinese→English and Arabic→English tasks, resulting in +0.3 BLEU over the PBT baseline.
For both IWSLT tasks, the ETM outperforms the HRM by +0.7 BLEU, gaining +0.8 BLEU for the German→English and +1.0 BLEU for the English→French task over the PBT baseline. The context captured by the ETM corresponds roughly to the context captured by a 3-gram OSM. Bearing this in mind, we compare the ETM to a 7-gram OSM, which yields +0.25 BLEU more than the ETM averaged over the four language pairs. For the Arabic→English task, the OSM vocabulary of 1.5M words compares to 285K words in the Arabic corpus, i.e. the ETM vocabulary is about five times smaller than the OSM vocabulary.
We also compare the ETM to the OSM on top of a PBT system that also includes the HRM, which is shown in the last two lines of Table 4. The performance of the ETM benefits from the information introduced by the HRM, as the gain of using the ETM is further increased by +0.15 BLEU on average. Overall, the ETM gains consistent and statistically significant improvements of +0.7 BLEU on average for all four language pairs over a state-of-the-art phrase-based decoder including the HRM. On the other hand, the OSM seems to have a higher overlap with the HRM, as the gain of the OSM compared to the ETM is reduced to +0.1 BLEU on average. Thus, on top of the phrase-based system including the HRM, the ETM including a bilingual word pair and the corresponding reordering jump class proves to be competitive with a 7-gram OSM.

Discussion
We have integrated two variants of a novel extended translation model into a state-of-the-art phrase-based decoder. The ETM captures lexical and reordering context beyond phrase boundaries in both the Source→Target and Target→Source directions. Further, the model potentially captures long-range reorderings and utilizes multiple and empty alignments, allowing for target insertions and source deletions. As an initial step, we have implemented the ETM using relative frequencies with interpolated Kneser-Ney smoothing. Its consistent and statistically significant improvement of up to +1.1 BLEU and -1.1 TER, and of +0.7 BLEU on average, has been shown for four large-scale translation tasks, outperforming competitive phrase-based systems that include lexical and phrase translation models and hierarchical reordering models.
Compared to a 7-gram OSM, the ETM is much simpler in design: it uses a smaller vocabulary, estimates the probability of single words instead of bilingual MTUs, avoids the need for reordering gaps and includes less lexical and reordering context, thus being less sparse. Despite this, it performs competitively with a 7-gram OSM on top of phrase-based systems including the HRM. This fact underlines the advantages introduced by the ETM: it operates on words rather than MTUs, explicitly models multiple alignments instead of incorporating linear dependencies, and models reorderings in a less complex way.
So far we have used the ETM as an additional feature in a phrase-based decoder, but we believe that the usage of such a decoder is a limitation. First, the ETM is estimated on alignments, which themselves are optimized for the IBM models. Second, decoding is performed using phrases that are extracted from the alignments using heuristics. Therefore, the potential of a phrase-based decoder is also limited by these heuristics.
Based on these facts, we believe that the ETM will show its full potential when it is also integrated into the training of the alignment, leading not only to a higher alignment quality, but also to a joint optimization of the alignments and the ETM. Further, directly applying the ETM within a word-based decoder that utilizes an extended translation and reordering context will make phrases, and thus any extraction heuristics, redundant. We believe that a consistent framework where the ETM is applied in both training the alignments and decoding will significantly advance machine translation.
For the short term, we will investigate better smoothing strategies and the possibilities of using neural networks instead of count models.