Source-Side Left-to-Right or Target-Side Left-to-Right? An Empirical Comparison of Two Phrase-Based Decoding Algorithms

This paper describes an empirical study of the phrase-based decoding algorithm proposed by Chang and Collins (2017). The algorithm produces a translation by processing the source-language sentence in strictly left-to-right order, differing from commonly used approaches that build the target-language sentence in left-to-right order. Our results show that the new algorithm is competitive with Moses (Koehn et al., 2007) in terms of both speed and BLEU scores.


Introduction
Phrase-based models (Koehn et al., 2003; Och and Ney, 2004) have until recently been a state-of-the-art method for statistical machine translation, and Moses (Koehn et al., 2007) is one of the most widely used phrase-based translation systems. Moses uses a beam search decoder based on a dynamic programming algorithm that constructs the target-language sentence from left to right (Koehn et al., 2003). Neural machine translation systems (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014), which have given impressive improvements over phrase-based systems, also typically use models and decoders that construct the target-language string in strictly left-to-right order.
Recently, Chang and Collins (2017) proposed a phrase-based decoding algorithm that processes the source-language string in strictly left-to-right order. Reordering is implemented by maintaining multiple sub-strings in the target language, with phrases being used to extend these sub-strings by various operations (see Section 2 for a full description). With a fixed distortion limit on reordering, the time complexity of the algorithm is linear in the sentence length and polynomial in other factors. Chang and Collins (2017) present the algorithm and give a proof of its time complexity, but do not describe experiments, leaving open the question of whether the algorithm is useful in practice. This paper complements the original paper by studying the algorithm empirically. In addition to an exact dynamic programming implementation, we study the use of beam search with the algorithm, and another pruning method that restricts the maximum number of target-language strings maintained at any point. The experiments show that the algorithm is competitive with Moses in terms of both speed and translation quality (BLEU score).
* On leave from Columbia University.
The new decoding algorithm is of interest for a few reasons. While the experiments in this paper are with phrase-based translation systems, the method could potentially be extended to neural translation, for example with an attention-based model that is in some sense monotonic (left-to-right). The decoder may be relevant to work on simultaneous translation (He et al., 2016). The ideas may be applicable to string-to-string transduction problems other than machine translation.

A Sketch of the Decoding Algorithm of Chang and Collins (2017)
This section gives a sketch of the decoding algorithm of Chang and Collins (2017). We first define the phrase-based decoding problem, and then describe the algorithm.
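Throughout, we consider the translation of a source-language sentence $x_1 \ldots x_n$ of length $n$. A phrase $p$ is a tuple $(s, t, e)$ signifying that the source-language words $x_s \ldots x_t$ can be translated as the sequence of target-language words $e = e_1 \ldots e_m$. We use $s(p)$, $t(p)$, and $e(p)$ to refer to the three elements of a phrase $p$. A derivation is a sequence of $L$ phrases, $p_1 \ldots p_L$. The derivation gives a translation by concatenating the target-language strings $e(p_1) \ldots e(p_L)$.
We will always assume that $x_1$ = <s>, the start-of-sentence symbol, and $x_n$ = </s>, the end-of-sentence symbol. The only phrases covering positions 1 and $n$ are (1, 1, <s>) and ($n$, $n$, </s>).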
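A derivation $p_1 \ldots p_L$ is valid if each word in the source sentence is translated exactly once, and the distortion limit is satisfied: $|t(p_k) + 1 - s(p_{k+1})| \le d$ for $k = 1 \ldots L-1$, where $d$ is the distortion limit. The score for any derivation is

$$f(p_1 \ldots p_L) = \lambda(e(p_1) \ldots e(p_L)) + \sum_{k=1}^{L} \kappa(p_k) + \sum_{k=1}^{L-1} \eta \times |t(p_k) + 1 - s(p_{k+1})|$$

where the parameter $\eta$ is the distortion penalty, $\lambda(e)$ is a language model score for the word sequence $e$, and $\kappa(p)$ is the score for phrase $p$ under the phrase-based model. For example, under a bigram language model we have $\lambda(e_1 \ldots e_m) = \sum_{i=2}^{m} \lambda(e_i \mid e_{i-1})$, where $\lambda(v \mid u)$ is the score for the bigram $(u, v)$.
The phrase-based decoding problem is to find

$$\arg\max_{p_1 \ldots p_L \in \mathcal{P}} f(p_1 \ldots p_L)$$

where $\mathcal{P}$ is the set of all valid derivations for the input sentence.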
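The definitions above can be illustrated with a minimal Python sketch. It assumes the usual phrase-based formulation: a derivation is valid if every source word is covered exactly once and each jump |t(p_k) + 1 − s(p_{k+1})| is at most d, and the score is the sum of a bigram language model score, the phrase scores, and η times each jump. The names `Phrase`, `distortion`, `is_valid`, `score`, and the toy scoring functions are hypothetical and chosen only for illustration; they are not taken from any released implementation.

```python
from typing import Callable, Sequence, Tuple

# A phrase is (s, t, e): source span x_s..x_t and its target words e.
Phrase = Tuple[int, int, Tuple[str, ...]]


def distortion(prev: Phrase, nxt: Phrase) -> int:
    """Jump distance |t(p_k) + 1 - s(p_{k+1})| between consecutive phrases."""
    return abs(prev[1] + 1 - nxt[0])


def is_valid(derivation: Sequence[Phrase], n: int, d: int) -> bool:
    """Every source position 1..n is covered exactly once and no jump exceeds d."""
    covered = [0] * (n + 1)
    for s, t, _ in derivation:
        if not 1 <= s <= t <= n:
            return False
        for i in range(s, t + 1):
            covered[i] += 1
    if any(count != 1 for count in covered[1:]):
        return False
    return all(distortion(a, b) <= d for a, b in zip(derivation, derivation[1:]))


def score(
    derivation: Sequence[Phrase],
    kappa: Callable[[Phrase], float],
    bigram: Callable[[str, str], float],
    eta: float,
) -> float:
    """Bigram LM score + phrase scores + eta * total distortion (assumed form)."""
    words = [w for _, _, e in derivation for w in e]
    lm = sum(bigram(u, v) for u, v in zip(words, words[1:]))
    phrase = sum(kappa(p) for p in derivation)
    dist = sum(eta * distortion(a, b) for a, b in zip(derivation, derivation[1:]))
    return lm + phrase + dist


# Toy derivation over a 4-word source sentence "<s> a b </s>" that swaps a and b.
toy = [(1, 1, ("<s>",)), (3, 3, ("B",)), (2, 2, ("A",)), (4, 4, ("</s>",))]
toy_score = score(toy, kappa=lambda p: -0.5, bigram=lambda u, v: -1.0, eta=-1.0)
```

In the toy derivation the jumps are 1, 2, and 1, so it is valid for d = 2 but not for d = 1; with the constant toy scores the total is −3 (language model) − 2 (phrases) − 4 (distortion) = −9.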

The Decoding Algorithm
At a high level, the decoding algorithm of Chang and Collins (2017) differs from the commonly used approach of Koehn et al. (2003) in two important respects:
1. The decoding algorithm proceeds in strictly left-to-right order in the source sentence.
2. Each sub-derivation (item) in the beam consists of multiple sequences of phrases, instead of a single sequence.
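To be more precise, each sub-derivation in the decoding algorithm consists of:
1. An integer $j$ specifying the length of the derivation (i.e., that words $x_1 \ldots x_j$ have been translated).
2. A set of segments $\{\pi_1 \ldots \pi_r\}$, where each segment $\pi_i$ is a sequence of phrases.
A sub-derivation $(j, \{\pi_1 \ldots \pi_r\})$ is extended in two steps:
1. First, choose a phrase $p$ that starts at source position $j + 1$.
2. Second, extend the derivation using one of the following operations (we use CONCAT to denote an operation that concatenates two or more phrase sequences):
(a) Replace $\pi_i$ for some $i \in 1 \ldots r$ by CONCAT($\pi_i$, $p$).
(b) Replace $\pi_i$ for some $i \in 1 \ldots r$ by CONCAT($p$, $\pi_i$).
(c) Replace $\pi_i$, $\pi_{i'}$ for integers $i' \ne i$ by CONCAT($\pi_i$, $p$, $\pi_{i'}$).
(d) Create a new segment $\pi_{r+1} = \langle p \rangle$.
Figure 1 shows the sequence of steps, and the resulting sequence of sub-derivations, in the translation of a German sentence.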
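The extension step can be sketched in Python as follows. This is an illustrative enumeration under stated assumptions, not the authors' implementation: a sub-derivation is represented as a pair (j, segments), the chosen phrase is assumed to start at source position j + 1, operation (b) is assumed to prepend the phrase to a segment, and scores and the distortion limit are ignored. The names `extensions`, `Segment`, and `SubDerivation` are hypothetical.

```python
from typing import Iterator, Tuple

Phrase = Tuple[int, int, Tuple[str, ...]]      # (s, t, e)
Segment = Tuple[Phrase, ...]                   # a sequence of phrases
SubDerivation = Tuple[int, Tuple[Segment, ...]]


def extensions(item: SubDerivation, p: Phrase) -> Iterator[Tuple[str, SubDerivation]]:
    """Yield (operation label, new sub-derivation) for every way of adding p."""
    j, segs = item
    s, t, _ = p
    if s != j + 1:
        # Assumption: source words are consumed strictly left to right.
        raise ValueError("phrase must start at source position j + 1")
    r = len(segs)
    for i in range(r):
        # (a) append p to the end of segment i
        yield "a", (t, segs[:i] + (segs[i] + (p,),) + segs[i + 1:])
    for i in range(r):
        # (b) prepend p to the start of segment i (assumed form of this operation)
        yield "b", (t, segs[:i] + ((p,) + segs[i],) + segs[i + 1:])
    for i in range(r):
        for k in range(r):
            if i != k:
                # (c) join segment i, p, and segment k into one segment
                rest = tuple(seg for idx, seg in enumerate(segs) if idx not in (i, k))
                yield "c", (t, rest + (segs[i] + (p,) + segs[k],))
    # (d) start a new segment containing only p
    yield "d", (t, segs + ((p,),))


# Hypothetical toy item: three source words translated, held in two segments.
seg1: Segment = ((1, 1, ("<s>",)), (2, 2, ("we",)))
seg2: Segment = ((3, 3, ("must",)),)
item: SubDerivation = (3, (seg1, seg2))
p: Phrase = (4, 5, ("also", "take"))
results = list(extensions(item, p))
```

With r = 2 segments the sketch yields seven candidates: two each for (a), (b), and (c), and one for (d). A real decoder would additionally score each candidate and discard those that violate the distortion limit.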
A few remarks:
Remark 1. The score for each of the operations (a)-(d) described above is easily calculated using a combination of phrase, language model, and distortion scores.
Remark 2. The distortion limit can be used to rule out some of the operations (a)-(d) above, depending on the phrase $p$ and the start/end points of each of the segments $\pi_1 \ldots \pi_r$.
Remark 3. Dynamic programming can be used with this algorithm. Under a bigram language model, the dynamic programming state for a sub-derivation $(j, \{\pi_1 \ldots \pi_r\})$ records the words and positions at the start and end of each segment $\pi_1 \ldots \pi_r$. For example, under a bigram language model the sub-derivation (7, . . .) in Eq. 1 would be mapped to the dynamic-programming state (7, {(1, <s>, 4, also), (5, these, 7, seriously)}). See Chang and Collins (2017) for more details.
Remark 4. It is simple to use beam search in conjunction with the algorithm. Different derivations of the same length $j$ are compared in the beam. A heuristic, typically a lower-order language model, can be used to score the first $n - 1$ words in each segment $\pi_1 \ldots \pi_r$ (here $n$ denotes the order of the n-gram language model): this can be used as the "future score" for each item in the beam. This is arguably simpler than the future scores used in Koehn et al. (2003), which have to take into account the fact that different items in the beam correspond to translations of different subsets of words in the source sentence. In our approach, different derivations of the same length $j$ have translated the same set of words $x_1 \ldots x_j$. For example, in the sub-derivation (7, . . .) given above (Eq. 1), and given a trigram language model, the initial bigram "these criticisms" in $\pi_2$ is scored as $p_u(\text{these}) \times p_b(\text{criticisms} \mid \text{these})$, where $p_u$ and $p_b$ are unigram and bigram language models.
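Remarks 3 and 4 can be made concrete with a short Python sketch. The sub-derivation below is hypothetical: only its boundary words and positions, and the initial bigram "these criticisms", are taken from the text, while the internal phrase boundaries are invented for illustration. The functions `dp_state` and `future_score` and the constant toy log-probabilities are likewise assumptions. The future score is computed in log space for a trigram model, and every segment is scored the same way, which is a simplification.

```python
from typing import Callable, FrozenSet, Tuple

Phrase = Tuple[int, int, Tuple[str, ...]]      # (s, t, e)
Segment = Tuple[Phrase, ...]


def dp_state(j: int, segs: Tuple[Segment, ...]) -> Tuple[int, FrozenSet[tuple]]:
    """Bigram-LM signature: (start position, first word, end position, last word) per segment."""
    return j, frozenset(
        (seg[0][0], seg[0][2][0], seg[-1][1], seg[-1][2][-1]) for seg in segs
    )


def future_score(
    segs: Tuple[Segment, ...],
    log_pu: Callable[[str], float],
    log_pb: Callable[[str, str], float],
) -> float:
    """Score the first two words of each segment with unigram and bigram models."""
    total = 0.0
    for seg in segs:
        words = [w for _, _, e in seg for w in e]
        total += log_pu(words[0])                  # log p_u(w1)
        if len(words) > 1:
            total += log_pb(words[1], words[0])    # log p_b(w2 | w1)
    return total


# Hypothetical segments consistent with the state quoted in Remark 3.
pi_1: Segment = ((1, 1, ("<s>",)), (2, 3, ("we", "must")), (4, 4, ("also",)))
pi_2: Segment = ((5, 6, ("these", "criticisms")), (7, 7, ("seriously",)))

state = dp_state(7, (pi_1, pi_2))
toy_future = future_score(
    (pi_1, pi_2),
    log_pu=lambda w: -1.0,         # toy constant log-probabilities
    log_pb=lambda v, u: -0.5,
)
```

Two sub-derivations with the same signature can be merged during dynamic programming, because under a bigram model any future extension interacts only with the boundary words and positions of each segment.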

Experiments
The original motivation for Chang and Collins (2017) was to develop a dynamic-programming algorithm for phrase-based decoding that, for a fixed distortion limit $d$, was polynomial time in other factors: the resulting dynamic programming algorithm runs in $O(n\,d!\,l\,h^{d+1})$ time, where $n$ is the sentence length, $d$ is the distortion limit, $l$ is a bound on the number of phrases starting at any position, and $h$ is related to the maximum number of different target translations for any source position. However, an open question is whether the algorithm is useful in practice when used in conjunction with beam search. This section describes experiments comparing the algorithm with the approach of Koehn et al. (2003). Throughout this section we refer to the algorithm of Chang and Collins (2017) as the "new" decoding algorithm.
Data. We use the Europarl parallel corpus (Version 7) (Koehn, 2005) for all language pairs except for Vietnamese-English (vi-en). For Czech-English (cs-en), we use Newstest2015 as the development set and Newstest2016 as the test set. For European languages other than Czech, we use the development and test sets released for the Shared Task of WPT 2005. For vi-en, we use the IWSLT'15 data.

Search Space with a Bigram Model
We first analyze the properties of the algorithm by running the exact decoding algorithm with a bigram language model and a fixed distortion limit of four, with no pruning. In Figure 2, we plot the number of transitions computed versus sentence length for translation of 2,000 German sentences to English. The figure confirms that the search space grows linearly with the number of words in the source sentence.

Beam Search under the New Algorithm
Even though the exact algorithm is linear time in the input sentence length, other factors (the dependence on $d$, $l$, and $h$, as described above) make the exact algorithm too costly to be useful in practice.
We experiment with beam search under the new algorithm, both with and without further pruning or restriction. In particular, we experiment with a segment constraint on the new algorithm: a hard limit $r \le 2$ on the number of segments $\pi_1 \ldots \pi_r$ used in any translation. Figure 3 shows results using a trigram language model for the new algorithm with beam search (SegmentD), the new algorithm with beam search and a hard limit $r \le 2$ on the number of segments (Segment2), and Moses. A beam size of 100 is used with all the algorithms. For each language pair, we pick the distortion limit that maximizes the BLEU score for Moses. Moses was used to train all the translation models. It can be seen that the Segment2 algorithm gives very similar performance to Moses, while SegmentD has inferior performance for languages which require a larger distortion limit.
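The pruning described above can be sketched as a single step in Python. This is a minimal illustration rather than the implementation used in the experiments: items are assumed to be (j, segments, score) triples, where the score already includes any future score, items are grouped by the number of source words j they have translated, and `prune`, `beam_size`, and `max_segments` are hypothetical names.

```python
import heapq
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

Item = Tuple[int, tuple, float]    # (j, segments, score including future score)


def prune(
    items: Iterable[Item],
    beam_size: int,
    max_segments: Optional[int] = None,
) -> Dict[int, List[Item]]:
    """Keep the best beam_size items for each j, optionally capping the segment count."""
    beams: Dict[int, List[Item]] = defaultdict(list)
    for j, segs, score in items:
        if max_segments is not None and len(segs) > max_segments:
            continue                       # hard limit r <= max_segments
        beams[j].append((j, segs, score))
    return {
        j: heapq.nlargest(beam_size, group, key=lambda it: it[2])
        for j, group in beams.items()
    }


# Toy items; segment contents are placeholders, only their number matters here.
toy_items: List[Item] = [
    (3, ("A", "B"), -1.0),
    (3, ("A", "B", "C"), -0.5),
    (3, ("A",), -2.0),
    (3, ("X",), -3.0),
    (4, ("A",), -1.5),
]
limited = prune(toy_items, beam_size=2, max_segments=2)
unlimited = prune(toy_items, beam_size=2)
```

With `max_segments=2` the three-segment item is discarded before the beam is filled even though it has the best score, which mirrors the hard limit r ≤ 2; with no limit it survives at the top of its beam.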

Experiments on the Number of Segments Required for German-to-English Translation
Finally, we investigate empirically how many segments (the maximum value of $r$) are required for translation from German to English. In a first experiment, we use the system of Chang and Collins (2011) to perform exact search for German-to-English translation under a trigram language model with a distortion limit $d = 4$, and then look at the maximum value of $r$ for each optimal translation. Out of 1,821 sentences, 34.9% have a maximum value of $r = 1$, 62.4% have $r = 2$, and 2.69% have $r = 3$ (Table 4a). No optimal translations require a value of $r$ greater than 3. It can be seen that very few translations require more than 2 segments. In a second experiment, we take the reordering system of Collins et al. (2005) and measure the maximum value of $r$ needed on each sentence to capture the reordering rules. Table 4b gives the results. It can be seen that over 99% of sentences require a value of $r = 3$ or less, again suggesting that, for at least this language pair, a choice of $r = 3$ or $r = 4$ is large enough to capture the majority of reorderings (assuming that the rules of Collins et al. (2005) are comprehensive).
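The cumulative coverage implied by the first experiment can be checked with a few lines of Python. The percentages are those reported above; the helper name `cumulative_coverage` is hypothetical.

```python
from typing import Dict

# Share of sentences (in percent) whose optimal translation needs a maximum of r segments.
shares: Dict[int, float] = {1: 34.9, 2: 62.4, 3: 2.69}


def cumulative_coverage(shares: Dict[int, float]) -> Dict[int, float]:
    """Percentage of sentences that need at most r segments, for each r."""
    total = 0.0
    coverage = {}
    for r in sorted(shares):
        total += shares[r]
        coverage[r] = round(total, 2)
    return coverage


coverage = cumulative_coverage(shares)
```

A limit of r ≤ 2 therefore covers about 97.3% of the optimal translations in this experiment, which is consistent with the observation that very few translations require more than 2 segments.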

Conclusion
The goal of this paper was to understand the empirical performance of a newly proposed decoding algorithm that operates from left to right on the source side. We compare our implementation of the new algorithm with the Moses decoder. The experimental results demonstrate that the new algorithm combined with beam search and segment-based pruning is competitive with the Moses decoder. Future work should consider integration of the method with more recent models, in particular neural translation models.