Hierarchical Machine Translation With Discontinuous Phrases

We present a hierarchical statistical machine translation system which supports discontinuous constituents. It is based on synchronous linear context-free rewriting systems (SLCFRS), an extension to synchronous context-free grammars in which synchronized non-terminals span k ≥ 1 continuous blocks on either side of the bitext. This extension beyond context-freeness is motivated by certain complex alignment configurations that are beyond the alignment capacity of current translation models and their relatively frequent occurrence in hand-aligned data. Our experiments for translating from German to English demonstrate the feasibility of training and decoding with more expressive translation models such as SLCFRS and show a modest improvement over a context-free baseline.


Introduction
In statistical machine translation, phrase-based translation models with a beam search decoder (Koehn et al., 2003) and tree-based models with a CYK decoder represent two prominent types of approaches. The latter usually employ some form of synchronous context-free grammar (SCFG). They can be grouped into so-called hierarchical phrase-based models that are formally syntax-based, such as in Chiang (2007), and models where hierarchical units are linguistically motivated, e.g. in Zollmann and Venugopal (2006) and Hoang and Koehn (2010).
The adequacy of all of these models has been questioned, as the space of alignments that they generate is limited. Inside-out alignments are beyond the alignment capacity of SCFG of rank 2 (henceforth 2-SCFG) and inversion transduction grammar (Wu, 1997), but they can be generated with phrase-based translation models thanks to the reordering component of standard decoders. Cross-serial discontinuous translation units (CDTU) (Søgaard and Kuhn, 2009) and bonbon configurations (Simard et al., 2005) in contrast can neither be generated by a phrase-based translation system nor by an SCFG-based one. It is thereby assumed that a translation unit, the transitive closure of a set of nodes of the bipartite alignment graph, represents minimal translational equivalence, and therefore that an adequate translation grammar formalism should be able to generate each translation unit separately. The aforementioned problematic alignment configurations are schematically depicted in Figure 1. Alignment (i) is an inside-out alignment; it is formed by four translation units (a, b, c and d). CDTUs (ii) and bonbons (iii) each consist of two intertwined discontinuous translation units.
Several studies have investigated the alignment capacity of SCFG-based and phrase-based translation models in different setups (Wellington et al., 2006; Søgaard and Kuhn, 2009; Søgaard and Wu, 2009; Søgaard, 2010; Kaeshammer, 2013). For example, Wellington et al. (2006) find that inside-out alignments occur in 5% of their manually aligned English-Chinese sentence pairs. In the study of Kaeshammer (2013), 9% of the sentence pairs in a Spanish-French data set and 5.5% of the sentence pairs in an English-German data set cannot be generated by a 2-SCFG. In addition, Kaeshammer and Westburg (2014) qualitatively investigate the instances of the complex alignment configurations in the same English-German data set and find that even though some of them are due to annotation errors, most of them are correctly annotated phenomena that one would like to be able to generate when translating.
To be able to induce the alignment configurations in question, more expressive translation models and corresponding decoding algorithms are necessary. For the phrase-based models, Galley and Manning (2010) propose a translation model that uses discontinuous phrases and a corresponding beam search decoder. For tree-based models, a grammar formalism beyond the power of context-free grammar is necessary. Søgaard (2008) proposes to apply range concatenation grammar; Kaeshammer (2013) puts forward the idea of using synchronous linear context-free rewriting systems (SLCFRS), a direct extension of SCFG to discontinuous constituents. To the best of our knowledge, neither of the two proposals has resulted in an actual machine translation system.
With this work, we extend the line of research proposed in Kaeshammer (2013), and present the first full tree-based statistical machine translation system that allows for discontinuous constituents. It is thus able to produce the complex alignment configurations in Figure 1. As such, it combines the advantage of being able to learn and generate discontinuous phrases with the benefits of tree-based translation models. Currently, our system is hierarchical phrase-based, i.e. it does not make use of linguistically motivated syntactic annotation. However, it will be straightforward to transfer methods to integrate linguistic constituency information from the SCFG-based machine translation literature (such as Zollmann and Venugopal (2006)) to our approach. This is particularly interesting, since, in the monolingual parsing community, approaches that are able to produce constituency trees with discontinuous constituents have become increasingly popular (Maier, 2010; van Cranenburgh and Bod, 2013; Kallmeyer and Maier, 2013). Recently, such parsers have reached a speed with which it would actually be feasible to parse the training set of a machine translation system (Versley, 2014; Maier, 2015; Fernández-González and Martins, 2015), which is necessary to train syntactically motivated translation grammars.
In this work, we define a translation model based on SLCFRS, explain the training of a corresponding hierarchical phrase-based grammar, provide details about a corresponding decoder and results of experiments for translating from German to English.

Model
Our translation model is a weighted synchronous LCFRS. Conceptually, this grammar formalism is very close to synchronous CFG, with the addition that non-terminals span tuples of strings (instead of just strings) on either side of the bitext. Just like an SCFG, an SLCFRS can be used for synchronous parsing of parallel sentences as well as for translating monolingual sentences. For the latter, the source side of the synchronous grammar is used to parse the input text, thereby generating target side derivations from which the translations can be read off.

Synchronous LCFRS
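An LCFRS (Vijay-Shanker et al., 1987; Weir, 1988) is a tuple $G = (N, T, V, P, S)$ where $N$ is a finite set of non-terminals with a function $\dim: N \to \mathbb{N}$ determining the fan-out of each $A \in N$; $T$ and $V$ are disjoint finite sets of terminals and variables; $S \in N$ is the start symbol with $\dim(S) = 1$; and $P$ is a finite set of rewriting rules
$$A(\alpha_1, \ldots, \alpha_{\dim(A)}) \to A_1(Y^{(1)}_1, \ldots, Y^{(1)}_{\dim(A_1)}) \cdots A_m(Y^{(m)}_1, \ldots, Y^{(m)}_{\dim(A_m)})$$
where $A, A_1, \ldots, A_m \in N$, $Y^{(i)}_j \in V$ for $1 \le i \le m$ and $1 \le j \le \dim(A_i)$, and $\alpha_i \in (T \cup V)^*$ for $1 \le i \le \dim(A)$, for a rank $m \ge 0$. For all $r \in P$, it holds that every variable $Y$ in $r$ occurs exactly once in the left-hand side (LHS) and exactly once in the right-hand side (RHS) of $r$.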
A non-terminal is instantiated with respect to some input string $w$ such that terminals and variables are consistently mapped to $w$. A rule $r$ explains how an instantiated LHS non-terminal can be rewritten by its instantiated RHS non-terminals. A derivation starts with the start symbol $S$ instantiated to the input string $w$. All strings that can be rewritten to $\varepsilon$ are in the language of the grammar. For more formal definitions, see for example Kallmeyer (2010).
Figure 2: Rules of an SLCFRS for $L = \{\langle a^n b^m c^n d^m,\; a^n b^m d^m c^n \rangle \mid n, m > 0\}$, taken from Kaeshammer (2013).
The rank of a grammar $G$ is the maximal rank of any of its rules, and its fan-out is the maximal fan-out of any of its non-terminals. $G$ is called a $(u, v)$-LCFRS if it has rank $u$ and fan-out $v$. A CFG is the special case of an LCFRS with fan-out $v = 1$. An LCFRS is monotone if, for every rule and every RHS non-terminal, the order of the variables in the arguments of this non-terminal is the same as the order of these variables in the arguments of the LHS non-terminal of this rule. This means that the order of (instantiated) arguments of the LHS non-terminal of a rule always corresponds to their order in the input sentence. An LCFRS is called ε-free if all of its rules in $P$ are ε-free, which means that none of their LHS arguments is the empty string $\varepsilon$. (An LCFRS is also ε-free if it contains a rule $S(\varepsilon) \to \varepsilon$, but $S$ does not appear in any RHS of the rules in $P$.)
The definition of synchronous LCFRS (SLCFRS) follows the definition of synchronous CFG, as for example in Satta and Peserico (2005). An SLCFRS (Kaeshammer, 2013) is a tuple $G = (N_s, N_t, T_s, T_t, V_s, V_t, P, S_s, S_t)$ where $N_s, T_s, V_s, S_s$, resp. $N_t, T_t, V_t, S_t$, are defined as for LCFRS. They denote the alphabets for the source and target side respectively. $P$ is a finite set of synchronous rewriting rules $\langle r_s, r_t, \sim \rangle$ where $r_s$ and $r_t$ are LCFRS rewriting rules based on $N_s, T_s, V_s$ and $N_t, T_t, V_t$ respectively, and $\sim$ is a bijective mapping of the non-terminals in the RHS of $r_s$ to the non-terminals in the RHS of $r_t$. This link relation is represented by co-indexation in the synchronous rules. During a derivation, the yields of two co-indexed non-terminals have to be explained from one synchronous rule. $\langle S_s, S_t \rangle$ is the start pair. In such a derivation, we call the yield of $S_s$ the source side yield and the yield of $S_t$ the target side yield. SLCFRS are equivalent to simple range concatenation transducers (Bertsch and Nederhof, 2001). Figure 2 shows an example.
The synchronous rules translate cross-serial dependencies into nested ones. A sample derivation is shown in Figure 3.
An SLCFRS $G$ thus comprises two LCFRSs, the source side grammar $G_s = (N_s, T_s, V_s, P_s, S_s)$ and the target side grammar $G_t = (N_t, T_t, V_t, P_t, S_t)$, where $P_s$ is the set of all $r_s$ in $P$ and $P_t$ is the set of all $r_t$ in $P$. The rank $u$ of an SLCFRS $G$ is the maximal rank of $G_s$ and $G_t$, and the fan-out $v$ of $G$ is the sum of the fan-outs of $G_s$ and $G_t$. One may write $v_{v_{G_s}|v_{G_t}}$ to make clear how the fan-out of $G$ is distributed over the source and the target side. As in the monolingual case, a corresponding grammar $G$ is called a $(u, v)$-SLCFRS. The rank of the grammar in Figure 2 is 2 and its fan-out is $4_{2|2}$. We call an SLCFRS monotone if the source side grammar as well as the target side grammar is monotone. We call an SLCFRS ε-free if the source side grammar as well as the target side grammar is ε-free.
We further define some terms which will be used in the following sections. A range in a string $w_1^n$ is a pair $\langle l, r \rangle$ with $0 \le l \le r \le n$. Its yield $\langle l, r \rangle(w)$ is the string $w_{l+1}^{r}$. The yield $\rho(w)$ of a vector of ranges $\rho$ is the vector of the yields of the single ranges.
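The range definitions can be made concrete with a small sketch. It is only an illustration of the definitions above; the function names `range_yield` and `vector_yield` are chosen here and do not stem from the system described in this paper.

```python
from typing import List, Sequence, Tuple

Range = Tuple[int, int]  # (l, r) with 0 <= l <= r <= n


def range_yield(rng: Range, words: Sequence[str]) -> List[str]:
    """Return the words w_{l+1} ... w_r covered by the range (l, r)."""
    l, r = rng
    if not 0 <= l <= r <= len(words):
        raise ValueError(f"invalid range {rng} for input of length {len(words)}")
    return list(words[l:r])


def vector_yield(ranges: Sequence[Range], words: Sequence[str]) -> List[List[str]]:
    """Return the yields of all ranges of a range vector, in order."""
    return [range_yield(rng, words) for rng in ranges]


sentence = "il ne mange plus".split()
print(range_yield((1, 4), sentence))             # one continuous block
print(vector_yield([(1, 2), (3, 4)], sentence))  # two blocks around a gap
```

Because a range counts positions between words, the yield of $\langle l, r \rangle$ corresponds directly to the half-open slice `words[l:r]`.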

Definition
Given a source sentence $f$ and an SLCFRS, generally, many derivations will have $f$ as the source side yield, leading to many (different) target side yields, i.e. possible translations $e$. As is standard in statistical machine translation, we use a log-linear model over derivations $D$ to weight those translation options. The definition closely follows the model definition for SCFG, see Chiang (2007) for example:
$$P(D) \propto \prod_i \phi_i(D)^{\lambda_i}$$
where $\phi_i$ are features defined on the derivations, and $\lambda_i$ are feature weights to be set during tuning. An n-gram language model provides a feature $P_{LM}(e)$ for the probability of seeing the target sentence $e$ as derived by $D$. The other features ($i \neq LM$) are defined on the rules of a weighted SLCFRS which are used in the derivation $D$.
A weighted SLCFRS is an SLCFRS that is additionally equipped with a weight function $w$ which assigns a weight to each synchronous rule $r \in P$. To fit the log-linear model, we define $w$ as
$$w(r) = \prod_{i \neq LM} \phi_i(r)^{\lambda_i}$$
The weight of a derivation $D$ is then the product of the weights of the rule applications $r$ in $D$:
$$w(D) = \prod_{r \in D} w(r)$$
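The log-linear weighting can be sketched in a few lines. The sketch assumes the SCFG-style formulation of Chiang (2007), in which the unnormalised score of a derivation is the language model feature raised to its weight times the product of the rule weights. The feature names and all numeric values are invented for illustration.

```python
import math
from typing import Dict, List

Features = Dict[str, float]  # feature name -> feature value phi_i(r)


def rule_weight(features: Features, lambdas: Dict[str, float]) -> float:
    """w(r): product over the rule features of phi_i(r) ** lambda_i."""
    return math.prod(value ** lambdas[name] for name, value in features.items())


def derivation_score(rules: List[Features], lm_prob: float,
                     lambdas: Dict[str, float]) -> float:
    """Unnormalised score: P_LM(e) ** lambda_LM times the product of rule weights."""
    w_d = math.prod(rule_weight(rule, lambdas) for rule in rules)
    return lm_prob ** lambdas["lm"] * w_d


# Toy values, invented for illustration only.
lambdas = {"p_t_given_s": 1.0, "rule_penalty": 2.0, "lm": 0.5}
rule = {"p_t_given_s": 0.5, "rule_penalty": math.e}
print(derivation_score([rule, rule], lm_prob=0.25, lambdas=lambdas))
```

A practical decoder would typically work with sums of weighted log values rather than products, to avoid numerical underflow on long derivations.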

Features
We use the following standard features $\phi_i(r)$:
• translation probabilities in both directions, $P(r_s \mid r_t)$ and $P(r_t \mid r_s)$,
• lexical weights $lex(r_s \mid r_t)$ and $lex(r_t \mid r_s)$ (Koehn et al., 2003) that estimate how well the terminals in the rule translate to each other,
• a rule penalty $\exp(1)$,
• a word penalty $\exp(-|w_t|)$ where $|w_t|$ is the number of terminals that occur in $r_t$.
In addition, we devise features that characterize the amount of expressivity beyond context-freeness of the applied rules. The source gap degree of $r$ is the fan-out of $r_s$ minus 1, and the target gap degree of $r$ is the fan-out of $r_t$ minus 1. See Maier and Lichte (2011) for more details about gap degree. These features can be read off the rules $r$ directly. They allow the model to learn a preference for or against using the more powerful rules.
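Reading the gap degree off a rule amounts to counting the arguments of its LHS non-terminal on each side. The following sketch uses a hypothetical `SyncRule` container that stores only the LHS arguments; the example rule is loosely modelled on the terminal rule with the target side ⟨do not want, anymore⟩ discussed later in the paper.

```python
from dataclasses import dataclass
from typing import Tuple

Argument = Tuple[str, ...]  # one continuous block of terminals and variables


@dataclass(frozen=True)
class SyncRule:
    """Hypothetical container holding only the LHS arguments of r_s and r_t."""
    source_args: Tuple[Argument, ...]
    target_args: Tuple[Argument, ...]


def gap_degrees(rule: SyncRule) -> Tuple[int, int]:
    """Return (source gap degree, target gap degree), i.e. fan-out minus 1 per side."""
    return len(rule.source_args) - 1, len(rule.target_args) - 1


rule = SyncRule(
    source_args=(("ne", "veux", "plus"),),               # continuous source side
    target_args=(("do", "not", "want"), ("anymore",)),   # two target blocks
)
print(gap_degrees(rule))
```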
We also use glue rules, as proposed by Chiang (2005), which allow for a monotone combination of synchronous constituents as in a phrase-based model. A glue rule feature of value $\exp(1)$ with its weight $\lambda_{glue}$ controls their usage.

Training
The synchronous rules are extracted from a corpus of parallel sentences that have already been word-aligned. Following Och and Ney (2004) and Chiang (2005), we extract all rules that are consistent with the word alignment $A$ of a sentence pair $\langle f, e \rangle$ in a two-step procedure. First, initial phrase pairs are extracted; they correspond to terminal rules. Second, hierarchical rules are created by replacing phrase pairs that are contained within other phrase pairs with non-terminals/variables.
The crucial difference to previous work on translation with SCFG is that initial phrases do not have to be continuous. Instead, a phrase is a set of word indices, as in Galley and Manning (2010). Given $\langle f, e \rangle$ and a corresponding word alignment $A$, a phrase pair $(\bar{s}, \bar{t})$ is consistent with $A$ if the following holds:
$$\exists\, i \in \bar{s},\ j \in \bar{t} : (i, j) \in A \quad \text{and} \quad \forall (i, j) \in A : i \in \bar{s} \Leftrightarrow j \in \bar{t}$$
For each initial phrase pair $(\bar{s}, \bar{t})$, a terminal synchronous rule of the following form is created and added to $P$:
$$\langle X(\rho_s(f)) \to \varepsilon,\ X(\rho_t(e)) \to \varepsilon \rangle$$
$\rho_s$ and $\rho_t$ are range vectors, applied to the source sentence $f$ and target sentence $e$ respectively. $\rho_s$ (respectively $\rho_t$) is obtained by partitioning $\bar{s}$ (respectively $\bar{t}$) such that each subset contains all and only consecutive indices, designating a continuous block of the discontinuous phrase. Such a subset $X$ is turned into a range $\langle l, r \rangle$ with $l = \min(X)$ and $r = \max(X)$. The ranges obtained from $\bar{s}$ (respectively $\bar{t}$), in ascending order, form $\rho_s$ (respectively $\rho_t$).
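The two ingredients of initial phrase extraction, the consistency check and the partitioning of an index set into ranges, can be sketched as follows. The consistency check implements the standard notion from the phrase extraction literature (at least one alignment link inside the pair, and no link leaving it). As an assumption of this sketch, word positions are 1-based, so that a block from position `lo` to `hi` becomes the range `(lo - 1, hi)`; a different indexing convention shifts this offset. The alignment below is invented.

```python
from typing import List, Set, Tuple

Link = Tuple[int, int]  # (source position, target position), 1-based


def is_consistent(src: Set[int], tgt: Set[int], alignment: Set[Link]) -> bool:
    """True if no link leaves the phrase pair and at least one link lies inside it."""
    has_link = False
    for i, j in alignment:
        if (i in src) != (j in tgt):   # link crosses the phrase pair boundary
            return False
        has_link = has_link or i in src
    return has_link


def to_range_vector(positions: Set[int]) -> List[Tuple[int, int]]:
    """Partition positions into maximal runs of consecutive indices, as ranges."""
    blocks: List[List[int]] = []
    for p in sorted(positions):
        if blocks and p == blocks[-1][1] + 1:
            blocks[-1][1] = p          # extend the current block
        else:
            blocks.append([p, p])      # start a new block
    return [(lo - 1, hi) for lo, hi in blocks]


# Invented alignment between two hypothetical five-word sentences.
alignment = {(2, 2), (3, 3), (3, 1), (4, 5)}
print(is_consistent({2, 4}, {2, 5}, alignment))   # discontinuous on both sides
print(to_range_vector({2, 4}))                    # two blocks
```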
Furthermore, if $P$ contains a rule $\langle X(\alpha) \to \Psi,\ X(\beta) \to \Theta \rangle$ that has been built from a phrase pair $(\bar{s}, \bar{t})$ and the set of phrase pairs contains a pair $(\bar{s}', \bar{t}')$ such that $\bar{s}' \subset \bar{s}$ and $\bar{t}' \subset \bar{t}$, we add the following new rule to $P$:
$$\langle X(\alpha') \to \Psi\, X_k(Y_1, \ldots, Y_{h_s}),\ X(\beta') \to \Theta\, X_k(Z_1, \ldots, Z_{h_t}) \rangle$$
A new non-terminal $X$ is added to the RHS of $r_s$ and $r_t$. $k$ is an index that is not yet used in the bijective mapping of non-terminals in $\Psi$ and $\Theta$. Range vectors $\rho_s$ and $\rho_t$ are deduced from $\bar{s}'$ and $\bar{t}'$ as described above. Each range in $\rho_s$ (respectively $\rho_t$) is associated with a variable $Y_i$ for $1 \le i \le h_s$ (respectively $Z_j$ for $1 \le j \le h_t$), where $h_s$ (respectively $h_t$) is the length of $\rho_s$ (respectively $\rho_t$). They have to be variables that are not yet in use in $\alpha$ (respectively $\beta$). Those variables constitute the arguments of the new synchronous non-terminal $X$. Accordingly, $h_s$ and $h_t$ are the fan-outs of $X$ on the source and the target side respectively. $\alpha'$ (respectively $\beta'$) is created from $\alpha$ (respectively $\beta$) by replacing the terminals that correspond to ranges in $\rho_s$ (respectively $\rho_t$) with the variable $Y_i$ (respectively $Z_j$) that has been associated with the range. Note that this extraction yields only monotone and ε-free (S)LCFRS, which simplifies parsing.
The discontinuous rule extraction procedure is exemplified in Figure 4. Rule #5 for example was created from rule #3 by substituting phrase pair #2. Note that phrase pairs #1 and #4 are also extracted by a phrase-based system, and rules #1, #4 and #6 are also generated by a hierarchical phrase-based, i.e. SCFG-based, system. Rule #6 would usually be written down as $X \to \langle$ ne veux plus $X_1$, do not want to $X_1$ anymore $\rangle$. However, just like Galley and Manning (2010), we extract many more rules that also capture discontinuous translation units. In addition, we also extract rules which are discontinuous and hierarchical at the same time. They capture relationships between possibly discontinuous translation units.
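The substitution step can be illustrated for one side of a rule. The sketch below is a simplification under stated assumptions: it handles only the case where the containing rule is still a terminal rule (no variables yet), it treats one language side at a time, and word positions are 1-based. The function names are chosen here for illustration; the example is loosely modelled on how rule #5 arises from rule #3.

```python
from typing import Dict, List, Sequence, Set, Tuple


def blocks(positions: Set[int]) -> List[List[int]]:
    """Split word positions into maximal runs of consecutive positions."""
    runs: List[List[int]] = []
    for p in sorted(positions):
        if runs and p == runs[-1][-1] + 1:
            runs[-1].append(p)
        else:
            runs.append([p])
    return runs


def substitute(words: Sequence[str], parent: Set[int], child: Set[int],
               var_prefix: str = "Y") -> Tuple[List[List[str]], List[str]]:
    """Replace each block of `child` inside `parent` by a fresh variable.

    Returns the new LHS arguments (one symbol list per block of `parent`)
    and the variables that form the arguments of the new RHS non-terminal.
    """
    if not child < parent:
        raise ValueError("child phrase must be properly contained in parent")
    var_at: Dict[int, str] = {}
    variables: List[str] = []
    for k, run in enumerate(blocks(child), start=1):
        variables.append(f"{var_prefix}{k}")
        var_at[run[0]] = variables[-1]   # the variable sits where its block starts
    lhs_args: List[List[str]] = []
    for run in blocks(parent):
        arg: List[str] = []
        for p in run:
            if p in var_at:
                arg.append(var_at[p])
            elif p not in child:          # terminals outside the child stay
                arg.append(words[p - 1])
        lhs_args.append(arg)
    return lhs_args, variables


print(substitute(["ne", "veux", "plus"], parent={1, 2, 3}, child={2}))
```

Since the blocks of the contained phrase are visited in ascending order, the variables appear in the LHS arguments in the same order as in the new RHS non-terminal, which is why the resulting rules are monotone.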
Enumerating all discontinuous phrase pairs is exponential in the maximum phrase length. Therefore, in addition to the constraints that are generally set for SCFG extraction (e.g. phrase length, number of non-terminals, adjacent non-terminals on the source side, unaligned words at phrase edges, see Chiang (2007)), we also restrict the number of words that can be in a gap, we disallow unaligned blocks, and we restrict the number of continuous blocks in a phrase to 2. The latter is motivated by the results presented in Kaeshammer (2013) where a fan-out of $4_{2|2}$ is enough to derive the alignments in all data sets. We furthermore analyse the alignments of the training data before running the extraction and only allow discontinuous phrase pairs in synchronous spans which contain any of the alignment configurations that are beyond the power of SCFG.
As derivations are not observable in the training data, we use the method described in Chiang (2007) to hypothesize a distribution based on the counts of the extracted rules and then use relative-frequency estimation to obtain $P(r_s \mid r_t)$ and $P(r_t \mid r_s)$.

Decoder
Our decoder closely follows the methodology of current SCFG decoders, with the difference that it is able to handle source and target discontinuities in the form of SLCFRS rules. The goal is to find the target sequence $e$ of the highest scoring derivation $D$ according to the model defined in Section 2.2 that yields $\langle f, e \rangle$, where $f$ is the given input sentence.
We parse the input sentence with a bottom-up CYK parser using the source side of the SLCFRS translation grammar. This corresponds to monolingual probabilistic LCFRS parsing, which has been described for example in Kallmeyer and Maier (2013). Using the rules, parse items are built. They are of the form [A, ρ], where A is a non-terminal label and ρ is a range vector indicating which part of the input is covered by this item. For the label, we use a combination of the source side label and the target side label in order to ensure valid target side derivations. Smaller items, i.e. items that cover fewer input words, are created before larger items. Equal items are combined, thereby retaining their origin via hyperedges.
When creating a new item using a specific rule, the variables and arguments in the rule have to be replaced consistently with ranges $\langle l, r \rangle$ of the input sentence. Roughly, this means that terminals and variables are instantiated with ranges such that for ranges that are adjacent in an argument of the LHS non-terminal, the concatenation of the two ranges has to be defined, i.e. $r_1 = l_2$ for $\langle l_1, r_1 \rangle$ and $\langle l_2, r_2 \rangle$. For example, given the input ${}_0$ il ${}_1$ ne ${}_2$ mange ${}_3$ plus ${}_4$, $X(\langle 1, 4 \rangle) \to X(\langle 2, 3 \rangle)$ is an instantiation of the source side of rule #5 from Figure 4. We can make further assumptions about rule instantiations, as our rules are all monotone and ε-free, and we do not allow for empty gaps to avoid spurious ambiguity.
In the implementation, we first replace all terminals with all possible ranges with respect to the input sentence in an initialisation step; for instance $X(\langle 1, 2 \rangle\, Y_1\, \langle 3, 4 \rangle) \to X(Y_1)$ for the previous example. During the actual parsing, we are then only concerned with how variables are instantiated. We implement different pruning methods, such as limiting the number of target side rules for the same source side rule, and limiting the number of incoming hyperedges for one parse item.
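The instantiation of one LHS argument after the initialisation step can be sketched as follows. This is a minimal illustration under the assumption that an argument is a sequence of already instantiated terminal ranges and variable names; the function name `instantiate_argument` is hypothetical and the actual decoder is implemented in C++.

```python
from typing import Dict, List, Optional, Tuple, Union

Range = Tuple[int, int]
Symbol = Union[Range, str]  # an instantiated terminal range or a variable name


def instantiate_argument(arg: List[Symbol],
                         binding: Dict[str, Range]) -> Optional[Range]:
    """Concatenate the ranges of one LHS argument.

    Returns the covered range, or None if two neighbouring ranges are not
    adjacent in the input, i.e. if their concatenation is undefined.
    """
    current: Optional[Range] = None
    for sym in arg:
        rng = binding[sym] if isinstance(sym, str) else sym
        if current is None:
            current = rng
        elif current[1] == rng[0]:        # r1 == l2: the ranges are adjacent
            current = (current[0], rng[1])
        else:
            return None
    return current


# Source side of the example rule after initialisation on "il ne mange plus".
argument = [(1, 2), "Y1", (3, 4)]
print(instantiate_argument(argument, {"Y1": (2, 3)}))   # Y1 covers "mange"
print(instantiate_argument(argument, {"Y1": (0, 1)}))   # Y1 covers "il"
```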
Because of the specific form of the grammar that we have extracted (rank 2, fan-out $4_{2|2}$), we implement a specific parser for (2, 2)-LCFRS. Accordingly, the range vector $\rho$ of an item has the form $\langle \langle i_1, j_1 \rangle, \langle i_2, j_2 \rangle \rangle$, where $i_2$ and $j_2$ are undefined if the yield of the item is continuous. Such range vectors can be stored and retrieved more efficiently than the general range vectors needed for full LCFRS (which are typically implemented as bit vectors of the size of the input sentence). Moreover, parsing time complexity depends directly on the fan-out $v_s$ of the monolingual grammar: $O(|G_s| \cdot |f|^{v_s \cdot (u+1)})$, with rank $u = 2$ and fan-out $v_s = 2$ in our case.
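Plugging the values into the bound gives a feel for the cost of source-side discontinuity. The second line, for fan-out 1, is added here purely for comparison (it corresponds to the context-free case) and is not a measurement from the experiments.

```latex
\begin{align*}
v_s = 2,\ u = 2 &: \quad O\bigl(|G_s| \cdot |f|^{2 \cdot (2+1)}\bigr) = O\bigl(|G_s| \cdot |f|^{6}\bigr) \\
v_s = 1,\ u = 2 &: \quad O\bigl(|G_s| \cdot |f|^{1 \cdot (2+1)}\bigr) = O\bigl(|G_s| \cdot |f|^{3}\bigr)
\end{align*}
```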
Finally, the parse hypergraph that we obtain from parsing with the source side of the grammar is intersected with an n-gram language model to also integrate $P_{LM}(e)$. We use cube pruning for this step (Chiang, 2007; Huang and Chiang, 2007). The difference to SCFG-based implementations is that the target string of a hypothesis that is scored by the language model is not necessarily continuous, but consists of a tuple of continuous blocks of target words, e.g. ⟨do not want, anymore⟩ if we would like to score a hypothesis which has been built from rule #3 in Figure 4. Therefore, each continuous block is scored separately and contributes its score to the overall score of the hypothesis. Furthermore, we need to store one language model state (simply put, remembering the first and last $n - 1$ words of the block) for each block. This means that a language model state in our implementation is a vector of conventional language model states whose length equals the size of the target tuple of the hypothesis. Note that since our grammar has a target fan-out of 2, this vector has a maximal length of 2, but this is not a fixed limit in the implementation.
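The block-wise language model scoring can be sketched as follows. This is a simplified illustration and not the C++ implementation: the `logprob` callable stands in for a real language model query, only n-grams that lie completely inside a block are scored, and the state of a block is just its first and last $n - 1$ words.

```python
from typing import Callable, List, Sequence, Tuple

LMState = Tuple[Tuple[str, ...], Tuple[str, ...]]    # (first n-1 words, last n-1 words)
LogProb = Callable[[Tuple[str, ...], str], float]    # (context, word) -> log-probability


def block_state(block: Sequence[str], order: int) -> LMState:
    """Remember the first and last n-1 words of one continuous block."""
    if order < 2:
        raise ValueError("this sketch assumes an n-gram order of at least 2")
    k = order - 1
    return tuple(block[:k]), tuple(block[-k:])


def score_block(block: Sequence[str], order: int, logprob: LogProb) -> float:
    """Sum the log-probabilities of all complete n-grams inside one block."""
    k = order - 1
    return sum(logprob(tuple(block[i - k:i]), block[i])
               for i in range(k, len(block)))


def score_hypothesis(blocks: Sequence[Sequence[str]], order: int,
                     logprob: LogProb) -> Tuple[float, List[LMState]]:
    """Score every block separately and keep one LM state per block."""
    states = [block_state(block, order) for block in blocks]
    total = sum(score_block(block, order, logprob) for block in blocks)
    return total, states


def toy_logprob(context: Tuple[str, ...], word: str) -> float:
    """Stand-in model that assigns every n-gram the same log-probability."""
    return -1.0


total, states = score_hypothesis((["do", "not", "want"], ["anymore"]), 3, toy_logprob)
print(total, states)
```

When two hypotheses are later combined and a gap is filled, the stored boundary words are what allows the n-grams spanning the junction to be scored.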
Since obtaining the k-best translations for a given input sentence is essential for tuning, we implement k-best extraction on the hypergraph that we obtain after cube pruning. We adopt the lazy strategy from Huang and Chiang (2005).
The decoder is implemented in C++, including code from KenLM for language modelling.

Setup
We run experiments for German-to-English, based on data that has been used in the WMT 2014 translation task. For training of the translation models, we use the parallel sentences from Europarl and the News Commentary Corpus up to a length of 30 words (1.3M sentence pairs). For language modeling, we use the KenLM Language Model Toolkit. We train a 3-gram language model on all available monolingual English data (Europarl, News Commentary and News Crawl; 92.7M sentences). From the available development data, we use newstest2013 as the development test set (max. 25 words). From the rest, we randomly select 3000 sentence pairs of a maximal length of 25 words as development set. We further refine this set to sentences without out-of-vocabulary source words by decoding the development set once and selecting the corresponding sentences. We thus end up with 1694 sentence pairs for tuning. As our test set, we use the cleaned test set that has been made available (2280 sentence pairs with a maximal length of 30 words).
We normalize the punctuation, tokenize and truecase all our data using the scripts that are available in Moses (Koehn et al., 2007). Furthermore, we perform compound splitting for German, also with the script provided in Moses.
The training data is word-aligned by running multi-threaded GIZA++ in both directions and then symmetrizing the alignments using the grow-diag-final-and heuristic as implemented in the Moses training script (steps 1-4). Lexical translation probabilities are also emitted as part of this pipeline. For grammar extraction, we limit the length of initial phrases and the number of words in a gap to 10. We allow neither unaligned words at the edges of initial phrases nor unaligned blocks.
Before decoding a data set with our decoder, we filter the large translation grammar with respect to the input data by extracting per-sentence grammars. These only contain rules whose terminals match the words in the sentence to translate.
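A simple form of such a filter is sketched below. It is a coarse approximation that only checks whether every source terminal of a rule occurs somewhere in the sentence; a stricter filter could additionally require the terminals to occur in the order given by the rule. The `Rule` container and the toy grammar are invented for illustration.

```python
from typing import Iterable, List, NamedTuple, Sequence, Tuple


class Rule(NamedTuple):
    """Hypothetical rule record: source terminals plus an opaque target side."""
    source_terminals: Tuple[str, ...]
    target: str


def filter_grammar(rules: Iterable[Rule], sentence: Sequence[str]) -> List[Rule]:
    """Keep only the rules whose source terminals all occur in the sentence."""
    vocabulary = set(sentence)
    return [rule for rule in rules
            if all(term in vocabulary for term in rule.source_terminals)]


# Toy grammar, invented for illustration only.
grammar = [
    Rule(("ne", "plus"), "not ... anymore"),
    Rule(("mange",), "eats"),
    Rule(("veux",), "want"),
]
per_sentence = filter_grammar(grammar, "il ne mange plus".split())
print([rule.source_terminals for rule in per_sentence])
```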
For the reported results, we set the buffer size for cube pruning to 400. We do not limit the number of words a non-terminal can span. We neither restrict the number of incoming hyperedges for the parse items nor the number of target side rules for the same source side rule.
Tuning the feature weights is done with minimum error rate training (Och, 2003), maximizing BLEU-4 (Papineni et al., 2002) and using the 200 best translations. For our own decoder, we use the very flexible implementation Z-MERT v1.50 (Zaidan, 2009). For Moses, we use the provided tuning script mert-moses.pl.
All reported BLEU scores have been calculated with the Moses script multi-bleu.perl, using the lowercase option -lc. Because of the variance that is introduced by tuning, we repeated each experiment four times and report the average of the final BLEU scores as well as the standard deviation.

Results
We compare different versions of our system against each other. The baseline is a system which uses only SCFG rules, i.e. a hierarchical phrase-based system. We refer to it as SYS(1,1), as it uses an SLCFRS of fan-out $2_{1|1}$. SYS(1,2) is a system which uses a grammar of fan-out $3_{1|2}$, i.e. it builds only continuous constituents on the source side, but allows for discontinuous constituents with two blocks on the target side. SYS(2,1) is the analogous system which restricts the target side to continuous constituents. Finally, SYS(2,2) uses an SLCFRS of fan-out $4_{2|2}$.
Table 1: Averaged BLEU scores over four tuning runs; the feat column indicates whether additional source/target gap degree features have been used.
Table 1 displays the main results. Allowing gaps on the source and the target side (SYS(2,2)) leads to a decline in BLEU score compared to the baseline. We hypothesize that this is due to weak probability estimates because of data sparseness and the additional ambiguity that is caused by the new rules with discontinuities. However, when adding the features about the gap degree of the rules used in the derivation, the model has an additional way of influencing which kind of rules are used. Especially controlling for the target gap degree turns out to be important and leads to a small improvement in BLEU score. Note, however, that rules with target gaps are not totally dismissed when this feature is switched on. In the test set, usage of rules with a target gap drops from 734.5 rules on average in SYS(2,2) to 76.5 rules on average in SYS(2,2)-T. They are used less often, but, it seems, in a more controlled and sensible way.
This tendency is further confirmed by the experiments in which the discontinuous rules are only used on one side. While restricting the source side derivations to continuous yields does not improve the BLEU score (it rather severely degrades it in the case of the devtest set), restricting the target side derivations leads to a small improvement in BLEU score, and even to the best system for the test set. This is particularly interesting with respect to translation times, since restricting the target side to continuous yields means removing the additional complexity that target gaps introduce.
We also report results for the hierarchical phrase-based system in Moses trained on the same data as our systems. We tried to use the same settings as for our comparable system SYS(1,1). However, given the number of parameters during training and decoding, the various interpretations thereof and numerous implementation details to consider, it is not too surprising that the Moses system actually produces different translations than ours. The reported numbers merely serve as a point of reference, indicating that the translations produced by our system are not totally far off.

Manual Evaluation
We furthermore performed a manual evaluation in the form of a system comparison using our own installation of the Appraise tool (Federmann, 2012). We compare the baseline SYS(1,1) against SYS(2,1), the best-performing setup on the test set. For each of them, we randomly selected one of the four configurations that led to the reported averaged BLEU score. We then selected those translations of the test set where SYS(2,1) uses at least one SLCFRS rule with a discontinuity (95 sentences).
We asked two native speakers of English (e1, e2) with basic knowledge of German to evaluate our test sentences. They were shown the source sentence, a reference translation, the SYS(1,1) translation and the SYS(2,1) translation. The latter two were presented anonymized and in random order. The options for the evaluators were (a) translation A is better than B, (b) translation B is better than A, and (c) translations A and B are of equal quality. We specifically asked them to use option (c) as rarely as possible. Table 2 shows the results. While our human evaluators do not demonstrate a clear preference for one of the systems, there is a slight preference for the system that uses discontinuous rules (SYS(2,1)). Although the inter-annotator agreement is not very high (Cohen's κ = 0.338), the tendency towards SYS(2,1) is also perceivable for the translations on which the evaluators agree in their decisions, see Table 3.
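For reference, Cohen's κ relates the observed agreement of two annotators to the agreement expected by chance. The sketch below computes it for two label sequences; the labels used in the example are invented and are not the judgements collected in this evaluation.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (observed - expected) / (1 - expected) agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both annotators pick the same category independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0   # both annotators always use the same single category
    return (observed - expected) / (1.0 - expected)


# Invented judgements: "A" = A is better, "B" = B is better, "C" = equal quality.
e1 = ["A", "A", "B", "C"]
e2 = ["A", "B", "B", "C"]
print(round(cohens_kappa(e1, e2), 3))
```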

Translation Example
We finish this section with an actual translation example, taken from the test set. It was picked because it makes crucial use of the discontinuous SLCFRS rules. In Figure 5, the following rule, which has a fan-out of 2 on the source side, leads to an overall grammatical sentence structure and a meaningful translation:
The rule derives the synchronous constituent labelled $X_4$ in Figure 5. Besides providing a correct verbal translation in a specific tense, it also establishes a relationship to the adjective ($X_1$) and the infinitive subordinate clause ($X_2$), thereby still leaving room for the adverb in terms of the gap on the source side. The adverb is then introduced with the following rule, leading to the constituent labelled $X_5$ in Figure 5:
This rule can be seen as capturing the different placement of the adverb auch/also in German and English.
Note that the alignment that is induced by the SYS(2,1) derivation is also derivable with a $2_{1|1}$-SLCFRS. One general possibility is to allow rules of rank $u > 2$. Another possibility is to put the individual phrases together in a different order and hierarchy. For example, in an SCFG rule, the discontinuous verb phrase could be combined with the adjective and the adverb first, which leads to a continuous constituent. Then the subordinate clause would be added in a later derivation step. However, in the derivation for the best translation of SYS(1,1), this does not happen because a corresponding specific rule has not been learned. The translation produced by SYS(1,1) is not grammatical and misses important concepts, such as geeignet (suitable).

Related Work
Several other translation models have been proposed which are expressive enough to generate the complex alignment configurations in Figure 1. Most notably, Galley and Manning (2010) propose a phrase-based translation system which allows for discontinuous phrase pairs, building upon the idea of a translation model proposed by Simard et al. (2005). They evaluate their system on a Chinese-to-English translation task and achieve some improvement in BLEU score over a phrase-based and a hierarchical phrase-based system. Unfortunately, we could not evaluate directly against their approach since the current documentation of their system, Phrasal (Green et al., 2014), does not mention the discontinuous phrases anymore (http://www-nlp.stanford.edu/wiki/Software/Phrasal, accessed on June 27, 2015). We also could not obtain the data sets they used for their experiments.
In some sense, our work is the hierarchical, tree-based counterpart to the phrase-based approach of Galley and Manning (2010). This means that our translation grammar rules unify two types of "gaps" of previous approaches: (a) gaps in the sense of non-terminals that are inserted into longer phrases when hierarchical rules are created, as in Chiang (2007); their purpose is a better generalization of the translation rules, and (b) gaps in the sense of discontinuities in the yield of a translation rule, on the source side, on the target side or both, driven by the idea of allowing for more flexible phrases such that generated alignment structures are not restricted.
Besides the suggestion of Kaeshammer (2013) to use SLCFRS as the translation grammar formalism, which we have detailed and implemented in this work, Søgaard (2008) proposes to apply range concatenation grammar, an even more expressive formalism than LCFRS, and to use its ability to copy substrings during the derivation. This approach has downsides, such as the lack of tight probability estimators, which are mentioned in Søgaard and Kuhn (2009).
An early advocate of translation modeling beyond context-free grammar formalisms is Melamed, who proposes to use Generalized Multitext Grammars, which are weakly equivalent to LCFRS. The incentive for this lies in linguistically motivated translation grammars and the general observation that discontinuous constituents are necessary for monolingual modelling of syntax.

Conclusions and Future Work
With this work, we extend the hierarchical phrase-based machine translation approach to discontinuous phrases, using SLCFRS as the translation grammar formalism. Since SLCFRS is a direct extension to SCFG, previous work on hierarchical phrase-based translation, in particular the model definition, training and decoding, can be extended to SLCFRS in a more or less direct manner. Evaluating our new system on a German-to-English translation task revealed a modest improvement in BLEU score over the SCFG baseline. Human evaluators showed a slight preference for translations produced by the SLCFRS system.
In the future, we will evaluate our approach on other language pairs, for example Chinese-English which has been used in related work. Furthermore, we would like to make use of recent advances in monolingual parsing of discontinuous constituents and use phrase-structure trees supporting discontinuous constituents for tree-based machine translation.