Exact Decoding with Multi Bottom-Up Tree Transducers

We present an experimental statistical tree-to-tree machine translation system based on the multi bottom-up tree transducer, including rule extraction, tuning and decoding. Thanks to input parse forests and a “no pruning” strategy during decoding, the obtained translations are competitive. The drawbacks are a restricted coverage of 70% on test data, in part due to exact input parse tree matching, and a relatively high runtime. Advantages include easy redecoding with a different weight vector, since the full translation forests can be stored after the first decoding pass.


Introduction
In this contribution, we present an implementation of a translation model that is based on -XMBOT (the multi bottom-up tree transducer of Arnold and Dauchet (1982) and Lilin (1978)). 1 Intuitively, an MBOT is a synchronous tree sequence substitution grammar (STSSG, Zhang et al. (2008a); Zhang et al. (2008b); Sun et al. (2009)) that has discontiguities only on the target side (Maletti, 2011). From an algorithmic point of view, this makes the MBOT more appealing than STSSG, as demonstrated by Maletti (2010). Formally, MBOT is expressive enough to express all sensible translations (Maletti, 2012). 2 Figure 2 displays sample rules of the MBOT variant that we use, called -XMBOT (in a graphical representation of the trees and the alignment). Recently, a shallow version of MBOT has been integrated into the popular Moses toolkit (Braune et al., 2013).
Our implementation is exact in the sense that it does absolutely no pruning during decoding and thus preserves all translation candidates, while having no mechanism to handle unknown structures. (We added dummy rules that leave unseen lexical material untranslated.) The coverage is thus limited (about 70% on test data), but still considerable. Source-side and target-side syntax restrict the search space so that decoding stays tractable. Only the language model scoring is implemented as a separate reranker. This has several advantages: (1) We can use input parse forests (Liu et al., 2009). (2) Not only is the output optimal with regard to the theoretical model, but the space of translation candidates can also be stored efficiently as a weighted regular tree grammar. The best translations can then be extracted using the k-best algorithm by Huang and Chiang (2005). Rule weights can be changed without the need for explicit redecoding, the parameters of the log-linear model can be changed, and even new features can be added. These properties are especially helpful in tuning, where only the k-best algorithm has to be re-run in each iteration.
A model in a similar spirit has been described by Huang et al. (2006); however, it used target syntax only (using a top-down tree-to-string transducer backwards), and was restricted to sentences of length at most 25. We do not make such restrictions.
* This work was supported by Deutsche Forschungsgemeinschaft grant MA/4959/1-1.
1 The system presented in this paper is a variant of the system presented at last year's workshop (Quernheim and Cap, 2014), without morphological enhancements.
2 A translation is sensible if it is of linear size increase and can be computed by some (potentially copying) top-down tree transducer.
The theoretical aspects of -XMBOT and their use in our translation model are presented in Section 2. Based on this, we implemented a machine translation system that we plan to make publicly available. Section 4 presents the most important components of our -XMBOT implementation, and Section 5 presents our submission to the WMT15 shared translation task.

Theoretical Model
In this section, we present the theoretical generative model that is used in our approach to syntax-based machine translation: the multi bottom-up tree transducer (Maletti, 2011). It is a variant of the linear and nondeleting extended multi bottom-up tree transducers without states. We omit the technical details and give graphical examples only to illustrate how the device works, but refer to the literature for the theoretical background. Roughly speaking, a local multi bottom-up tree transducer (MBOT) has rules that replace one nonterminal symbol N on the source side by a tree, and a sequence of nonterminal symbols on the target side linked to N by one tree each. These trees again have linked nonterminals, thus allowing further rule applications.
Our MBOT rules are obtained automatically from data like that in Figure 1. Thus, we (word) align the bilingual text and parse it in both the source and the target language. In this manner we obtain sentence pairs like the one shown in Figure 1. To these sentence pairs we apply the rule extraction method of Maletti (2011). The rules extracted from the sentence pair of Figure 1 are shown in Figure 2. Note the discontiguous alignment of went to ist and gegangen, resulting in discontiguous rules.
The application of those rules is illustrated in Figure 3 (a pre-translation is a pair consisting of a source tree and a sequence of target trees). While it shows a synchronous derivation, our main use case of MBOT rules is forward application or input restriction, that is, the calculation of all target trees that can be derived from a given source tree. For a given synchronous derivation d, the source tree generated by d is s(d), and the target tree is t(d).
The yield of a tree is the string obtained by concatenating its leaves.
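The notion of yield can be made concrete with a minimal sketch. The `Tree` class, the function names and the toy tree below are illustrative assumptions and are not taken from the system described here or from the paper's figures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Tree:
    """A labeled ordered tree; a node without children is a leaf."""
    label: str
    children: List["Tree"] = field(default_factory=list)


def tree_yield(tree: Tree) -> List[str]:
    """Return the leaf labels of `tree` from left to right."""
    if not tree.children:
        return [tree.label]
    leaves: List[str] = []
    for child in tree.children:
        leaves.extend(tree_yield(child))
    return leaves


def sequence_yield(trees: List[Tree]) -> List[List[str]]:
    """Yield of each tree in a target tree sequence (one entry per tree)."""
    return [tree_yield(t) for t in trees]


# Hypothetical toy tree with made-up labels.
toy = Tree("S", [Tree("NP", [Tree("Peter")]),
                 Tree("VP", [Tree("ist"), Tree("gegangen")])])
print(" ".join(tree_yield(toy)))  # Peter ist gegangen
```

For a pre-translation, `sequence_yield` keeps the yields of the individual target trees apart, which mirrors the fact that a discontiguous target side is a sequence of trees rather than a single tree.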
The theoretical justification for decomposing the translation model into a source model and a target model is a theorem that states that every MBOT can be replaced by a composition of a linear nondeleting extended top-down tree transducer (XTOP) and a linear homomorphic MBOT (Engelfriet et al., 2009). We implemented the first step of the composition as an XTOP that generates possible derivation trees. States in this device are linked nonterminals in the MBOT rules, and it translates left-hand sides into rule identifiers. The second step is implemented as a homomorphic multi bottom-up tree transducer. While we construct the first step of the composition explicitly, we only use the second device to evaluate single trees.
Apart from applying MBOT to input trees, we can also apply them to parse forests and even to weighted regular tree grammars (RTGs) (Fülöp and Vogler, 2009). RTGs offer an efficient representation of weighted forests, which are sets of trees such that each individual tree is equipped with a weight. This representation is even more efficient than packed forests and moreover can represent an infinite number of weighted trees. The most important property that we utilize is that the output tree language is regular, so we can represent it by an RTG (cf. preservation of regularity (Maletti, 2011)). Indeed, every input tree can only be transformed into finitely many output trees by our model, so for a given finite input forest (which the parser output is), the computed output forest will also be finite and thus regular.
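To illustrate how a weighted RTG packs a finite forest, the following sketch stores productions per nonterminal and computes the number of represented trees and the highest-weighted tree. The dictionary encoding, the toy grammar and the function names are assumptions made for illustration; the sketch also assumes an acyclic grammar and weights in [0, 1] that combine by multiplication.

```python
# Productions: nonterminal -> list of (weight, symbol, child nonterminals).
# Toy grammar with hypothetical weights.
RTG = {
    "S": [(1.0, "S", ("NP", "VP"))],
    "NP": [(0.6, "he", ()), (0.4, "she", ())],
    "VP": [(0.7, "went", ()), (0.3, "left", ())],
}


def count_trees(rtg, nt, cache=None):
    """Number of trees derivable from `nt` (acyclic grammars only)."""
    cache = {} if cache is None else cache
    if nt not in cache:
        total = 0
        for _weight, _symbol, children in rtg[nt]:
            n = 1
            for child in children:
                n *= count_trees(rtg, child, cache)
            total += n
        cache[nt] = total
    return cache[nt]


def best_tree(rtg, nt, cache=None):
    """Return (weight, tree) of the highest-weighted tree derivable from `nt`."""
    cache = {} if cache is None else cache
    if nt not in cache:
        best = None
        for weight, symbol, children in rtg[nt]:
            score, subtrees = weight, []
            for child in children:
                child_score, child_tree = best_tree(rtg, child, cache)
                score *= child_score
                subtrees.append(child_tree)
            if best is None or score > best[0]:
                best = (score, (symbol, tuple(subtrees)))
        cache[nt] = best
    return cache[nt]


print(count_trees(RTG, "S"))   # 4 trees from 5 productions
print(best_tree(RTG, "S")[1])
```

The five productions represent four trees; with more alternatives per nonterminal the number of trees grows multiplicatively while the grammar grows only additively, which is the source of the compactness.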

Translation Model
Given a source language sentence e and corresponding weighted parse forest F(e), our translation model aims to find the best corresponding target language translation $\hat{g}$, i.e.,

$$\hat{g} = \arg\max_{g} \, p(g \mid e).$$
We estimate the probability p(g|e) through a log-linear combination of component models with parameters λ_m, scored on the derivations d such that the source tree s(d) of d is in the parse forest of e and the yield of the target tree t(d) reads g. The component models are:
(1) Translation weight normalized by source root symbol
(2) Translation weight normalized by all root symbols
(3) Lexical translation weight source → target
(4) Lexical translation weight target → source
(5) Target side language model: p(g)
(6) Input parse tree probability assigned to s(d) by the parser of e
The rule weights required for (1) are relative frequencies normalized over all extracted rules with the same root symbol on the left-hand side. In the same fashion, the rule weights required for (2) are relative frequencies normalized over all rules with the same root symbols on both sides. The lexical weights for (3) and (4) are obtained by multiplying the word translations w(g_i|e_j) [respectively, w(e_j|g_i)] of lexically aligned words (g_i, e_j) across (possibly discontiguous) target side sequences. 5 Whenever a source word e_j is aligned to multiple target words, we average over the word translations.
5 The lexical alignments are different from the links used to link nonterminals.
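The relative-frequency weights for (1) and (2) and a log-linear score can be sketched as follows. The rule encoding (source root, tuple of target roots, identifier), the counts and the feature names are hypothetical. The scoring function uses the standard log-linear form, a weighted sum of log feature values, which is an assumption about the exact formula used in the system.

```python
import math
from collections import Counter


def relative_frequencies(rule_counts, key):
    """Normalize rule counts over all rules that share the same `key(rule)`."""
    totals = Counter()
    for rule, count in rule_counts.items():
        totals[key(rule)] += count
    return {rule: count / totals[key(rule)]
            for rule, count in rule_counts.items()}


def loglinear_score(features, lambdas):
    """Weighted sum of log feature values for one derivation."""
    return sum(lambdas[name] * math.log(value)
               for name, value in features.items())


# Hypothetical extracted rules: (source root, target roots, rule id) -> count.
counts = {
    ("VP", ("VAFIN", "VVPP"), "r1"): 3,
    ("VP", ("VAFIN", "VVPP"), "r2"): 1,
    ("VP", ("VP",), "r3"): 4,
}

# Feature (1): normalize by the source root symbol only.
by_source_root = relative_frequencies(counts, key=lambda r: r[0])
# Feature (2): normalize by the root symbols on both sides.
by_all_roots = relative_frequencies(counts, key=lambda r: (r[0], r[1]))

r1 = ("VP", ("VAFIN", "VVPP"), "r1")
print(by_source_root[r1], by_all_roots[r1])  # 0.375 0.75
```

Because the score is linear in the parameters once the feature values are fixed, changing λ_m only requires recomputing these sums, not re-extracting rules.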

Implementation
Our implementation is very close to the theoretical model and consists of several independent components, most of which are implemented in Python. The system does not have any dependencies other than the need for parsers for the source and target language, a word alignment tool and optionally an implementation of some tuning algorithm.
Rule extraction From a parallel corpus of which both halves have been parsed and word aligned, multi bottom-up tree transducer rules are extracted according to the procedure laid out by Maletti (2011). In order to handle unknown words, we add dummy identity translation rules for lexical material that was not present in the training data.
Translation model building Given a set of rules, translation weights (see above) are computed for each unique rule. The translation model is then converted into a source, a weight and a target model. The source model (an RTG represented in an efficient binary format) is used for decoding and maps input trees to trees over rule identifiers representing derivations. The weight model and the target model can be used to reconstruct the weight and the target realization of a given derivation.
Decoder For every input sentence, the decoder transforms a forest of parse trees to a forest of translation derivations by means of forward application. These derivations are trees over the set of rules (represented by rule identifiers). One of the most useful aspects of our model is the fact that decoding is completely independent of the weights, as no pruning is performed and all translation candidates are preserved in the translation forest. Thus, even after decoding, the weight model can be changed, augmented by new features, etc.; even the target model can be changed, e.g. to support parse tree output instead of string output. In all of our experiments, we used string output, but it is conceivable to use other realizations. For instance, a syntactic language model could be used for output tree scoring. Also, recasing is extremely easy when we have part-of-speech tags to base our decision on (proper names are typically uppercase, as are all nouns in German).
Another benefit of having a packed representation of all candidates is that we can easily check whether the reference translation is included in the candidate set ("force decoding"). The freedom to allow arbitrary target models that rewrite derivations is related to current work on interpreted regular tree grammars (Koller and Kuhlmann, 2011), where arbitrary algebras can be used to compute a realization of the output tree.
k-best extractor From the translation derivation RTGs, a k-best list of derivations can be extracted (Huang and Chiang, 2005) very efficiently. This is the only step that has to be repeated if the rule weights or the parameters of the log-linear model change. The derivations are then mapped to target language sentences (if several derivations realize the same target sentence, their weights are summed) and reranked according to a language model (as was done in Huang et al. (2006)). This is the only part of the pipeline where we deviate from the theoretical log-linear model, and this is where we might make search errors. In principle, one could integrate the language model by intersection with the translation model (as the stateful MBOT model is closed under intersection with finite automata), but this is (currently) not computationally feasible due to the size of models.
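The merging and reranking step can be sketched as below. The k-best list format, the toy language model and the way the two scores are combined (log translation weight plus a weighted language model log-probability) are assumptions for illustration only; the system's actual reranker may differ.

```python
import math
from collections import defaultdict


def merge_and_rerank(kbest, lm_logprob, lm_weight=1.0):
    """Sum weights of derivations with the same target sentence, then rerank.

    kbest: list of (target_sentence, translation_weight) pairs, one per derivation.
    lm_logprob: function mapping a sentence to a log-probability.
    """
    summed = defaultdict(float)
    for sentence, weight in kbest:
        summed[sentence] += weight
    scored = [(math.log(weight) + lm_weight * lm_logprob(sentence), sentence)
              for sentence, weight in summed.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _score, sentence in scored]


# Hypothetical k-best list: two derivations realize the same sentence.
kbest = [("er ist gegangen", 0.20),
         ("er ging", 0.25),
         ("er ist gegangen", 0.15)]


def uniform_lm(sentence):
    """Toy stand-in for a real language model: every sentence scores 0."""
    return 0.0


print(merge_and_rerank(kbest, uniform_lm)[0])  # er ist gegangen
```

In the toy list, no single derivation of "er ist gegangen" beats "er ging", but the summed weight does, which is why merging is done before reranking.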
Tuning Minimum error rate training (Och, 2003) is implemented using Z-MERT (Zaidan, 2009). A set of source sentences is (forest-)parsed and decoded; the translation forests are stored on disk. Then, in each iteration of Z-MERT, it suffices to extract k-best lists from the translation forests according to the current weight vector.
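The following sketch illustrates why a stored translation forest does not need to be redecoded when the weight vector changes. It extracts only the single best derivation, as a simplification of the k-best algorithm of Huang and Chiang (2005). The forest encoding, the rule identifiers and the feature values are hypothetical.

```python
import math

# Stored translation forest: state -> list of (rule id, child states). Acyclic.
FOREST = {
    "q0": [("r1", ("q1",)), ("r2", ("q1",))],
    "q1": [("r3", ())],
}

# Weight model: rule id -> feature values (hypothetical numbers).
FEATURES = {
    "r1": {"tm": 0.5, "lex": 0.1},
    "r2": {"tm": 0.2, "lex": 0.4},
    "r3": {"tm": 1.0, "lex": 1.0},
}


def rule_score(features, rule_id, lambdas):
    """Log-linear score of a single rule under the weight vector `lambdas`."""
    return sum(lambdas[name] * math.log(value)
               for name, value in features[rule_id].items())


def best_derivation(forest, features, state, lambdas, cache=None):
    """Return (score, derivation) of the best derivation rooted in `state`."""
    cache = {} if cache is None else cache
    if state not in cache:
        best = None
        for rule_id, children in forest[state]:
            score = rule_score(features, rule_id, lambdas)
            subderivations = []
            for child in children:
                child_score, child_deriv = best_derivation(
                    forest, features, child, lambdas, cache)
                score += child_score
                subderivations.append(child_deriv)
            if best is None or score > best[0]:
                best = (score, (rule_id, tuple(subderivations)))
        cache[state] = best
    return cache[state]


# The same stored forest is searched twice with different weight vectors.
for lambdas in ({"tm": 1.0, "lex": 0.0}, {"tm": 0.0, "lex": 1.0}):
    _score, derivation = best_derivation(FOREST, FEATURES, "q0", lambdas)
    print(lambdas, derivation[0])
```

The forest itself never changes between the two calls; only the extraction step is repeated, which corresponds to what happens in each tuning iteration.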

WMT15 Experimental setup
We used the training data that was made available for the WMT15 shared translation task on English-German. It consists of three parallel corpora (1.8M sentences of European parliament proceedings, 216K sentences of newswire text, and 2.3M sentences of web text after cleanup) and additional monolingual news data for language model training.
The English half of the parallel data was parsed using Egret, which is a re-implementation of the Berkeley parser (Petrov et al., 2006). For the German parse, we used the BitPar parser (Schmid, 2004; Schmid, 2006). The BitPar German grammar is highly detailed, which makes the syntactic information contained in the parses extremely useful. Part-of-speech tags and category labels are augmented by case, number and gender information, as can be seen in the German parse tree in Figure 1. We only kept the best parse for each sentence during training.
We then trained a 5-gram language model on monolingual data using KenLM (Heafield, 2011; Heafield et al., 2013). Word alignment was performed using the fast_align word aligner from cdec (Dyer et al., 2010). As usual, we discarded sentence pairs where one sentence was significantly longer than the other, as well as those that were too long or too short.
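A filter of this kind can be sketched as follows. The thresholds (`min_len`, `max_len`, `max_ratio`) and whitespace tokenization are assumptions chosen for illustration; the values actually used in the experiments are not stated.

```python
def keep_pair(source, target, min_len=1, max_len=80, max_ratio=3.0):
    """Decide whether a sentence pair passes the length-based filter.

    All thresholds are illustrative defaults, not the ones used in the paper.
    """
    n, m = len(source.split()), len(target.split())
    if min(n, m) == 0:
        return False  # empty side: nothing to align
    if not (min_len <= n <= max_len and min_len <= m <= max_len):
        return False  # too short or too long
    return max(n, m) / min(n, m) <= max_ratio  # reject lopsided pairs


pairs = [
    ("he went home", "er ist nach Hause gegangen"),
    ("yes", "ja , das ist wirklich sehr schön"),
    ("", "leer"),
]
filtered = [p for p in pairs if keep_pair(*p)]
print(len(filtered))  # 1
```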
For tuning, we chose the WMT12 test set (3,003 sentences of newswire text), available as part of the development data for the WMT13 shared translation task. Since our system had limited coverage on this tuning set, we limited ourselves to a subset of sentences that we could translate.
When translating the test set, our models used parse trees delivered by the Egret parser. After translation, recasing was done by examining the output syntax tree, using a simple heuristic that looks for nouns and sentence boundaries as well as common abbreviations. Since coverage on the test set was also limited, we used a simple word-based fallback system whenever an untranslated state was encountered in a derivation tree.
BLEU   TER
14.4   .777
Table 1: BLEU and TER scores of our system.
Results are significantly worse compared to last year's system which used morphological enhancements such as compound splitting (Quernheim and Cap, 2014) and a phrase-based fallback system for sentences that the exact decoder could not handle. However, we should note that where the fallback system was not needed, we achieved a BLEU score of 16.7.
From a linguistic point of view, constructions that involve long-distance reordering and agreement are typically handled well. Figure 4 shows some example sentences from the WMT13 test set in comparison to a phrase-based baseline system.
On the other hand, our system frequently makes mistakes in lexical choice, and often uses rules that have been extracted from erroneous alignments. Sometimes, these mistakes cannot be alleviated by the language model due to data sparsity (no competing good candidate translation).

Conclusion and further work
We presented our submission to the WMT15 shared translation task based on a novel, promising "full syntax, no pruning" tree-to-tree approach to statistical machine translation, inspired by Huang et al. (2006). There are, however, still major drawbacks and open problems associated with our approach. Firstly, the coverage can still be significantly improved. In these experiments, our model was able to translate only 70% of the test sentences. To some extent, this number can be improved by providing more training data. Also, more rules can be extracted if we not only use the best parse for rule extraction, but multiple parse trees, or even switch to forest-based rule extraction. Finally, the size of the input parse forest plays a role. For instance, if we only supply the best parse to our model, translation will fail for approximately half of the input.
However, there are inherent coverage limits. Because our model is extremely strict, it will never be able to translate sentences whose parse trees contain structures it has never seen before: it has to match at least one input parse tree exactly. While we implemented a simple solution to handle unknown words, the issue with unknown structures is not so easy to solve without breaking the otherwise theoretically sound approach. Possibly, glue rules can help.
Figure 4: Examples from the test set where our MBOT system performed better, linguistically speaking (M = MBOT system; P = phrase-based baseline system; R = reference translation; S = source sentence). Rough interlinear glosses are provided.
The second drawback is runtime. We were able to translate about 20 sentences per hour on one processor. Distributing the translation task on different machines, we were able to translate the WMT15 test set (3k sentences) in roughly three days. Given that the trend goes towards parallel programming, and considering the fact that our decoder is written in the rather slow language Python, we are confident that this is not a major problem. We were able to run the whole pipeline of training, tuning and evaluation on the WMT15 shared task data in less than one week. We are currently investigating whether A* k-best algorithms (Pauls and Klein, 2009;Pauls et al., 2010) can help to guide the translation process while maintaining optimality.
Thirdly, the language model is currently not integrated, but implemented as a separate reranking component. We are aware that an integrated language model might improve translation quality (see e.g. Chiang (2007), where 3-4 BLEU points are gained by LM integration). Some research on this topic already exists, e.g. Rush and Collins (2011), who use dual decomposition, and Aziz et al. (2013), who replace intersection with an upper bound that is easier to compute. It might also be feasible to intersect the language model (represented by a regular string grammar) lazily.