The Edit Distance Transducer in Action: The University of Cambridge English-German System at WMT16

This paper presents the University of Cambridge submission to WMT16. Motivated by the complementary nature of syntactical machine translation and neural machine translation (NMT), we exploit the synergies of Hiero and NMT in different combination schemes. Starting out with a simple neural lattice rescoring approach, we show that the Hiero lattices are often too narrow for NMT ensembles. Therefore, instead of a hard restriction of the NMT search space to the lattice, we propose to loosely couple NMT and Hiero by composition with a modified version of the edit distance transducer. The loose combination outperforms lattice rescoring, especially when using multiple NMT systems in an ensemble.


Introduction
Previous work suggests that syntactic machine translation such as Hiero (Chiang, 2007) and Neural Machine Translation (NMT) (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015) are very different and have complementary strengths and weaknesses (Neubig et al., 2015; Stahlberg et al., 2016). Recent attempts to combine syntactic SMT and NMT report large gains over both baselines. Neubig et al. (2015) used NMT to rescore n-best lists which were generated with a syntax-based system. They report that even with 1000-best lists, the gains of using the NMT rescorer often do not saturate. Syntactically Guided NMT (Stahlberg et al., 2016, SGNMT) constrains the NMT search space to Hiero translation lattices, which contain significantly more hypotheses than n-best lists. In SGNMT, an NMT beam decoder with a relatively small beam can explore spaces much larger than n-best lists, yielding BLEU score improvements with far fewer expensive NMT evaluations.
However, these rescoring approaches enforce an exact match between the NMT and syntactic decoders. In general, this kind of hard restriction is best avoided when combining diverse systems (Liu et al., 2009; Frederking et al., 1994). For example, in speech recognition, ROVER (Fiscus, 1997) is a system combination approach based on a soft voting scheme. In machine translation, minimum Bayes-risk (MBR) decoding (Kumar and Byrne, 2004) can be used to combine multiple systems (de Gispert et al., 2009). MBR also does not enforce exact agreement between systems as it distinguishes between the hypothesis space and the evidence space (Goel and Byrne, 2000; Tromble et al., 2008).
We find that Hiero lattices generated by grammars extracted with the usual heuristics (Chiang, 2007) do not provide enough variety to explore the full potential of neural models, especially when using NMT ensembles. Therefore, we present a "soft" lattice-based combination scheme which uses standard operations on finite state transducers such as composition. Our method replaces the hard combination in previous methods with a similarity measure based on the edit distance, and gives the NMT decoder more freedom to diverge from the Hiero translations. We find that this loose coupling scheme is especially useful when using NMT ensembles.

Combining Hiero and NMT via Edit Distance Transducer
Figure 1: "Flower automata" for calculating edit distances over the alphabet {a, b, UNK}.
With loose coupling, the NMT decoder is not restricted to the Hiero lattice as in previous work, but runs independently to produce translation lattices on its own, which are then combined with the Hiero lattices. The combination does not require an exact match. Instead, we will describe a procedure for combining NMT and Hiero that captures similarity under the edit distance and both the NMT and Hiero translation system scores. This scheme is implemented efficiently using standard FST operations (Allauzen et al., 2007). First, we introduce the FST composition operation and the edit distance transducer. We will describe the whole pipeline in Sec. 2.3.

Composition of Finite State Transducers
The composition of two weighted transducers T_1, T_2 (denoted as T_1 ∘ T_2) over a semiring (K, ⊕, ⊗) is defined following (Mohri, 2004):
\[ [T_1 \circ T_2](x, y) = \bigoplus_{z} T_1(x, z) \otimes T_2(z, y) \tag{1} \]
We will make extensive use of this operation as a tool for building complex automata which make use of both the NMT and Hiero translation lattices.
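To make the composition operation concrete, the following is a minimal sketch over the tropical semiring (⊕ = min, ⊗ = +). It is an illustration under simplifying assumptions, not how OpenFST implements composition: each transducer is modeled as a plain dictionary from (input string, output string) pairs to costs rather than as a state graph, and the name `compose` and the toy strings are hypothetical.

```python
from collections import defaultdict


def compose(t1, t2):
    """Compose two weighted string relations over the tropical semiring.

    Assumption: a transducer is represented as a dict that maps
    (input, output) string pairs to a cost; missing pairs cost infinity.
    """
    result = defaultdict(lambda: float("inf"))
    for (x, z1), w1 in t1.items():
        for (z2, y), w2 in t2.items():
            if z1 == z2:
                # Semiring "times" is +, semiring "plus" is min.
                result[(x, y)] = min(result[(x, y)], w1 + w2)
    return dict(result)


# Toy relations: t1 maps "a b" to two intermediate strings,
# t2 maps those intermediate strings to final outputs.
t1 = {("a b", "c d"): 1.0, ("a b", "c e"): 0.5}
t2 = {("c d", "f"): 0.25, ("c e", "f"): 1.0, ("c e", "g"): 0.125}

composed = compose(t1, t2)
print(composed)
```

For the pair ("a b", "f") there are two intermediate strings, and the cheaper route through "c d" is kept, which mirrors how the shortest path through a composed lattice selects the best combination of the component scores.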

The Edit Distance Transducer
Composition can be used together with a "flower automaton" to calculate the edit distance between two sequences (Mohri, 2003). The edit distance transducer shown in Fig. 1(a) transduces a sequence x to another sequence y over the alphabet {a, b} and accumulates the number of edit operations via the transitions with cost 1. In our case, x corresponds to an NMT hypothesis which is to be combined with a Hiero hypothesis y. In contrast to SGNMT, where we require an exact match between NMT and Hiero (up to UNKs), our edit-distance-based scheme allows different hypotheses to be combined. We replaced the standard definition of the edit distance transducer (Mohri, 2003) by a finer-grained model designed to work well for combining NMT and Hiero. Instead of uniform costs, we lower the cost for UNK substitutions as we want to encourage substituting NMT UNKs by words in the Hiero translation. We distinguish between three types of edit operations.
• Type I: Substituting UNK with a word outside the NMT vocabulary is free.
• Type II: For substitutions of UNK with a word inside the NMT vocabulary we add the cost λ_sub.
• Type III: All other edit operations are penalized with cost λ_edit (and λ_edit > λ_sub).
We will refer to the modified edit distance transducer as E. Fig. 1(b) shows E over the alphabet {a, b, UNK}, with 'a' being an NMT OOV.
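The three cost types can be illustrated with a small dynamic program that computes the cheapest sequence of edits turning an NMT hypothesis into a Hiero hypothesis. This is only a sketch of the cost model, not the transducer-based implementation: the function name, the default values for λ_sub and λ_edit, and the toy vocabulary are assumptions, and UNK insertions (handled separately in the pipeline) are not modeled.

```python
UNK = "UNK"


def modified_edit_distance(nmt, hiero, nmt_vocab, lam_sub=0.5, lam_edit=1.0):
    """Cheapest cost of rewriting the NMT token list into the Hiero one.

    lam_sub and lam_edit are illustrative values with lam_edit > lam_sub.
    """
    def sub_cost(x, y):
        if x == y:
            return 0.0                      # matching tokens are free
        if x == UNK:
            # Type II if the Hiero word is in the NMT vocabulary, else Type I.
            return lam_sub if y in nmt_vocab else 0.0
        return lam_edit                     # Type III substitution

    n, m = len(nmt), len(hiero)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * lam_edit              # delete all NMT tokens
    for j in range(1, m + 1):
        d[0][j] = j * lam_edit              # insert all Hiero tokens
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + lam_edit,     # Type III deletion
                d[i][j - 1] + lam_edit,     # Type III insertion
                d[i - 1][j - 1] + sub_cost(nmt[i - 1], hiero[j - 1]),
            )
    return d[n][m]


vocab = {"die", "in", "darf", UNK}          # hypothetical NMT vocabulary
free = modified_edit_distance(["in", UNK, "darf"],
                              ["in", "Grosswahlstadt", "darf"], vocab)
cheap = modified_edit_distance(["in", UNK, "darf"],
                               ["in", "die", "darf"], vocab)
full = modified_edit_distance(["in", "der", "darf"],
                              ["in", "die", "darf"], vocab)
print(free, cheap, full)
```

In this toy setting, replacing UNK by the out-of-vocabulary word is free, replacing it by an in-vocabulary word costs λ_sub, and an ordinary word substitution costs λ_edit.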

Loose Coupling of Hiero and NMT
Our edit-distance-based scheme combines an NMT translation lattice N with a Hiero translation lattice H. Weights in N and H are scaled by λ_nmt and λ_hiero, respectively. The similarity measure between NMT and Hiero translations is parametrized with λ_ins, λ_edit, and λ_sub. We keep the various costs separated by using transducers with tropical sparse tuple vector semirings (Iglesias et al., 2015). Instead of single real-valued arc weights, this semiring uses vectors which can hold multiple features. The inner product of these vectors with a constant parameter vector determines the final arc weights.
In the example of Fig. 2, the shortest path in H containing the string nicht erlaubt sein sollte zu has grammatical and stylistic flaws but is complete, whereas there is a better path in N with an UNK. Our goal is to merge these two hypotheses by using the NMT translation in N with the UNK replaced by a word from the Hiero lattice H.
1. Adding UNK insertions. We found that NMT often produces an isolated UNK token, even if multiple tokens are required. Therefore, we allow extending a single UNK token to a sequence of up to three UNK tokens. This is realized by replacing UNK arcs in N with the transducer U shown in Fig. 3 using OpenFST's Replace operation. Fig. 2(c) shows the result of the replace operation when applied to the example lattice N in Fig. 2(b). We denote this operation as follows:
\[ \text{Replace}(N, \text{UNK}, U) \tag{2} \]
2. Composition with the edit distance transducer. The next step finds the edit distances to the Hiero hypotheses as described in Sec. 2.2:
\[ C := \text{Replace}(N, \text{UNK}, U) \circ E \circ H \tag{3} \]
3. Shortest path. The above operation generates very large lattices, and dumping all of them is not feasible. We could use disambiguation (Iglesias et al., 2015; Mohri and Riley, 2015) on the combined transducer C to find the best alignment for each unique NMT hypothesis. However, we only need the single shortest path in order to generate the combined translation:
\[ \text{ShortestPath}(C) \tag{4} \]
4. Projection. A complete path in the transducer C has an NMT hypothesis on the input labels (marked green in Fig. 2(d)) and a Hiero hypothesis on the output labels (marked blue in Fig. 2(d)). Therefore, we can generate different translations from the best path in C. If we project the input labels onto the output labels with OpenFST's Project, we obtain a hypothesis t_NMT in the NMT lattice N:
\[ t_{NMT} = \text{Project}_{\text{input}}(\text{ShortestPath}(C)) \tag{5} \]
However, t_NMT still contains UNKs. If we instead project the output labels onto the input labels, we end up with the aligned Hiero hypothesis without UNKs (blue labels in Fig. 2(d)), but we do not use the NMT translation directly:
\[ t_{Hiero} = \text{Project}_{\text{output}}(\text{ShortestPath}(C)) \tag{6} \]
Therefore, we introduce a new projection function Π_UNK which switches between preserving symbols on the input and output tapes: if the input label on an arc is UNK, we write the output label over the input label. Otherwise, we write the input label over the output label. This is equivalent to projecting the output labels to the input labels only if the input label is UNK, and then projecting the input labels to the output labels. As shown in Fig. 2(e), we obtain the NMT hypothesis, but the UNK is replaced by the matching word Grosswahlstadt from the Hiero lattice. Thus, the final combined translation is described by the following term:
\[ t_{comb} = \Pi_{UNK}(\text{ShortestPath}(C)) \tag{7} \]
In general, the final hypothesis t_comb is a mix of an NMT and a Hiero hypothesis. We do not search for t_comb directly but for pairs of NMT and Hiero translations which optimize the individual model scores as well as the distance between them. Stated more formally, the shortest path in C yields a pair (t_NMT, t_Hiero) for which the following holds:
\[ (t_{NMT}, t_{Hiero}) = \operatorname*{arg\,min}_{(t_N, t_H) \in N \times H} \Big( d_{edit}(t_N, t_H) + \lambda_{nmt} S_N(t_N|s) + \lambda_{hiero} S_H(t_H|s) \Big) \tag{8} \]
where d_edit(t_N, t_H) is the modified edit distance between t_N and t_H (according to E and U), and S_N(t_N|s) and S_H(t_H|s) are the scores NMT and Hiero assign to the translations given source sentence s. If we interpret these scores as negative log-likelihoods, we arrive at a probabilistic interpretation of Eq. 8:
\[ (t_{NMT}, t_{Hiero}) = \operatorname*{arg\,max}_{(t_N, t_H) \in N \times H} e^{-d_{edit}(t_N, t_H)} \cdot P(t_N, t_H|s) \tag{9} \]
with (assuming independence)
\[ P(t_N, t_H|s) = P_{NMT}(t_N|s)^{\lambda_{nmt}} \cdot P_{Hiero}(t_H|s)^{\lambda_{hiero}} \tag{10} \]
Eq. 9 suggests that we maximize the product of two quantities: the similarity between Hiero and NMT hypotheses and their joint probability. The FST operations allow us to optimize over the set N × H efficiently. Note that the NMT lattice N is rather small in our case (|N| ≤ 20) due to the small beam size used in NMT decoding. This makes it possible to solve Eq. 8 almost always without pruning.
1 The ucam-smt tutorial contains details on the tropical sparse tuple vector semiring: http://ucamsmt.github.io/tutorial/basictrans.html#lmert veclats tst
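The pair search and the UNK-switching projection described in this section can be sketched without any FST machinery. The code below is a brute-force illustration under stated assumptions, not the transducer pipeline itself: hypotheses are token tuples mapped to costs (read as negative log-likelihoods), the best path is approximated by a position-by-position alignment, and `toy_distance` is a deliberately simple stand-in for the modified edit distance. All names, scores, and default weights are hypothetical.

```python
EPS, UNK = "<eps>", "UNK"


def project_unk(path):
    """Apply the UNK-switching projection to a list of (input, output) arcs."""
    tokens = []
    for nmt_label, hiero_label in path:
        # Take the Hiero label only where the NMT side is UNK.
        label = hiero_label if nmt_label == UNK else nmt_label
        if label != EPS:                    # epsilon labels emit nothing
            tokens.append(label)
    return tokens


def best_pair(nmt_hyps, hiero_hyps, distance, lam_nmt=1.0, lam_hiero=1.0):
    """Exhaustively minimise distance plus scaled scores over all pairs."""
    def total_cost(pair):
        t_n, t_h = pair
        return (distance(t_n, t_h)
                + lam_nmt * nmt_hyps[t_n]
                + lam_hiero * hiero_hyps[t_h])

    pairs = [(t_n, t_h) for t_n in nmt_hyps for t_h in hiero_hyps]
    return min(pairs, key=total_cost)


def toy_distance(t_n, t_h):
    # Stand-in distance (assumption): UNK matches any word for free,
    # every other mismatch or length difference costs 1.
    mismatches = sum(x != y and x != UNK for x, y in zip(t_n, t_h))
    return mismatches + abs(len(t_n) - len(t_h))


# Toy lattices as explicit hypothesis lists with made-up costs.
nmt_hyps = {("in", "UNK", "darf"): 1.0, ("in", "der", "darf"): 1.5}
hiero_hyps = {("in", "Grosswahlstadt", "darf"): 2.0,
              ("in", "Grosswahlstadt", "sollte"): 1.75}

t_nmt, t_hiero = best_pair(nmt_hyps, hiero_hyps, toy_distance)
t_comb = project_unk(list(zip(t_nmt, t_hiero)))
print(t_nmt, t_hiero, t_comb)
```

Enumerating all pairs is only viable for tiny hypothesis sets; the lattice-based formulation exists precisely because H is far too large to enumerate, while composition and shortest path perform the same minimisation implicitly.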

Experimental Setup
The parallel training data includes Europarl v7, Common Crawl, and News Commentary v10. Sentence pairs with sentences longer than 80 words or length ratios exceeding 2.4:1 were deleted, as were Common Crawl sentences from other languages (Shuyo, 2010). We use news-test2014 (the filtered version) as a development set, and keep news-test2015 and news-test2016 as test sets.
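The length-based part of this filtering can be sketched as follows. This is an illustrative reading of the two thresholds, not the preprocessing script used for the submission: whitespace tokenization, the handling of empty sides, and the function name are assumptions, and the language identification step is omitted.

```python
def keep_pair(src, trg, max_len=80, max_ratio=2.4):
    """Return True if a sentence pair passes the length filters."""
    n_src, n_trg = len(src.split()), len(trg.split())
    if n_src == 0 or n_trg == 0:
        return False                        # assumption: drop empty sides
    if n_src > max_len or n_trg > max_len:
        return False                        # a sentence longer than 80 words
    # Drop pairs whose length ratio exceeds 2.4:1 in either direction.
    return max(n_src, n_trg) / min(n_src, n_trg) <= max_ratio


pairs = [
    ("a short sentence", "ein kurzer Satz"),
    ("one", "eins zwei drei"),
    (" ".join(["w"] * 81), " ".join(["w"] * 81)),
]
kept = [pair for pair in pairs if keep_pair(*pair)]
print(len(kept))
```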
The NMT systems are built using the Blocks framework (van Merriënboer et al., 2015) based on the Theano library (Bastien et al., 2012) with the network architecture and hyper-parameters as in (Bahdanau et al., 2015): the encoder and decoder networks consist of 1000 gated recurrent units (Cho et al., 2014). The decoder uses a single maxout (Goodfellow et al., 2013) output layer with the feed-forward attention model described in (Bahdanau et al., 2015). In our final ensemble, we use 8 independently trained NMT systems with vocabulary sizes between 30,000 and 60,000.
Rules for our En-De Hiero system were extracted as described in (de Gispert et al., 2010). A 5-gram language model for the Hiero system was trained on WMT16 parallel and monolingual data (Heafield et al., 2013).
We apply gentle post-processing to the German output for fixing small number and currency formatting issues. The English source sentences in the training corpus are lower-cased. During decoding, we lower-case only in-vocabulary words, and pass through OOVs with correct casing. We apply a simple heuristic for recognizing surnames to avoid literal translation of them into German.

Results
Tab. 1 reports performance on news-test2014, news-test2015, and news-test2016. Similarly to previous work (Stahlberg et al., 2016), we observe that rescoring Hiero lattices with NMT (SGNMT) outperforms both NMT and Hiero baselines significantly on all test sets. For SGNMT, we see further improvements of between +0.7 BLEU (news-test2014) and +1.1 BLEU (news-test2015) by using NMT ensembles rather than single NMT. However, these gains are rather small considering the improvements from using ensembles for the (pure) NMT baseline (between +1.9 BLEU and +2.2 BLEU). Our combination scheme makes better use of the ensembles. We report 31.3 BLEU on news-test2016, which in the English-German WMT'16 evaluation is among the best systems (within 0.1 BLEU) which do not use back-translation (Sennrich et al., 2016a). Back-translation is a technique for making use of monolingual data in NMT training, and we expect our system could benefit from back-translation, although we leave this analysis to future work.
The combination procedure we propose is non-trivial. It is not immediately clear how the gains arise as the final scores are mixtures between edit distance costs, NMT scores, and Hiero scores. In the remainder we will try to provide some insight. Unless stated otherwise, we report investigations into the Hiero + NMT 8-system ensemble which yields the best results in Tab. 1.
First, we focus on the projection function Π_UNK(·) which switches between preserving the input and output label at the UNK symbol to produce the combined translation t_comb (Eq. 7). As explained in Sec. 2.3, we can use OpenFST's Project operation to fetch the NMT and Hiero hypotheses t_NMT and t_Hiero which have been used to produce the combined translation (Eq. 5 and 6). Tab. 2 shows that the hypotheses that are aligned in the final transducer are often not the 1-best translations of any of the baseline systems.
Remarkably, using the t_Hiero translations results in 30.4 BLEU, which is a very substantial improvement over the baseline Hiero system (26.0 BLEU). Note that this BLEU score is achieved with hypotheses from the original Hiero lattice H but weighted in combination with the NMT scores and the edit distance. However, these selected paths are often given very low scores by Hiero: in only 8.6% of the sentences is the Hiero hypothesis left unchanged. If we look for t_Hiero in the Hiero n-best list, we find that even very deep 20,000-best lists contain only 63.5% of the Hiero hypotheses which were selected by the combination scheme (Fig. 4). This indicates the benefit of using lattice-based approaches over n-best lists.
Next, we investigate the distance measure between NMT and Hiero translations, which is realized with the UNK insertion transducer U and the modified edit distance transducer E (Sec. 2.3). Tab. 3 shows that UNK insertions are relatively rare compared to the edit operations of types II and III allowed by E (Sec. 2.3). The average edit distance between NMT and Hiero disregarding UNKs on the best path (type III) is 1.74. In 61.7% of the cases the input and output labels differ not only at UNK, i.e., in only 38.3% of the sentences do we have an exact match between NMT and Hiero. We note that UNK is often replaced with an NMT in-vocabulary word (55.9% of the sentences). It seems that NMT often produces an UNK even if a better word is in the NMT vocabulary. This could be due to the over-representation of UNK in the NMT training corpus.
To study the effectiveness of our edit distance transducer based combination scheme in correcting NMT UNKs, we trained individual NMT systems with vocabulary sizes between 10,000 and 60,000. Tab. 4 shows that nearly one in six tokens (16.3%) produced by our pure NMT system with a vocabulary size of 30,000 are UNKs. Increasing the NMT vocabulary to 50k or 60k does improve pure NMT very significantly, but results show that these improvements are already captured by the combination scheme with Hiero. As in the literature, we see large variation in performance over individual NMT systems even with the same vocabulary size (Sennrich et al., 2016b), which could explain the small performance drop when increasing the vocabulary size from 50k to 60k.
One important practical issue for system building is the number of systems to be ensembled, as training each individual NMT system takes a significant amount of time. Fig. 5 indicates that even for 8-ensembles the gains for pure NMT do not seem to saturate. The combination with Hiero via edit distance transducer also greatly benefits from using ensembles, but most of the gains are gotten with fewer systems.
Table 4: BLEU scores on news-test2016 for different vocabulary sizes (single NMT). Each individual NMT system is combined with Hiero as described in Sec. 2.3.
Figure 5: BLEU score over the number of systems in the ensemble on news-test2016.

Conclusion and Future Work
We have presented a method based on the edit distance that is effective in combining Hiero SMT systems with NMT ensembles. Our approach makes use of standard WFST operations, and we showed the effectiveness of the approach with a successful WMT'16 submission for English-German. In the future, we are planning to add back-translation (Sennrich et al., 2016a) and investigate the use of character- or subword-based NMT (Sennrich et al., 2016b; Chitnis and DeNero, 2015; Ling et al., 2015; Chung et al., 2016; Luong and Manning, 2016) within our combination framework.

Figure 1: (a) Standard edit distance transducer. (b) Modified edit distance transducer E. 'a' is an NMT OOV.
Figure 2: (a) Example Hiero lattice H. (b) Example NMT lattice N. (c) Transducer with UNK insertion arcs: Replace(N, UNK, U). (d) Best path in the combined transducer C. Hiero scores are omitted in this figure. (e) Projection of the best path: Π_UNK(ShortestPath(C)). The final hypothesis is die regionale Politik in Grosswahlstadt darf jedoch nicht leiden.
Figure 4: Percentage of t_Hiero hypotheses found in the baseline Hiero n-best list.
Table 3: Breakdown of the distances measured between NMT and Hiero along the shortest path in C on news-test2016.