Neural Machine Translation by Minimising the Bayes-risk with Respect to Syntactic Translation Lattices

We present a novel scheme to combine neural machine translation (NMT) with traditional statistical machine translation (SMT). Our approach borrows ideas from linearised lattice minimum Bayes-risk decoding for SMT. The NMT score is combined with the Bayes-risk of the translation according to the SMT lattice. This makes our approach much more flexible than $n$-best list or lattice rescoring as the neural decoder is not restricted to the SMT search space. We show an efficient and simple way to integrate risk estimation into the NMT decoder which is suitable for word-level as well as subword-unit-level NMT. We test our method on English-German and Japanese-English and report significant gains over lattice rescoring on several data sets for both single and ensembled NMT. The MBR decoder produces entirely new hypotheses far beyond simply rescoring the SMT search space or fixing UNKs in the NMT output.


Introduction
Lattice minimum Bayes-risk (LMBR) decoding has been applied successfully to translation lattices in traditional SMT to improve translation performance of a single system (Kumar and Byrne, 2004; Tromble et al., 2008). However, minimum Bayes-risk (MBR) decoding is also a very powerful framework for combining diverse systems (Sim et al., 2007; de Gispert et al., 2009). Therefore, we study combining traditional SMT and NMT in a hybrid decoding scheme based on MBR. We argue that MBR-based methods in their present form are not well-suited for NMT for the following reasons:
• Previous approaches work well with rich lattices and diverse hypotheses. However, NMT decoding usually relies on beam search with a limited beam and thus produces very narrow lattices (Li and Jurafsky, 2016; Vijayakumar et al., 2016).
• NMT decoding is computationally expensive. Therefore, it is difficult to collect the statistics needed for risk calculation for NMT.
• The Bayes-risk in SMT is usually defined for complete translations. Therefore, the risk computation needs to be restructured in order to integrate it in an NMT decoder which builds up hypotheses from left to right.
To address these challenges, we use a special loss function which is computationally tractable as it avoids using NMT scores for risk calculation. We show how to reformulate the original LMBR decision rule for use in a word-based NMT decoder which is not restricted to an n-best list or a lattice. Our hybrid system outperforms lattice rescoring on multiple data sets for English-German and Japanese-English. We report similar gains from applying our method to subword-unit-based NMT rather than word-based NMT.

Combining NMT and SMT by Minimising the Lattice Bayes-risk
We propose to collect statistics for MBR from a potentially large translation lattice generated with SMT, and use the n-gram posteriors as additional score in NMT decoding. The LMBR decision rule used by Tromble et al. (2008) has the form
$$\hat{y} = \arg\max_{y \in Y_h} \underbrace{\Big( \Theta_0 |y| + \sum_{u \in N} \Theta_{|u|} \#_u(y) P(u|Y_e) \Big)}_{=: E(y)} \tag{1}$$
where $Y_h$ is the hypothesis space of possible translations, $Y_e$ is the evidence space for computing the Bayes-risk, and $N$ is the set of all n-grams in $Y_e$ (typically, $n = 1 \ldots 4$). The parameters $\Theta_0$ and $\Theta_{|u|}$ weight the hypothesis length $|y|$ and the n-grams of order $|u|$, respectively. In this work, our evidence space $Y_e$ is a translation lattice generated with SMT. The function $\#_u(y)$ counts how often n-gram $u$ occurs in translation $y$. $P(u|Y_e)$ denotes the path posterior probability of $u$ in $Y_e$. Our aim is to integrate these n-gram posteriors into the NMT decoder since they correlate well with the presence of n-grams in reference translations (de Gispert et al., 2013). We call the quantity to be maximised the evidence $E(y)$, which corresponds to the (negative) Bayes-risk that is normally minimised in MBR decoding. We emphasise that this risk can be computed for any translation hypothesis and not only those produced by the SMT system.
NMT assigns a probability to a translation $y = y_1^T$ of source sentence $x$ via a left-to-right factorisation based on the chain rule:
$$P_{NMT}(y_1^T|x) = \prod_{t=1}^{T} P_{NMT}(y_t|y_1^{t-1}, x) = \prod_{t=1}^{T} g(y_{t-1}, s_t, c_t) \tag{2}$$
where $g(\cdot)$ is a neural network using the hidden state of the decoder network $s_t$ and the context vector $c_t$ which encodes relevant parts of the source sentence. $P_{NMT}(\cdot)$ can also represent an ensemble of NMT systems, in which case the scores of the individual systems are multiplied together to form a single distribution. Applying the LMBR decision rule in Eq. 1 directly to NMT would involve computing $P_{NMT}(y|x)$ for all translations in the evidence space. In the case of LMBR this is equivalent to rescoring the entire translation lattice exhaustively with NMT. However, this is not feasible even for small lattices because the evaluation of $g(\cdot)$ is computationally very expensive.
Therefore, we propose to calculate the Bayes-risk over SMT translation lattices using only pure SMT scores, and bias the NMT decoder towards low-risk hypotheses. Our final combined decision rule is
$$\hat{y} = \arg\max_{y} \big( E(y) + \lambda \log P_{NMT}(y|x) \big) \tag{3}$$
If $y$ contains a word not in the NMT vocabulary, the NMT model provides a score and updates its decoder state as for an unknown word. We note that $E(y)$ can be computed even if $y$ is not in the SMT lattice. Therefore, Eq. 3 can be used to generate translations outside the SMT search space. We further note that Eq. 3 can be derived as an instance of LMBR under a modified loss function.
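The combined decision rule can be illustrated with a minimal Python sketch. It is not the decoder used in our experiments: the function names, the dictionary representation of the n-gram posteriors and the toy posterior values are assumptions made for this example. The evidence is the weighted hypothesis length plus the weighted posteriors of all n-gram occurrences, and n-grams that are not in the table contribute a posterior of zero.

```python
import math
from collections import Counter


def ngram_counts(tokens, max_order=4):
    """Count every n-gram (as a tuple) of order 1..max_order in tokens."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def evidence(tokens, posteriors, theta):
    """E(y) = theta[0] * |y| + sum_u theta[|u|] * #_u(y) * P(u|Y_e)."""
    score = theta[0] * len(tokens)
    for u, count in ngram_counts(tokens, len(theta) - 1).items():
        # n-grams outside the SMT evidence space have posterior 0
        score += theta[len(u)] * count * posteriors.get(u, 0.0)
    return score


def combined_score(tokens, posteriors, theta, lam, nmt_log_prob):
    """Objective of the combined rule: E(y) + lambda * log P_NMT(y|x)."""
    return evidence(tokens, posteriors, theta) + lam * nmt_log_prob


# Hypothetical n-gram posteriors, as they might be extracted from an SMT lattice.
posteriors = {
    ("the",): 0.9, ("cat",): 0.8, ("sat",): 0.6,
    ("the", "cat"): 0.7, ("cat", "sat"): 0.5,
    ("the", "cat", "sat"): 0.4,
}
theta = [1.0, 1.0, 1.0, 1.0, 1.0]  # Theta_0 .. Theta_4

in_lattice = evidence(["the", "cat", "sat"], posteriors, theta)
outside_lattice = evidence(["the", "dog", "sat"], posteriors, theta)
total = combined_score(["the", "cat", "sat"], posteriors, theta,
                       lam=0.5, nmt_log_prob=math.log(0.25))
```

The second hypothesis contains a word the toy table has never seen, yet it still receives a well-defined evidence value, which is what allows the decoder to leave the SMT search space.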

Left-to-right Decoding
Beam search is often used for NMT because the factorisation in Eq. 2 allows hypotheses to be built up from left to right. In contrast, our definition of the evidence in Eq. 1 contains a sum over the (unordered) set of all n-grams. However, we can rewrite our objective function in Eq. 3 in a way which makes it easy to use with beam search. Since every n-gram occurrence in $y$ ends at exactly one position $t$, the sum over n-grams can be rearranged into a sum over positions:
$$\hat{y} = \arg\max_{y_1^T} \sum_{t=1}^{T} \Big( \Theta_0 + \sum_{n=1}^{4} \Theta_n P(y_{t-n+1}^{t}|Y_e) + \lambda \log P_{NMT}(y_t|y_1^{t-1}, x) \Big) \tag{4}$$
for n-grams up to order 4 (terms with $n > t$ are omitted). This form lends itself naturally to beam search: at each time step, we add to the previous partial hypothesis score both the log-likelihood of the last token according to the NMT model, and the partial MBR gains from the current n-gram history. Note that this is similar to applying (the exponentiated scores of) an interpolated language model based on n-gram posteriors extracted from the SMT lattice. In the remainder of this paper, we will refer to decoding according to Eq. 4 as MBR-based NMT.
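The left-to-right scoring can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the SGNMT implementation: `step_gain`, `beam_search` and `toy_nmt_step` are hypothetical names, the NMT model is replaced by a hand-written table of position-dependent distributions, the posterior values are invented, and the end-of-sentence symbol is treated as an ordinary token.

```python
import math


def step_gain(prefix, token, posteriors, theta):
    """Partial MBR gain of appending token: theta[0] plus the weighted
    posteriors of the n-grams (up to order 4) ending at the new position."""
    seq = prefix + [token]
    gain = theta[0]
    max_order = min(len(theta) - 1, len(seq))
    for n in range(1, max_order + 1):
        gain += theta[n] * posteriors.get(tuple(seq[-n:]), 0.0)
    return gain


def beam_search(nmt_step, posteriors, theta, lam,
                beam_size=2, max_len=10, eos="</s>"):
    """Beam search over the sum of partial MBR gains and NMT log-probs."""
    beam = [(0.0, [])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beam:
            for token, log_p in nmt_step(prefix).items():
                gain = step_gain(prefix, token, posteriors, theta)
                candidates.append((score + gain + lam * log_p,
                                   prefix + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for cand in candidates[:beam_size]:
            if cand[1][-1] == eos:
                finished.append(cand)
            else:
                beam.append(cand)
        if not beam:
            break
    return max(finished + beam, key=lambda c: c[0])


def toy_nmt_step(prefix):
    """Stand-in for the NMT model: fixed distribution per position."""
    table = [
        {"the": 0.7, "a": 0.3},
        {"cat": 0.4, "dog": 0.6},
        {"sat": 0.9, "</s>": 0.1},
    ]
    dist = table[len(prefix)] if len(prefix) < len(table) else {"</s>": 1.0}
    return {tok: math.log(p) for tok, p in dist.items()}


# Hypothetical n-gram posteriors from an SMT lattice.
POSTERIORS = {
    ("the",): 0.9, ("cat",): 0.8, ("sat",): 0.6,
    ("the", "cat"): 0.7, ("cat", "sat"): 0.5,
    ("the", "cat", "sat"): 0.4,
}

hybrid_score, hybrid = beam_search(toy_nmt_step, POSTERIORS, [1.0] * 5, lam=1.0)
# Zero weights and an empty posterior table reduce this to plain NMT beam search.
_, nmt_only = beam_search(toy_nmt_step, {}, [0.0] * 5, lam=1.0)
print(hybrid)    # ['the', 'cat', 'sat', '</s>']
print(nmt_only)  # ['the', 'dog', 'sat', '</s>']
```

In this toy setting the NMT model alone prefers "dog", while the n-gram posteriors pull the hybrid decoder towards "cat". The accumulated score of the final hypothesis equals its evidence plus the weighted NMT log-probability computed on the complete sequence.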

Efficient n-gram Posterior Calculation
The risk computation in our approach is based on posterior probabilities $P(u|Y_e)$ for n-grams $u$. We use a framework based on n-gram mapping and path counting transducers to efficiently pre-compute all non-zero values of $P(u|Y_e)$. Complete enumeration of all n-grams in a lattice is usually feasible even for very large lattices. Additionally, for all these n-grams $u$, we smooth $P(u|Y_e)$ by mixing it with the uniform distribution to flatten the distribution and increase the offset to n-grams which are not in the lattice.
Table 1: En-De results by setup (SMT baseline: de Gispert et al., 2010, HiFST) on news-test2014, news-test2015 and news-test2016.
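To make the definition of the path posterior concrete, the following Python sketch computes n-gram posteriors by explicit enumeration over a small weighted n-best list. This is a deliberate simplification: it does not use the transducer-based framework and would not scale to full lattices. The function names are hypothetical, and the smoothing function is only one possible reading of the mixing step, with the mixing weight `alpha` and the constant `uniform_value` being assumptions of this example.

```python
import math
from collections import defaultdict


def ngram_posteriors(hyps, max_order=4):
    """hyps: list of (tokens, log_score) pairs standing in for lattice paths.

    Returns P(u|Y_e): the total posterior mass of all paths that contain
    the n-gram u at least once.
    """
    max_log = max(score for _, score in hyps)
    weights = [math.exp(score - max_log) for _, score in hyps]
    total = sum(weights)
    posteriors = defaultdict(float)
    for (tokens, _), weight in zip(hyps, weights):
        seen = set()  # each path counts once per n-gram
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                seen.add(tuple(tokens[i:i + n]))
        for u in seen:
            posteriors[u] += weight / total
    return dict(posteriors)


def smooth(posteriors, alpha=0.1, uniform_value=1.0):
    """Mix each non-zero posterior with a constant (assumed form)."""
    return {u: (1.0 - alpha) * p + alpha * uniform_value
            for u, p in posteriors.items()}


hyps = [
    (["the", "cat"], math.log(0.6)),
    (["the", "dog"], math.log(0.4)),
]
post = ngram_posteriors(hyps)
smoothed = smooth(post, alpha=0.1, uniform_value=1.0)
```

Only n-grams observed in the evidence space receive a smoothed value. All other n-grams keep an implicit posterior of zero, so smoothing widens the gap between n-grams inside and outside the lattice.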

Subword-unit-based NMT
Character-based or subword-unit-based NMT (Chitnis and DeNero, 2015; Sennrich et al., 2016; Chung et al., 2016; Luong and Manning, 2016; Costa-Jussà and Fonollosa, 2016; Ling et al., 2015) does not use isolated words as modelling units but applies a finer-grained tokenisation scheme. One of the main motivations for these approaches is to overcome the limited vocabulary in word-based NMT. We consider our hybrid system as an alternative way to fix NMT OOVs. However, our method can also be used with subword-unit-based NMT. In this work, we use byte pair encodings (Sennrich et al., 2016, BPE) to test combining word-based SMT with subword-unit-based NMT via both lattice rescoring and MBR. First, we construct a finite state transducer (FST) which maps word sequences to BPE sequences. Then, we convert the word-based SMT lattices to BPE-based lattices by composing them with the mapping transducer and projecting the output tape using standard OpenFst operations (Allauzen et al., 2007). The converted lattices are used for extracting n-gram posteriors as described in the previous sections. Note that even though the n-grams are on the BPE level, their posteriors are computed from word-level SMT translation scores.
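The effect of the word-to-BPE conversion can be illustrated on an explicit list of hypotheses. The Python sketch below stands in for the FST composition and output projection and is not how the lattices are converted in practice. The segmentation table, the `@@` continuation marker and the function names are assumptions made for this example.

```python
def words_to_bpe(words, segmentation):
    """Map a word sequence to subword units; unlisted words stay whole."""
    units = []
    for word in words:
        units.extend(segmentation.get(word, [word]))
    return units


def convert_hypotheses(hyps, segmentation):
    """Re-tokenise (tokens, score) pairs; the word-level scores are kept."""
    return [(words_to_bpe(tokens, segmentation), score)
            for tokens, score in hyps]


# Hypothetical BPE segmentation of a single rare word.
segmentation = {"unbelievable": ["un@@", "believ@@", "able"]}

word_hyps = [(["an", "unbelievable", "story"], -1.2)]
bpe_hyps = convert_hypotheses(word_hyps, segmentation)
```

The converted hypotheses carry the original word-level SMT scores, so n-gram posteriors extracted from them are defined over subword units but are still derived from word-level translation scores.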

Experimental Setup
We test our approach on English-German (En-De) and Japanese-English (Ja-En). For En-De, we use the WMT news-test2014 (the filtered version) as a development set, and keep news-test2015 and news-test2016 as test sets. For Ja-En, we use the ASPEC corpus (Nakazawa et al., 2016) to be strictly comparable to the evaluation done in the Workshop on Asian Translation (WAT).
The NMT systems are as described by Stahlberg et al. (2016b) using the Blocks and Theano frameworks (van Merriënboer et al., 2015;Bastien et al., 2012) with hyper-parameters as in  and a vocabulary size of 30k for Ja-En and 50k for En-De. We use the coverage penalty proposed by  to improve the length and coverage of translations. Our final ensembles combine five (En-De) to six (Ja-En) independently trained NMT systems.
Our En-De SMT baseline is a hierarchical system based on the HiFST package, which produces rich output lattices. The system uses rules ex[...] (Heafield et al., 2013).
Table 2: Ja-En results by setup (SMT baseline: Neubig, 2013, Travatar) on the dev and test sets.
For Ja-En, we use Travatar (Neubig, 2013), an open-source tree-to-string system. We provide the system with Japanese trees obtained using the Ckylark parser (Oda et al., 2015) and train it on high-quality alignments as recommended by Neubig and Duh (2014). This system, which reproduces the results of the best submission in WAT 2014 (Neubig, 2014), is used to create a 10k-best list of hypotheses, which we convert into determinised and minimised FSAs for our work. Our Ja-En NMT models are trained on the same 500k training samples as the Travatar baseline.
The parameter λ is tuned by optimising the BLEU score on the validation set, and we set $\Theta_i = 1$ ($i = 0, \ldots, 4$). Using the BOBYQA algorithm (Powell, 2009) or lattice MERT to optimise the $\Theta$-parameters independently did not yield improvements. The beam search implementation of the SGNMT decoder 4 (Stahlberg et al., 2016b) is used in all our experiments. We set the beam size to 20 for En-De and 12 for Ja-En.

Results
Our results are summarised in Tab. 1 and 2; instructions for reproducing our key results will be available upon publication at http://ucam-smt.github.io/sgnmt/html/tutorial.html. Our approach outperforms both single NMT and SMT baselines by up to 3.4 BLEU for En-De and 2.8 BLEU for Ja-En. Ensembling yields further gains across all test sets both for the NMT baselines and our MBR-based hybrid systems. We see substantial gains from our MBR-based method over lattice rescoring for both single and ensembled NMT on all test sets and language pairs except En-De news-test2016. On Ja-En, we report 26.7 BLEU (comparable to http://lotus.kuee.kyoto-u.ac.jp/WAT/evaluation/list.php?t=2), second to only one system (as of February 2017) that uses a number of techniques such as minimum risk training and a much larger vocabulary size, which could also be used in our framework.
4 http://ucam-smt.github.io/sgnmt/html/
Our word-level NMT baselines suffer from their limited vocabulary since we do not apply postprocessing techniques like UNK-replace (Luong et al., 2015). Therefore, NMT with subword units (BPE) consistently outperforms them by a large margin. Lattice rescoring and MBR yield large gains for both BPE-based and word-based NMT. However, the performance difference between BPE- and word-level NMT diminishes with lattice rescoring and MBR decoding: rescoring with NMT often performs on the same level for both words and subword units, and MBR-based NMT is often even better with a word-level NMT baseline. This indicates that subword units are often not necessary when the hybrid system has access to a large word-level vocabulary like the SMT vocabulary.
Note that the BPE lattice rescoring system is constrained to produce words in the output vocabulary of the syntactic SMT system and is prevented from inventing new target language words out of combinations of subword units. MBR imposes a soft version of such a constraint by biasing the BPE-based system towards words in the SMT search space.
The hypotheses produced by our MBR-based method often differ from the translations in the baseline systems. For example, 77.8% of the translations from our best MBR-based system on Ja-En cannot be found in the SMT 10k-best list, and 78.0% do not match the translation from the pure NMT 6-ensemble (up to NMT OOVs). This suggests that our MBR decoder is able to produce entirely new hypotheses, and that our method has a profound effect on the translations which goes beyond rescoring the SMT search space or fixing UNKs in the NMT output.
Tab. 1 also shows that rescoring is sensitive to the size of the n-best list or lattice: rescoring the entire lattice instead of a 100-best list often yields a gain of 1 full BLEU point. In order to test our MBR-based method on small lattices, we compiled n-best lists of varying sizes to lattices and extracted n-gram posteriors from the reduced lattices. Fig. 1 shows that the n-best list size has an impact on both methods. Rescoring a 10-best list already yields a large improvement of 1.2 BLEU. However, the hypotheses are still close to the SMT baseline. The MBR-based approach can make better use of small n-best lists as it does not suffer this restriction. MBR-based combination on a 10-best list performs on about the same level as rescoring a 10,000-best list, which demonstrates a practical advantage of MBR over rescoring.

Related Work
Combining the advantages of NMT and traditional SMT has received some attention in current research. A recent line of research attempts to integrate SMT-style translation tables into the NMT system (Zhang and Zong, 2016; Arthur et al., 2016; He et al., 2016). Other work interpolated NMT posteriors with word recommendations from SMT and jointly trained NMT together with a gating function which assigns the weight between SMT and NMT scores dynamically. Neubig et al. (2015) rescored n-best lists from a syntax-based SMT system with NMT. Stahlberg et al. (2016b) restricted the NMT search space to a Hiero lattice and reported improvements over n-best list rescoring. Stahlberg et al. (2016a) combined Hiero and NMT via a loose coupling scheme based on composition of finite state transducers and translation lattices which takes the edit distance between translations into account. Our approach is similar to the latter one since it allows the decoder to divert from SMT and to generate translations without derivations in the SMT system. This ability is crucial for NMT ensembles because SMT lattices are often too narrow for the NMT decoder (Stahlberg et al., 2016a). However, the method proposed by Stahlberg et al. (2016a) insists on a monotone alignment between SMT and NMT translations to calculate the edit distance. This can be computationally expensive and is not appropriate for MT, where word reorderings are common. The MBR decoding described here does not have this shortcoming.

Conclusion
This paper discussed a novel method for blending NMT with traditional SMT by biasing NMT scores towards translations with low Bayes-risk with respect to the SMT lattice. We reported significant improvements of the new method over lattice rescoring on Japanese-English and English-German and showed that it can make good use even of very small lattices and n-best lists.
In this work, we calculated the Bayes-risk over non-neural SMT lattices. In the future, we are planning to introduce neural models to the risk estimation while keeping the computational complexity under control, e.g. by using neural n-gram language models (Bengio et al., 2003;Vaswani et al., 2013) or approximations of NMT scores (Lecorvé and Motlicek, 2012;Liu et al., 2016) for n-gram posterior calculation.