SGNMT – A Flexible NMT Decoding Platform for Quick Prototyping of New Models and Search Strategies

This paper introduces SGNMT, our experimental platform for machine translation research. SGNMT provides a generic interface to neural and symbolic scoring modules (predictors) with left-to-right semantics, such as translation models like NMT, language models, translation lattices, n-best lists, or other kinds of scores and constraints. Predictors can be combined with other predictors to form complex decoding tasks. SGNMT implements a number of search strategies for traversing the space spanned by the predictors which are appropriate for different predictor constellations. Adding new predictors or decoding strategies is particularly easy, making it a very efficient tool for prototyping new research ideas. SGNMT is actively being used by students in the MPhil program in Machine Learning, Speech and Language Technology at the University of Cambridge for course work and theses, as well as for most of the research work in our group.


Introduction
We are developing an open source decoding framework called SGNMT, short for Syntactically Guided Neural Machine Translation. The software package supports a number of well-known frameworks, including TensorFlow (Abadi et al., 2016), OpenFST (Allauzen et al., 2007), Blocks/Theano (Bastien et al., 2012; van Merriënboer et al., 2015), and NPLM (Vaswani et al., 2013). The two central concepts in the SGNMT tool are predictors and decoders. Predictors are scoring modules which define scores over the target language vocabulary given the current internal predictor state, the history, the source sentence, and external side information. Scores from multiple, diverse predictors can be combined for use in decoding.
Decoders are search strategies which traverse the space spanned by the predictors. SGNMT provides implementations of common search tree traversal algorithms like beam search. Since decoders differ in runtime complexity and the kind of search errors they make, different decoders are appropriate for different predictor constellations.
The strict separation of scoring module and search strategy, and the decoupling of scoring modules from each other, make SGNMT a very flexible decoding tool for neural and symbolic models which is applicable beyond machine translation. SGNMT is based on the OpenFST-based Cambridge SMT system (Allauzen et al., 2014). Although the system is less than a year old, we have found it to be very flexible and easy for new researchers to adopt. Our group has already integrated SGNMT into most of its research work.
We also find that SGNMT is very well suited for teaching and student research projects. In the 2015-16 academic year, two students on the Cambridge MPhil in Machine Learning, Speech and Language Technology used SGNMT for their dissertation projects. The first project used SGNMT with OpenFST to apply subword models in SMT (Gao, 2016). The second project developed automatic music composition with LSTMs, where WFSAs were used to define the space of allowable chord progressions in 'Bach' chorales (Tomczak, 2016). The LSTM provides the 'creativity', and the WFSA enforces the constraints that the chorales must obey. This second project in particular demonstrates the versatility of the approach. In the current, 2016-17 academic year, SGNMT is being used heavily in two courses.

Tab. 1 (excerpt): Semantics of the predictor interface for the fst, n-gram, and word count predictors.
fst: The state is the ID of the current node in the FST. initialize(·) loads the FST from the file system and sets the predictor state to the FST start node. predict_next() explores all outgoing edges of the current node and uses the arc weights as scores. consume(token) traverses the outgoing edge from the current node labelled with token and updates the predictor state to the target node.
n-gram: The state is the current n-gram history. initialize(·) sets the current n-gram history to the begin-of-sentence symbol. predict_next() returns the LM scores for the current n-gram history. consume(token) adds token to the current n-gram history.
Word count: The predictor keeps no state; initialize(·) and consume(token) are empty. predict_next() returns a cost of 1 for all tokens except </s>.

Predictors
SGNMT emphasizes flexibility and extensibility by providing a common interface to a wide range of constraints and models used in MT research. This concept facilitates quick prototyping of new research ideas. Our platform aims to minimize the effort required for implementation; decoding speed is secondary, as optimized code for production systems can be produced once an idea has proven successful in the SGNMT framework. In SGNMT, scores are assigned to partial hypotheses via one or many predictors. A predictor usually has a single responsibility, as it represents a single model or type of constraint. Predictors need to implement the following methods:
• initialize(src_sentence) Initialize the predictor state using the source sentence.
• get_state() Get the internal predictor state.
• set_state(state) Set the internal predictor state.
• predict_next() Given the internal predictor state, produce the posterior over target tokens for the next position.
• consume(token) Update the internal predictor state by feeding back the token selected for the current position.
The structure of the predictor state and the implementations of these methods differ substantially between predictors. Tab. 2 lists all predictors which are currently implemented. Tab. 1 summarizes the semantics of this interface for three very common predictors: the neural machine translation (NMT) predictor, the (deterministic) finite state transducer (FST) predictor for lattice rescoring, and the n-gram predictor for applying n-gram language models. We also include two examples (word count and UNK count) which do not have a natural left-to-right semantics but can still be represented as predictors.
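The interface above can be sketched in Python as follows. The base class and the WordCountPredictor are simplified illustrations of the contract, not SGNMT's actual implementation; in particular, representing the posterior as a plain dict from tokens to log scores is an assumption made for brevity.

```python
EOS = "</s>"

class Predictor:
    """Minimal sketch of the predictor interface."""
    def initialize(self, src_sentence):
        raise NotImplementedError
    def get_state(self):
        raise NotImplementedError
    def set_state(self, state):
        raise NotImplementedError
    def predict_next(self):
        """Return a dict mapping target tokens to (log) scores."""
        raise NotImplementedError
    def consume(self, token):
        raise NotImplementedError

class WordCountPredictor(Predictor):
    """Stateless predictor: cost of 1 for every token except </s> (cf. Tab. 1)."""
    def __init__(self, vocab):
        self.vocab = vocab
    def initialize(self, src_sentence):
        pass                      # no state to set up
    def get_state(self):
        return None               # nothing to save
    def set_state(self, state):
        pass
    def predict_next(self):
        return {t: (0.0 if t == EOS else -1.0) for t in self.vocab}
    def consume(self, token):
        pass                      # nothing to update
```

A predictor like this can be added to a configuration to bias decoding towards shorter or longer output, without touching any search strategy code.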

Example Predictor Constellations
SGNMT allows combining any number of predictors, and even multiple instances of the same predictor type. When multiple predictors are used, we combine their scores in a linear model. The following list illustrates that various interesting decoding tasks can be formulated as predictor combinations.
• nmt: A single NMT predictor represents pure NMT decoding.
• nmt,nmt,nmt: Using multiple NMT predictors is a natural way to represent ensemble decoding (Hansen and Salamon, 1990; Sutskever et al., 2014) in our framework.
• fst,nmt: NMT decoding constrained to an FST. This can be used for neural lattice rescoring or other kinds of constraints, for example in the context of source-side simplification in MT or chord progressions in 'Bach' chorales (Tomczak, 2016). The fst predictor can also be used to restrict the output of character-based or subword-unit-based NMT to a large word-level vocabulary encoded as an FSA.
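The linear combination of predictor scores in one decoding step can be sketched as below. The function operates on the dicts returned by each predictor's predict_next(); restricting the combination to tokens scored by every predictor is a simplification of this sketch, not necessarily how SGNMT handles partially scored tokens.

```python
def combine_scores(posteriors, weights):
    """Weighted linear combination of predictor posteriors for one step.

    posteriors: list of dicts (token -> log score), one per predictor.
    weights:    list of floats, one per predictor.
    Only tokens scored by every predictor receive a combined score.
    """
    tokens = set(posteriors[0])
    for post in posteriors[1:]:
        tokens &= set(post)
    return {t: sum(w * post[t] for w, post in zip(weights, posteriors))
            for t in tokens}
```

With n identical predictors and uniform weights this reduces to score averaging, i.e. the nmt,nmt,nmt ensemble above.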

Decoders
Decoders are algorithms to search for the highest scoring hypothesis. The list of predictors determines how (partial) hypotheses are scored by implementing the methods initialize(·), get_state(), set_state(·), predict_next(), and consume(·). The Decoder class implements versions of these methods which apply to all predictors in the list.
initialize(·) is always called prior to decoding a new sentence. Many popular search strategies can be described via the remaining methods get_state(), set_state(·), predict_next(), and consume(·). Algs. 1 and 2 show how to define greedy and beam decoding in this way. Tab. 3 contains a list of currently implemented decoders. The UML diagram in Fig. 1 illustrates the relation between decoders and predictors.

Algorithm 1 Greedy decoding
1: initialize(src_sentence)
2: h ← ε
3: repeat
4:   P ← predict_next()
5:   t ← argmax P(t)
6:   h ← h · t
7:   consume(t)
8: until t = </s>
9: return h

NMT batch decoding: The flexibility of the predictor framework comes at a cost in decoding time. SGNMT provides two ways of speeding up pure NMT decoding, especially on the GPU. The vanilla decoding strategy exposes the beam search implementation in Blocks (van Merriënboer et al., 2015), which processes all active hypotheses in the beam in parallel. We also implemented a beam decoder version which decodes multiple sentences at once (batch decoding) rather than sequentially. Batch decoding is potentially more efficient since larger batches can make better use of GPU parallelism. The key concepts of our batch decoder implementation are:
• We use a scheduler running on a separate CPU thread to construct large batches of computation (GPU jobs) from multiple sentences and feed them to the jobs queue.
• The GPU is operated by a single thread which communicates with the CPU scheduler thread via queues containing jobs. This thread only retrieves jobs from the jobs queue, computes them, and puts the results in the job results queue, minimizing the down-time of GPU computation.
• Yet another CPU thread is responsible for processing the results computed on the GPU in the job results queue, e.g. by getting the n-best words from the posteriors. Processed jobs are sent back to the CPU scheduler, where they are reassembled into new jobs.
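The three-thread layout described above can be sketched with standard Python queues. Everything here is a hypothetical illustration: the forward pass is a placeholder for the actual batched NMT computation, and the scheduler and result-processing threads are collapsed into the main thread for brevity.

```python
import queue
import threading

jobs = queue.Queue()      # batches assembled by the CPU scheduler
results = queue.Queue()   # posteriors computed on the GPU

def gpu_worker(forward_pass):
    """Single thread operating the GPU: fetch a job, compute, hand back."""
    while True:
        job = jobs.get()
        if job is None:               # poison pill: shut the worker down
            break
        results.put(forward_pass(job))

def fake_forward_pass(batch):
    # Stand-in for scoring a batch of partial hypotheses on the GPU.
    return [len(hypo) for hypo in batch]

worker = threading.Thread(target=gpu_worker, args=(fake_forward_pass,))
worker.start()
jobs.put([["<s>", "the"], ["<s>"]])   # one batch built from two sentences
posteriors = results.get()
jobs.put(None)
worker.join()
```

The queues decouple batch construction from GPU computation, which is what allows the scheduler to keep the GPU busy with large batches.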
This decoder is able to translate the WMT English-French test sets news-test2012 to news-test2014 on a Titan X GPU at 911.6 words per second with the word-based NMT model described in . This decoding speed seems to be slightly faster than sequential decoding with high-performance NMT decoders like Marian-NMT (Junczys-Dowmunt et al., 2016), with reported decoding speeds of 865 words per second. However, batch decoding with Marian-NMT is much faster, reaching over 4,500 words per second. We attribute these differences mainly to the limited multi-threading support and performance of Python, especially when using external libraries, as opposed to the highly optimized C++ code in Marian-NMT. We did not push for even faster decoding, as speed is not a major design goal of SGNMT. Note that batch decoding bypasses the predictor framework and can only be used for pure NMT decoding.
Ensembling with models at multiple tokenization levels: SGNMT allows masking predictors with alternative sets of modelling units. The conversion between the tokenization schemes of different predictors is defined with FSTs. This makes it possible to decode by combining scores from a subword-unit (BPE) based NMT model (Sennrich et al., 2016) and a word-based NMT model with a character-based NMT model, masking the BPE-based and word-based NMT predictors with FSTs which transduce character sequences to BPE or word sequences. Masking is transparent to the decoding strategy, as predictors are replaced by a special wrapper (fsttok) that uses the masking FST to translate predict_next() and consume() calls to (a series of) predictor calls with alternative tokens. The syncbeam variation of beam search compares competing hypotheses only after consuming a special word boundary symbol rather than after each token. This allows combining scores at the word level even when using models with multiple levels of tokenization. Joint decoding with different tokenization schemes has the potential to combine the benefits of the different schemes: character- and BPE-based models are able to address rare words, while word-based NMT can model long-range dependencies more efficiently.
System-level combination: We showed in Sec. 2.1 how to formulate NMT ensembling as a set of NMT predictors. Ensembling averages the individual model scores in each decoding step. Alternatively, system-level combination decodes the entire sentence with each model separately and selects the best-scoring complete hypothesis over all models. In our experiments, system-level combination is not as effective as ensembling but still leads to moderate gains for pure NMT. However, a trivial implementation which selects the best translation in a postprocessing step after separate decoding runs is slow. The sepbeam decoding strategy reduces the runtime of system-level combination to the single-system level. The strategy applies only one predictor, rather than a linear combination of all predictors, to expand a hypothesis. The single predictor is linked by the parent hypothesis. The initial stack in sepbeam contains hypotheses for each predictor (i.e. system) rather than only one as in normal beam search. We report a moderate gain of 0.5 BLEU over a single system on the Japanese-English ASPEC test set (Nakazawa et al., 2016) by combining three BPE-based NMT models from Stahlberg et al. (2017) using the sepbeam decoder.
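The trivial (slow) formulation of system-level combination mentioned above amounts to a postprocessing step over independently decoded outputs. A minimal sketch, assuming each system contributes one (translation, log score) pair:

```python
def system_level_combine(system_outputs):
    """Pick the best-scoring complete hypothesis across systems.

    system_outputs: list of (translation, score) pairs, one per system,
    where higher (log) scores are better. This is the naive postprocessing
    formulation; the sepbeam strategy achieves the same selection without
    paying for separate full decoding runs.
    """
    best_translation, _ = max(system_outputs, key=lambda pair: pair[1])
    return best_translation
```

Contrast this with ensembling, which mixes model scores inside every decoding step rather than choosing between finished hypotheses.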
Iterative beam search: Normal beam search is difficult to use in a time-constrained setting since the runtime depends on the target sentence length, which is a priori unknown, making it hard to choose the right beam size beforehand. The bucket search algorithm sidesteps the problem of setting the beam size by repeatedly performing small beam search passes until a fixed computational budget is exhausted. Bucket search produces an initial hypothesis very quickly and keeps the partial hypotheses for each length in buckets. Subsequent beam search passes refine the initial hypothesis by iteratively updating these buckets. Our initial experiments suggest that bucket search often performs on a similar level to standard beam search, with the benefit of supporting hard time constraints. Unlike beam search, bucket search lends itself to risk-free (i.e. admissible) pruning, since all partial hypotheses worse than the current best complete hypothesis can be discarded.
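The admissibility of this pruning rule follows from the scores being log probabilities: extending a hypothesis only adds non-positive terms, so a partial hypothesis that already scores no better than the best complete one can never overtake it. A small sketch of the pruning step (the function name and list-of-pairs representation are illustrative, not SGNMT's internals):

```python
def admissible_prune(hypotheses, best_complete_score):
    """Risk-free pruning for bucket search (sketch).

    hypotheses: list of (prefix, log_score) pairs for partial hypotheses.
    Keeps only those that could still beat the best complete hypothesis,
    which is safe because extensions can only lower a log-probability score.
    """
    return [(prefix, score) for prefix, score in hypotheses
            if score > best_complete_score]
```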

Conclusion
This paper presented our SGNMT platform for prototyping new approaches to MT which involve both neural and symbolic models. SGNMT supports a number of different models and constraints via a common interface (predictors), and various search strategies (decoders). Furthermore, SGNMT focuses on minimizing the implementation effort for adding new predictors and decoders by decoupling scoring modules from each other and from the search algorithm. SGNMT is actively being used for teaching and research and we welcome contributions to its development, for example by implementing new predictors for using models trained with other frameworks and tools.