Marian: Fast Neural Machine Translation in C++

We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs. Marian is written entirely in C++. We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed.


Introduction
In this paper, we present Marian, an efficient Neural Machine Translation framework written in pure C++ with minimal dependencies. It has mainly been developed at the Adam Mickiewicz University in Poznań and at the University of Edinburgh. It is currently being deployed in multiple European projects and is the main translation and training engine behind the neural MT launch at the World Intellectual Property Organization.
In the evolving ecosystem of open-source NMT toolkits, Marian occupies its own niche, best characterized by two aspects:
• It is written completely in C++11 and intentionally does not provide Python bindings; model code and meta-algorithms are meant to be implemented in efficient C++ code.
• It is self-contained with its own back end, which provides reverse-mode automatic differentiation based on dynamic graphs.
Marian has minimal dependencies (only Boost and CUDA or a BLAS library) and enables barrier-free optimization at all levels: meta-algorithms such as MPI-based multi-node training, efficient batched beam search, compact implementations of new models, custom operators, and custom GPU kernels. Intel has contributed and is optimizing a CPU backend.
Marian grew out of a C++ re-implementation of Nematus (Sennrich et al., 2017b), and still maintains binary compatibility for common models. Hence, we will compare speed mostly against Nematus. OpenNMT (Klein et al., 2017), perhaps one of the most popular toolkits, has been reported to have training speed competitive with Nematus.

Design Outline
We will very briefly discuss the design of Marian. Technical details of the implementations will be provided in later work.

Custom Auto-Differentiation Engine
The deep-learning back-end included in Marian is based on reverse-mode auto-differentiation with dynamic computation graphs and, among the established machine learning platforms, is most similar in design to DyNet (Neubig et al., 2017). While the back-end could be used for tasks other than machine translation, we choose to optimize specifically for this and similar use cases. Optimizations at this level include, for instance, efficient implementations of various fused RNN cells, attention mechanisms, or an atomic layer-normalization (Ba et al., 2016) operator.
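To make reverse-mode differentiation over a dynamic graph concrete, here is a minimal scalar sketch. It is not Marian's back-end: the Tape type, its methods and the power helper are hypothetical names introduced only for illustration. The point it shows is that the graph is recorded while ordinary C++ code runs, so its shape can depend on run-time values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical toy tape (illustration only, not Marian's back-end).
// Every operation appends one node, so the graph is built dynamically
// while ordinary C++ code executes.
struct Tape {
  std::vector<double> value;                         // forward results
  std::vector<double> grad;                          // adjoints
  std::vector<std::function<void(Tape&)>> backprop;  // local backward rules

  // Creates an input node holding v.
  std::size_t leaf(double v) {
    value.push_back(v);
    grad.push_back(0.0);
    backprop.push_back([](Tape&) {});
    return value.size() - 1;
  }

  // Records out = a * b together with its backward rule.
  std::size_t mul(std::size_t a, std::size_t b) {
    assert(a < value.size() && b < value.size());
    std::size_t out = leaf(value[a] * value[b]);
    backprop[out] = [a, b, out](Tape& t) {
      t.grad[a] += t.value[b] * t.grad[out];
      t.grad[b] += t.value[a] * t.grad[out];
    };
    return out;
  }

  // Reverse sweep: visits nodes from the root back to the first node.
  void backward(std::size_t root) {
    assert(root < value.size());
    std::fill(grad.begin(), grad.end(), 0.0);
    grad[root] = 1.0;
    for (std::size_t i = root + 1; i-- > 0;) backprop[i](*this);
  }
};

// The graph depth depends on the run-time value n (n >= 1): a plain loop.
std::size_t power(Tape& t, std::size_t x, int n) {
  std::size_t y = x;
  for (int i = 1; i < n; ++i) y = t.mul(y, x);
  return y;
}
```

With x = 3 and n = 3 the sketch records two multiplications; after the reverse sweep, the gradient stored for x is 3x² = 27.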

Extensible Encoder-Decoder Framework
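Inspired by the stateful feature function framework in Moses (Koehn et al., 2007), we implement encoders and decoders as classes with a (strongly simplified) interface consisting of Encoder::build, Decoder::startState and Decoder::step. A Bahdanau-style encoder-decoder model would implement the entire encoder inside Encoder::build based on the content of the batch and place the resulting encoder context inside the EncoderState object.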
Decoder::startState receives a list of EncoderState (one in the case of the Bahdanau model, multiple for multi-source models, none for language models) and creates the initial DecoderState.
The Decoder::step function consumes the target part of a batch to produce the output logits of a model. The time dimension is either expanded by broadcasting of single tensors or by looping over the individual time steps (for instance in the case of RNNs). Loops and other control structures are just the standard built-in C++ operations. The same function can then be used to expand over all given time steps at once during training and scoring, or step by step during translation. The current hypothesis state (e.g. RNN vectors) and the current logits are placed in the next DecoderState object.
Decoder states are used mostly during translation to select the next set of translation hypotheses. Complex encoder-decoder models can derive from DecoderState to implement non-standard selection behavior; for instance, hard-attention models need to increase attention indices based on the top-scoring hypotheses.
This framework makes it possible to combine different encoders and decoders (e.g. an RNN-based encoder with a Transformer decoder) and reduces implementation effort. In most cases it is enough to implement a single inference step in order to train, score and translate with a new model.
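The class structure described in this section can be sketched as follows. Only the names Encoder::build, Decoder::startState, Decoder::step, EncoderState and DecoderState come from the text; the member layout, the Batch type and the two toy implementations are assumptions made for illustration and do not reproduce Marian's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in types (illustration only, not Marian's classes).
struct Batch {
  std::vector<int> source;  // source token ids
  std::vector<int> target;  // target token ids
};
struct EncoderState { std::vector<float> context; };
struct DecoderState {
  std::vector<float> hidden;  // hypothesis state, e.g. RNN vectors
  std::vector<float> logits;  // output scores of the latest step
};

class Encoder {
public:
  virtual ~Encoder() {}
  virtual EncoderState build(const Batch& batch) = 0;
};

class Decoder {
public:
  virtual ~Decoder() {}
  // Zero states for a language model, one for a single-source model,
  // several for multi-source models.
  virtual DecoderState startState(const std::vector<EncoderState>& encoderStates) = 0;
  virtual DecoderState step(const DecoderState& state, const Batch& batch) = 0;
};

// Toy encoder: the "context" is just the source ids as floats.
class ToyEncoder : public Encoder {
public:
  EncoderState build(const Batch& batch) override {
    EncoderState state;
    for (int id : batch.source) state.context.push_back(static_cast<float>(id));
    return state;
  }
};

// Toy decoder: the initial state sums all encoder contexts; a step emits
// one score per target position.
class ToyDecoder : public Decoder {
public:
  DecoderState startState(const std::vector<EncoderState>& encoderStates) override {
    float sum = 0.0f;
    for (const EncoderState& es : encoderStates)
      for (float c : es.context) sum += c;
    DecoderState start;
    start.hidden.push_back(sum);
    return start;
  }
  DecoderState step(const DecoderState& state, const Batch& batch) override {
    assert(!state.hidden.empty());
    DecoderState next;
    next.hidden = state.hidden;
    for (int id : batch.target)
      next.logits.push_back(state.hidden[0] + static_cast<float>(id));
    return next;
  }
};
```

Because the decoder only sees a list of encoder states, the same ToyDecoder works with no encoder (language model), one encoder, or several, which mirrors the combination of encoders and decoders described above.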

Efficient Meta-algorithms
On top of the auto-diff engine and encoder-decoder framework we implemented many efficient meta-algorithms. These include multi-device (GPU or CPU) training, scoring and batched beam search, ensembling of heterogeneous models (e.g. deep RNN models and Transformer or language models), multi-node training and more.
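One common way to ensemble heterogeneous models at decode time is to combine their per-token log-probabilities with a weighted sum before the search picks candidates. The sketch below assumes this scheme and assumes all models share one output vocabulary; the function names and weights are hypothetical and are not taken from Marian's code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (illustration only): weighted sum of the
// log-probabilities that several models assign to the same vocabulary.
std::vector<double> ensembleScores(const std::vector<std::vector<double>>& logProbs,
                                   const std::vector<double>& weights) {
  assert(!logProbs.empty() && logProbs.size() == weights.size());
  std::vector<double> combined(logProbs[0].size(), 0.0);
  for (std::size_t m = 0; m < logProbs.size(); ++m) {
    assert(logProbs[m].size() == combined.size());  // shared vocabulary
    for (std::size_t v = 0; v < combined.size(); ++v)
      combined[v] += weights[m] * logProbs[m][v];
  }
  return combined;
}

// Index of the best-scoring vocabulary entry (first one on ties).
std::size_t argmax(const std::vector<double>& scores) {
  assert(!scores.empty());
  std::size_t best = 0;
  for (std::size_t i = 1; i < scores.size(); ++i)
    if (scores[i] > scores[best]) best = i;
  return best;
}
```

Since only the output scores are combined, the models behind them may have entirely different architectures, which is what makes ensembling an RNN model with a Transformer or a language model possible in this scheme.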

Case Studies
In this section we will illustrate how we used the Marian toolkit to facilitate our own research across several NLP problems. Each subsection is meant as a showcase for different components of the toolkit and demonstrates the maturity and flexibility of the toolkit. Unless stated otherwise, all mentioned features are included in the Marian toolkit.

Improving over WMT2017 Systems
Sennrich et al. (2017a) proposed the highest scoring NMT system in terms of BLEU during the WMT 2017 shared task on English-German news translation (Bojar et al., 2017a), trained with the Nematus toolkit (Sennrich et al., 2017b). In this section, we demonstrate that we can replicate and slightly outperform these results with an identical model architecture implemented in Marian and improve on the recipe with a Transformer-style (Vaswani et al., 2017) model.

Deep Transition RNN Architecture
The model architecture in Sennrich et al. (2017a) is a sequence-to-sequence model with single-layer RNNs in both the encoder and the decoder. The RNN in the encoder is bi-directional. Depth is achieved by building stacked GRU-blocks, resulting in very tall RNN cells for every recurrent step (deep transitions). The encoder consists of four GRU-blocks per cell, the decoder of eight GRU-blocks with an attention mechanism placed between the first and second block. As in Sennrich et al. (2017a), the embedding size is 512 and the RNN state size is 1024. We use layer-normalization (Ba et al., 2016) and variational drop-out with p = 0.1 (Gal and Ghahramani, 2016) inside GRU-blocks and attention.
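The idea of a deep-transition cell can be sketched with scalar GRU blocks. This is an illustration under stated assumptions, not the model used in the experiments: the GruBlock type and function names are hypothetical, the state is a single number instead of a 1024-dimensional vector, layer normalization, drop-out and the decoder's attention are omitted, and the sketch assumes that only the first block in a cell receives the input while the later blocks are driven by the state alone.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical scalar GRU block (illustration only). Real cells use
// matrices, layer normalization and drop-out, all omitted here.
struct GruBlock {
  double wz, uz, bz;  // update gate: input weight, state weight, bias
  double wr, ur, br;  // reset gate
  double wc, uc, bc;  // candidate state
};

double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// One GRU transition from state h given input x. The update convention
// h' = (1 - z) * c + z * h is one of two common variants.
double gruStep(const GruBlock& g, double x, double h) {
  double z = sigmoid(g.wz * x + g.uz * h + g.bz);
  double r = sigmoid(g.wr * x + g.ur * h + g.br);
  double c = std::tanh(g.wc * x + r * (g.uc * h) + g.bc);
  return (1.0 - z) * c + z * h;
}

// Deep transition: within ONE time step the state passes through every
// block in turn. Only the first block sees the input x; the remaining
// transition blocks get a zero input (an assumption of this sketch).
double deepTransitionStep(const std::vector<GruBlock>& blocks, double x, double h) {
  assert(!blocks.empty());
  for (std::size_t i = 0; i < blocks.size(); ++i)
    h = gruStep(blocks[i], i == 0 ? x : 0.0, h);
  return h;
}
```

An encoder cell as described above would pass a vector of four blocks to deepTransitionStep once per source position; the recurrence over time stays single-layer, and all the depth sits inside the step.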

Transformer Architecture
We very closely follow the architecture described in Vaswani et al. (2017) and their "base" model.

Training Recipe
Modeled after the description from Sennrich et al. (2017a), we perform the following steps: [...] first 16,000 iterations, starting with 0 until the base learning rate is reached.
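The surviving text states only that the learning rate starts at 0 and reaches the base rate after the first 16,000 iterations. A minimal sketch of such a warm-up, assuming a linear ramp (the shape is not given here) and leaving out whatever schedule follows the warm-up, could look like this; warmupRate and its parameters are hypothetical names.

```cpp
#include <cassert>

// Hypothetical helper (illustration only): linear warm-up from 0 to
// baseRate over warmupSteps updates, constant afterwards.
double warmupRate(double baseRate, long step, long warmupSteps) {
  assert(baseRate >= 0.0 && step >= 0 && warmupSteps > 0);
  if (step >= warmupSteps) return baseRate;
  return baseRate * static_cast<double>(step) / static_cast<double>(warmupSteps);
}
```

With warmupSteps = 16000, the rate is half the base rate at update 8,000 and equals the base rate from update 16,000 onwards.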

Performance and Results
Quality. In terms of BLEU (Table 1), we match the original Nematus models from Sennrich et al. (2017a). Replacing the deep-transition RNN model with the Transformer model results in a significant improvement of 1.2 BLEU on the WMT2017 test set.
Training speed. In Figure 1 we demonstrate the training speed as thousands of source tokens per second for the models trained in this recipe. All model types benefit from using more GPUs. Scaling is not linear (dashed lines), but close. The tokens-per-second rate (w/s) for Nematus on the same data on a single GPU is about 2800 w/s for the shallow model. Nematus does not have multi-GPU training. Marian achieves about 4 times faster training on a single GPU and about 30 times faster training on 8 GPUs for identical models.
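The "not linear, but close" claim can be checked against the rounded speed-ups quoted above. The helper below is a hypothetical illustration; because the inputs are "about" figures, the result is only approximate.

```cpp
#include <cassert>

// Scaling efficiency: measured multi-GPU speed-up divided by the ideal
// (linear) speed-up. Both speed-ups are relative to the same baseline.
double scalingEfficiency(double singleGpuSpeedup, double multiGpuSpeedup, int gpus) {
  assert(singleGpuSpeedup > 0.0 && gpus > 0);
  return multiGpuSpeedup / (singleGpuSpeedup * gpus);
}
```

For a 4-fold speed-up on one GPU and a 30-fold speed-up on 8 GPUs, this gives 30 / (4 × 8) ≈ 0.94, i.e. roughly 94% of linear scaling.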
Translation speed. The back-translation of 10M sentences with a shallow model takes about four hours on 8 GPUs at a speed of about 15,850 source tokens per second at a beam size of 5 and a batch size of 64. Batches of sentences are translated in parallel on multiple GPUs.
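As a consistency check on these figures, the following sketch converts the quoted rate and duration into a token total and an implied average sentence length. The function names are hypothetical, and the inputs are the rounded values from the text.

```cpp
#include <cassert>

// Total tokens processed at a constant rate over a number of hours.
double tokensProcessed(double tokensPerSecond, double hours) {
  assert(tokensPerSecond >= 0.0 && hours >= 0.0);
  return tokensPerSecond * hours * 3600.0;
}

// Average sentence length implied by a token total.
double tokensPerSentence(double totalTokens, double sentences) {
  assert(sentences > 0.0);
  return totalTokens / sentences;
}
```

At 15,850 tokens per second, four hours correspond to about 228 million source tokens, or roughly 22.8 tokens per sentence for 10M sentences. That is the same order of magnitude as the 76,501 / 3,004 ≈ 25.5 BPE tokens per sentence of newstest-2017.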
In Table 2 we report the total number of seconds to translate newstest-2017 (3,004 sentences, 76,501 source BPE tokens) on a single GPU for different batch sizes. We omit model load time (usually below 10s). Beam size is 5.

State-of-the-art in Neural Automatic Post-Editing
In our submission to the Automatic Post-Editing shared task at WMT-2017 (Bojar et al., 2017b) and follow-up work (Junczys-Dowmunt and Grundkiewicz, 2017a,b), we explore multiple neural architectures adapted for the task of automatic post-editing of machine translation output as implementations in Marian. We focus on neural end-to-end models that combine both inputs mt (raw MT output) and src (source language input) in a single neural architecture, modeling {mt, src} → pe directly, where pe is post-edited corrected output. These models are based on multi-source neural translation models introduced by Zoph and Knight (2016). Furthermore, we investigate the effect of hard-attention models or neural transductors (Aharoni and Goldberg, 2016), which seem to be well-suited for monolingual tasks, as well as combinations of both ideas. Dual-attention models that are combined with hard attention remain competitive despite applying fewer changes to the input.
The encoder-decoder framework described in section 2.2 allowed us to integrate dual encoders and hard-attention without changes to beam-search or ensembling mechanisms. The dual-attention mechanism over two encoders allowed the model to recover missing words that would not be recognized based on raw MT output alone, see Figure 2.
Our final system for the APE shared task scored second-best according to automatic metrics and best based on human evaluation.

State-of-the-art in Neural Grammatical Error Correction
In Junczys-Dowmunt and Grundkiewicz (2018), we use Marian for research on transferring methods from low-resource NMT to automatic grammatical error correction (GEC). Previously, neural methods in GEC did not reach state-of-the-art results compared to phrase-based SMT baselines. We successfully adapt several low-resource MT methods for GEC.
We propose a set of model-independent methods for neural GEC that can be easily applied in most GEC settings. The combined effects of these methods result in better than state-of-the-art neural GEC models that outperform previously best neural GEC systems by more than 8% M² on the CoNLL-2014 benchmark and more than 4.5% on the JFLEG test set. Non-neural state-of-the-art systems are matched on the CoNLL-2014 benchmark and outperformed by 2% on JFLEG.
Figure 3 illustrates these results on the CoNLL-2014 test set.To produce this graph, 40 GEC models (four per entry) and 24 language models (one per GEC model with pre-training) have been trained.The language models follow the decoder architecture and can be used for transfer learning, weighted decode-time ensembling and re-ranking.This also includes a Transformer-style language model with self-attention layers.
Proposed methods include extensions to Marian, such as source-side noise, a GEC-specific weighted training objective, usage of pre-trained embeddings, transfer learning with pre-trained language models, decode-time ensembling of independently trained GEC models and language models, and various deep architectures.

Future Work and Conclusions
We introduced Marian, a self-contained neural machine translation toolkit written in C++ with a focus on efficiency and research. Future work on Marian's back-end will look at faster CPU-bound computation, auto-batching mechanisms and automatic kernel fusion. On the front-end side we hope to keep up with future state-of-the-art models.

Table 1: BLEU results for our replication of the UEdin WMT17 system for the en-de news translation task. We reproduced most steps and replaced the deep RNN model with a Transformer model.

Figure 2: Example for error recovery based on dual attention. The missing word "Satz" could only be recovered based on the original source (marked in red) as it was dropped in the raw MT output.

Figure 3: Comparison on the CoNLL-2014 test set for investigated methods.