Nematus: a Toolkit for Neural Machine Translation

We present Nematus, a toolkit for Neural Machine Translation. The toolkit prioritizes high translation accuracy, usability, and extensibility. Nematus has been used to build top-performing submissions to shared translation tasks at WMT and IWSLT, and to train systems for production environments.


Introduction
Neural Machine Translation (NMT) (Bahdanau et al., 2015; Sutskever et al., 2014) has recently established itself as a new state of the art in machine translation. We present Nematus,1 a new toolkit for Neural Machine Translation.
Nematus has its roots in the dl4mt-tutorial.2 We found the codebase of the tutorial to be compact, simple and easy to extend, while also producing high translation quality. These characteristics make it a good starting point for research in NMT. Nematus has been extended to include new functionality based on recent research, and has been used to build top-performing systems for last year's shared translation tasks at WMT and IWSLT (Junczys-Dowmunt and Birch, 2016).
Nematus is implemented in Python, and based on the Theano framework (Theano Development Team, 2016). It implements an attentional encoder-decoder architecture similar to Bahdanau et al. (2015). Our neural network architecture differs in some aspects from theirs, and we will discuss these differences in more detail. We will also describe additional functionality, aimed at enhancing usability and performance, which has been implemented in Nematus.

1 available at https://github.com/rsennrich/nematus
2 https://github.com/nyu-dl/dl4mt-tutorial


Neural Network Architecture

Nematus implements an attentional encoder-decoder architecture similar to the one described by Bahdanau et al. (2015), but with several implementation differences. The main differences are as follows:

• We initialize the decoder hidden state with the mean of the source annotations, rather than with the annotation at the last position of the encoder backward RNN.
• We implement a novel conditional GRU with attention.
• In the decoder, we use a feedforward hidden layer with tanh non-linearity rather than a maxout before the softmax layer.
• In both encoder and decoder word embedding layers, we do not use additional biases.
• Compared to the Look, Generate, Update decoder phases in Bahdanau et al. (2015), we implement Look, Update, Generate, which drastically simplifies the decoder implementation (see Table 1).
• Instead of a single word embedding at each source position, our input representation allows multiple features (or "factors") at each time step, with the final embedding being the concatenation of the embeddings of each feature.
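The factored-input idea in the last bullet can be sketched with numpy: each input feature (e.g. surface word plus a linguistic tag) has its own embedding matrix, and the per-feature embeddings are concatenated into one input vector per time step. All matrix names, vocabulary sizes and dimensions below are illustrative stand-ins, not actual Nematus internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative vocabulary sizes and embedding dims for two factors
# (e.g. surface form and a linguistic tag); not actual Nematus settings.
factor_vocab_sizes = [100, 10]
factor_dims = [8, 4]

# One embedding matrix per factor.
embeddings = [rng.standard_normal((v, d))
              for v, d in zip(factor_vocab_sizes, factor_dims)]

def embed_factored(token_factors):
    """token_factors: list of factor indices for one source position.
    Returns the concatenation of the per-factor embedding vectors."""
    return np.concatenate([E[f] for E, f in zip(embeddings, token_factors)])

x = embed_factored([42, 3])
print(x.shape)  # (12,) = 8 + 4
```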
We will here describe some of the differences in more detail.

Given a source sequence $(x_1, \ldots, x_{T_x})$ of length $T_x$ and a target sequence $(y_1, \ldots, y_{T_y})$ of length $T_y$, let $h_i$ be the annotation of the source symbol at position $i$, obtained by concatenating the forward and backward encoder RNN hidden states, $h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$, and let $s_j$ be the decoder hidden state at position $j$.
decoder initialization: Bahdanau et al. (2015) initialize the decoder hidden state $s$ with the last backward encoder state:

$s_0 = \tanh\left(W_{init} \overleftarrow{h}_1\right)$

with $W_{init}$ as trained parameters.3 We use the average annotation instead:

$s_0 = \tanh\left(W_{init} \frac{\sum_{i=1}^{T_x} h_i}{T_x}\right)$

conditional GRU with attention: Nematus implements a novel conditional GRU with attention, cGRU_att. A cGRU_att uses its previous hidden state $s_{j-1}$, the whole set of source annotations $C = \{h_1, \ldots, h_{T_x}\}$ and the previously decoded symbol $y_{j-1}$ in order to update its hidden state $s_j$, which is further used to decode the symbol $y_j$ at position $j$:

$s_j = \mathrm{cGRU}_{att}(s_{j-1}, y_{j-1}, C)$

Our conditional GRU layer with attention mechanism, cGRU_att, consists of three components: two GRU state transition blocks and an attention mechanism ATT in between. The first transition block, GRU_1, combines the previously decoded symbol $y_{j-1}$ and the previous hidden state $s_{j-1}$ in order to generate an intermediate representation $s'_j$ with the following formulations:

$s'_j = \mathrm{GRU}_1(y_{j-1}, s_{j-1}) = (1 - z'_j) \odot \hat{s}'_j + z'_j \odot s_{j-1}$
$\hat{s}'_j = \tanh\left(W' E[y_{j-1}] + r'_j \odot (U' s_{j-1})\right)$
$r'_j = \sigma\left(W'_r E[y_{j-1}] + U'_r s_{j-1}\right)$
$z'_j = \sigma\left(W'_z E[y_{j-1}] + U'_z s_{j-1}\right)$

where $E$ is the target word embedding matrix, $\hat{s}'_j$ is the proposal intermediate representation, and $r'_j$ and $z'_j$ are the reset and update gate activations. In this formulation, $W'$, $U'$, $W'_r$, $U'_r$, $W'_z$, $U'_z$ are trained model parameters; $\sigma$ is the logistic sigmoid activation function.
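The initialization difference can be sketched in numpy. The weight matrix here is a randomly initialized stand-in for the trained parameter $W_{init}$, and all dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

T_x, annot_dim, state_dim = 5, 6, 4
h = rng.standard_normal((T_x, annot_dim))        # source annotations h_1..h_Tx
W_init = rng.standard_normal((annot_dim, state_dim))

# Nematus: initialize the decoder state from the mean of all source
# annotations, rather than from the last backward encoder state.
s0_mean = np.tanh(h.mean(axis=0) @ W_init)

print(s0_mean.shape)  # (4,)
```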
The attention mechanism, ATT, inputs the entire context set $C$ along with the intermediate hidden state $s'_j$ in order to compute the context vector $c_j$ as follows:

$c_j = \mathrm{ATT}(C, s'_j) = \sum_{i=1}^{T_x} \alpha_{ij} h_i$
$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{kj})}$
$e_{ij} = v_a^{\top} \tanh\left(U_a s'_j + W_a h_i\right)$

where $\alpha_{ij}$ is the normalized alignment weight between the source symbol at position $i$ and the target symbol at position $j$, and $v_a$, $U_a$, $W_a$ are trained model parameters.

Finally, the second transition block, GRU_2, generates $s_j$, the hidden state of the cGRU_att, by looking at the intermediate representation $s'_j$ and the context vector $c_j$ with the following formulations:

$s_j = \mathrm{GRU}_2(s'_j, c_j) = (1 - z_j) \odot \hat{s}_j + z_j \odot s'_j$
$\hat{s}_j = \tanh\left(W c_j + r_j \odot (U s'_j)\right)$
$r_j = \sigma\left(W_r c_j + U_r s'_j\right)$
$z_j = \sigma\left(W_z c_j + U_z s'_j\right)$

similarly, $\hat{s}_j$ being the proposal hidden state, and $r_j$ and $z_j$ being the reset and update gate activations, with trained model parameters $W$, $U$, $W_r$, $U_r$, $W_z$, $U_z$.

Note that the two GRU blocks are not individually recurrent; recurrence only occurs at the level of the whole cGRU layer. This way of combining RNN blocks is similar to what is referred to in the literature as deep transition RNNs (Pascanu et al., 2014; Zilly et al., 2016), as opposed to the more common stacked RNNs (Schmidhuber, 1992; El Hihi and Bengio, 1995; Graves, 2013).
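One full cGRU_att step (GRU_1, then ATT, then GRU_2) can be sketched in numpy. All weight matrices are randomly initialized stand-ins for trained parameters, and dimensions are illustrative, not Nematus defaults:

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

emb_dim, state_dim, annot_dim, T_x = 4, 5, 6, 7

# Stand-ins for trained parameters (GRU_1 primed, GRU_2 unprimed, attention).
W1, U1 = rng.standard_normal((emb_dim, state_dim)), rng.standard_normal((state_dim, state_dim))
W1r, U1r = rng.standard_normal((emb_dim, state_dim)), rng.standard_normal((state_dim, state_dim))
W1z, U1z = rng.standard_normal((emb_dim, state_dim)), rng.standard_normal((state_dim, state_dim))
Ua, Wa = rng.standard_normal((state_dim, state_dim)), rng.standard_normal((annot_dim, state_dim))
va = rng.standard_normal(state_dim)
W2, U2 = rng.standard_normal((annot_dim, state_dim)), rng.standard_normal((state_dim, state_dim))
W2r, U2r = rng.standard_normal((annot_dim, state_dim)), rng.standard_normal((state_dim, state_dim))
W2z, U2z = rng.standard_normal((annot_dim, state_dim)), rng.standard_normal((state_dim, state_dim))

def cgru_att_step(s_prev, e_prev, C):
    """One cGRU_att step: s_prev = s_{j-1}, e_prev = E[y_{j-1}], C = annotations."""
    # GRU_1: intermediate state s'_j from y_{j-1} and s_{j-1}.
    r1 = sigmoid(e_prev @ W1r + s_prev @ U1r)
    z1 = sigmoid(e_prev @ W1z + s_prev @ U1z)
    s_hat1 = np.tanh(e_prev @ W1 + r1 * (s_prev @ U1))
    s_prime = (1 - z1) * s_hat1 + z1 * s_prev
    # ATT: context vector c_j as an attention-weighted sum of annotations.
    e_scores = np.tanh(s_prime @ Ua + C @ Wa) @ va
    alpha = np.exp(e_scores - e_scores.max())
    alpha /= alpha.sum()
    c = alpha @ C
    # GRU_2: final state s_j from s'_j and c_j.
    r2 = sigmoid(c @ W2r + s_prime @ U2r)
    z2 = sigmoid(c @ W2z + s_prime @ U2z)
    s_hat2 = np.tanh(c @ W2 + r2 * (s_prime @ U2))
    return (1 - z2) * s_hat2 + z2 * s_prime, alpha

s_prev = rng.standard_normal(state_dim)
e_prev = rng.standard_normal(emb_dim)
C = rng.standard_normal((T_x, annot_dim))
s_j, alpha = cgru_att_step(s_prev, e_prev, C)
print(s_j.shape, alpha.shape)  # (5,) (7,)
```

Note that only the outer function is recurrent: feeding `s_j` back in as `s_prev` at the next step is the sole recurrence, mirroring the deep-transition structure described above.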
deep output: Given $s_j$, $y_{j-1}$, and $c_j$, the output probability $p(y_j \mid s_j, y_{j-1}, c_j)$ is computed by a softmax activation, using an intermediate representation $t_j$:

$t_j = \tanh\left(s_j W_{t_1} + E[y_{j-1}] W_{t_2} + c_j W_{t_3}\right)$
$p(y_j \mid s_j, y_{j-1}, c_j) = \mathrm{softmax}(t_j W_o)$

with $W_{t_1}$, $W_{t_2}$, $W_{t_3}$, $W_o$ the trained model parameters.
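A numpy sketch of the deep output layer, with randomly initialized stand-ins for the trained parameters and an illustrative vocabulary size:

```python
import numpy as np

rng = np.random.default_rng(3)

state_dim, emb_dim, annot_dim, hid_dim, vocab = 5, 4, 6, 8, 20

# Stand-ins for the trained parameters W_t1, W_t2, W_t3, W_o.
Wt1 = rng.standard_normal((state_dim, hid_dim))
Wt2 = rng.standard_normal((emb_dim, hid_dim))
Wt3 = rng.standard_normal((annot_dim, hid_dim))
Wo = rng.standard_normal((hid_dim, vocab))

def output_distribution(s_j, e_prev, c_j):
    """Deep output: tanh feedforward layer, then softmax over the vocabulary."""
    t_j = np.tanh(s_j @ Wt1 + e_prev @ Wt2 + c_j @ Wt3)
    logits = t_j @ Wo
    p = np.exp(logits - logits.max())   # numerically stable softmax
    return p / p.sum()

p = output_distribution(rng.standard_normal(state_dim),
                        rng.standard_normal(emb_dim),
                        rng.standard_normal(annot_dim))
print(p.shape)  # (20,)
```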

Training Algorithms
By default, the training objective in Nematus is cross-entropy minimization on a parallel training corpus. Training is performed via stochastic gradient descent, or one of its variants with adaptive learning rate (Adadelta (Zeiler, 2012), RmsProp (Tieleman and Hinton, 2012), Adam (Kingma and Ba, 2014)).
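Per target sentence, cross-entropy minimization reduces to summing the negative log-probabilities that the model assigns to the reference tokens. A minimal sketch, with random stand-ins for the decoder's output distributions:

```python
import numpy as np

rng = np.random.default_rng(4)

vocab = 10
# Stand-in for model output: one distribution over the vocabulary per
# target position, as produced by the decoder softmax (4 positions here).
probs = rng.dirichlet(np.ones(vocab), size=4)
reference = [3, 1, 7, 0]   # reference token ids, one per position

# Sentence-level cross-entropy: negative log-likelihood of the reference.
loss = -sum(np.log(probs[j, y]) for j, y in enumerate(reference))
print(loss > 0)  # True
```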
To stabilize training, Nematus supports early stopping based on cross entropy, or an arbitrary loss function defined by the user.
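A common way to implement such early stopping is a patience counter over periodic validation scores; this sketch is a generic illustration of the idea, not Nematus's exact stopping logic:

```python
def early_stop(dev_losses, patience=3):
    """Return the validation index at which training stops, i.e. when the
    dev loss has not improved for `patience` consecutive validations."""
    best, since_best = float("inf"), 0
    for step, loss in enumerate(dev_losses):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
        if since_best >= patience:
            return step
    return None  # stopping criterion never triggered

print(early_stop([2.0, 1.5, 1.4, 1.45, 1.43, 1.44]))  # 5
```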

Usability Features
In addition to the main algorithms to train and decode with an NMT model, Nematus includes features aimed at facilitating experimentation with the models, and their visualisation. Various model parameters are configurable via a command-line interface, and we provide extensive documentation of options, as well as sample set-ups for training systems.
Nematus provides support for applying single models, as well as for using multiple models in an ensemble; the latter is possible even if the model architectures differ, as long as the output vocabulary is the same. At each time step, the probability distribution of the ensemble is the geometric average of the individual models' probability distributions. The toolkit includes scripts for beam search decoding, parallel corpus scoring and n-best-list rescoring.
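The geometric average of per-model distributions is conveniently computed in log space and then renormalized. A sketch with random stand-ins for the individual models' outputs:

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-ins for the per-model output distributions at one decoding step:
# each row is one model's distribution over a shared output vocabulary.
model_probs = rng.dirichlet(np.ones(6), size=3)   # 3 models, vocab of 6

# Geometric average = arithmetic mean in log space, renormalized so the
# ensemble scores again form a probability distribution.
log_avg = np.log(model_probs).mean(axis=0)
ensemble = np.exp(log_avg - log_avg.max())
ensemble /= ensemble.sum()

print(ensemble.shape)  # (6,)
```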
Nematus includes utilities to visualise the attention weights for a given sentence pair, and to visualise the beam search graph. An example of the latter is shown in Figure 1. Our demonstration will cover how to train a model using the command-line interface, and will show various functionalities of Nematus, including decoding and visualisation, with pre-trained models.4

Conclusion
We have presented Nematus, a toolkit for Neural Machine Translation. We have described implementation differences to the architecture by Bahdanau et al. (2015); due to the empirically strong performance of Nematus, we consider these to be of wider interest.
We hope that researchers will find Nematus an accessible and well documented toolkit to support their research. The toolkit is by no means limited to research, and has been used to train MT systems that are currently in production (WIPO, 2016).
Nematus is available under a permissive BSD license.