RETURNN as a Generic Flexible Neural Toolkit with Application to Translation and Speech Recognition

We show that RETURNN trains and decodes attention models for translation quickly, owing to fast CUDA LSTM kernels and a fast pure TensorFlow beam search decoder. We show that a layer-wise pretraining scheme for recurrent attention models gives an absolute improvement of over 1% BLEU and allows training deeper recurrent encoder networks. Promising preliminary results on maximum expected BLEU training are presented. We are able to train state-of-the-art models for translation and end-to-end models for speech recognition, and show results on WMT 2017 and Switchboard. The flexibility of RETURNN allows a fast research feedback loop for experimenting with alternative architectures, and its generality allows its use in a wide range of applications.


Introduction
RETURNN, the RWTH extensible training framework for universal recurrent neural networks, was introduced in (Doetsch et al., 2017). The source code is fully open¹. It can use Theano (Theano Development Team, 2016) or TensorFlow (TensorFlow Development Team, 2015) for its computation. Since its introduction, it has been extended with comprehensive TensorFlow support. A generic recurrent layer allows for a wide range of encoder-decoder-attention or other recurrent structures. An automatic optimization logic can optimize the computation graph depending on training, scheduled sampling, sequence training, or beam search decoding. The automatic optimization together with our fast native CUDA LSTM kernels allows for very fast training and decoding. We will show in speed comparisons with Sockeye (Hieber et al., 2017) that we are at least as fast, and usually faster, in both training and decoding. Additionally, we show in experiments that we can train very competitive models for machine translation and speech recognition. This flexibility together with the speed is the biggest strength of RETURNN.
Our focus will be on recurrent attention models. We introduce a layer-wise pretraining scheme for attention models and show its significant effect on deep recurrent encoder models. We show promising preliminary results on maximum expected BLEU training. The configuration files of all the experiments are publicly available².

Related work
Multiple frameworks exist for training attention models, most of which are focused on machine translation.
• Sockeye (Hieber et al., 2017) is a generic framework based on MXNet (Chen et al., 2015). It is the most comparable to RETURNN as it is generic, although we argue that RETURNN is more flexible and faster.
• OpenNMT (Levin et al., 2017a,b) is based on Lua (Ierusalimschy et al., 2006), and its development is discontinued. Separate, more recent PyTorch (PyTorch Development Team, 2018) and TensorFlow implementations exist. We will demonstrate that RETURNN is more flexible.
• Nematus (Sennrich et al., 2017) is based on Theano (Theano Development Team, 2016), whose development is going to be discontinued. We show that RETURNN is much faster in both training and decoding, as can be concluded from our speed comparison to Sockeye and the comparisons performed by the Sockeye authors (Hieber et al., 2017).
• Marian (Junczys-Dowmunt et al., 2016) is implemented directly in C++ for performance reasons. Again, from our speed comparisons and the comparisons performed by the Sockeye authors (Hieber et al., 2017), one can conclude that RETURNN is very competitive in terms of speed, but is much more flexible.
• NeuralMonkey (Helcl and Libovickỳ, 2017) is based on TensorFlow (TensorFlow Development Team, 2015). This framework is not as flexible as RETURNN. Here, too, we can conclude as before that RETURNN is much faster in both training and decoding.
• Tensor2Tensor (Vaswani et al., 2018) is based on TensorFlow (TensorFlow Development Team, 2015). It comes with the reference implementation of the Transformer model (Vaswani et al., 2017). However, it lacks support for recurrent decoder models and is overall considerably less flexible than RETURNN.

Speed comparison
Various improved and fast CUDA LSTM kernels are available for the TensorFlow backend in RETURNN. A comparison of the speed of its own LSTM kernel vs. other TensorFlow LSTM kernels can be found on the website³. In addition, an automatic optimization path, which moves as much computation as possible out of the recurrent loop, improves the performance. We want to compare different toolkits in training and decoding for a recurrent attention model in terms of speed on a GPU. Here, we try to maximize the batch size such that it still fits into the GPU memory of our reference GPU card, the Nvidia GTX 1080 Ti with 11 GB of memory. We keep the maximum sequence length in a batch the same, which is 60 words. We always use Adam (Kingma and Ba, 2014) for training. In Table 1, we see that RETURNN is the fastest, and is also the most efficient in its memory consumption (implied by the larger batches). For these speed experiments, we did not tune any of the hyperparameters of RETURNN, which explains its worse performance. The aim here is to match Sockeye's exact architecture for speed and memory comparison. [...] more pessimistic, i.e. the decrease is slower and it sees the data more often until convergence. This greatly increases the total training time but in our experience also improves the model.

Table 1: Training speed and memory consumption on WMT 2017 German→English. Train time is for seeing the full train dataset once. Batch size is in words, such that it almost maximizes the GPU memory consumption. The BLEU score is for the converged models, reported for newstest2015 (dev) and newstest2017. The encoder has one bidirectional LSTM layer and either 3 or 5 unidirectional LSTM layers.
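To make the notion of a batch size measured in words concrete, the following is a minimal sketch of how sequences could be packed into batches under a word budget. It is not RETURNN's batching code. The function name `make_batches`, the choice to count padded positions (number of sequences times the longest sequence in the batch), and the choice to drop sequences longer than 60 words are all assumptions made for illustration.

```python
from typing import List


def make_batches(seq_lens: List[int], max_words: int, max_seq_len: int = 60) -> List[List[int]]:
    """Group sequence indices into batches whose padded size stays within max_words.

    Assumption: the batch size counts padded positions, i.e.
    (number of sequences) * (length of the longest sequence in the batch).
    """
    # Assumption: sequences longer than max_seq_len are skipped.
    # Sorting by length keeps the amount of padding per batch small.
    order = sorted(
        (i for i, n in enumerate(seq_lens) if n <= max_seq_len),
        key=lambda i: seq_lens[i],
    )
    batches: List[List[int]] = []
    current: List[int] = []
    cur_max = 0
    for i in order:
        new_max = max(cur_max, seq_lens[i])
        # Close the current batch if adding this sequence would exceed the budget.
        if current and new_max * (len(current) + 1) > max_words:
            batches.append(current)
            current = []
            new_max = seq_lens[i]
        current.append(i)
        cur_max = new_max
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Five sentences with these lengths and a (tiny) budget of 20 words per batch.
    print(make_batches([5, 61, 10, 3, 7], max_words=20))
```

With a larger word budget, more sequences fit into one batch, which is why a larger feasible batch size in Table 1 implies a lower memory footprint per word.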
For decoding, we extend RETURNN with a fast pure TensorFlow beam search decoder, which supports batch decoding and can run on the GPU. A speed and memory consumption comparison is shown in Table 2. We see that RETURNN is the fastest. We report results for the batch size that yields the best speed. The slow speed of Sockeye is due to frequent cross-device communication.

Table 2: Decoding speed and memory consumption on WMT 2017 German→English. Time is for decoding the whole dataset, reported for newstest2015 (dev) and newstest2017, with beam size 12. Batch size is the number of sequences, such that it optimizes the decoding speed. This does not mean that it uses the whole GPU memory. These are the same models as in Table 1.
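For readers unfamiliar with beam search, the procedure can be sketched in a few lines of plain Python. This is only an illustration of the general algorithm for a single sequence. It is not RETURNN's decoder, which is implemented in TensorFlow, works on batches, and runs on the GPU. The function name `beam_search`, the `step_log_probs` callback, and the lack of length normalization are assumptions of this sketch.

```python
import math
from typing import Callable, List, Sequence, Tuple

Hyp = Tuple[float, Tuple[int, ...]]  # (accumulated log probability, label sequence)


def beam_search(
    step_log_probs: Callable[[Tuple[int, ...]], Sequence[float]],
    beam_size: int,
    eos: int,
    max_len: int,
) -> List[Hyp]:
    """Return hypotheses sorted by score, best first (no length normalization)."""
    beam: List[Hyp] = [(0.0, ())]
    finished: List[Hyp] = []
    for _ in range(max_len):
        # Expand every active hypothesis by every possible next label.
        candidates: List[Hyp] = []
        for score, prefix in beam:
            for label, lp in enumerate(step_log_probs(prefix)):
                candidates.append((score + lp, prefix + (label,)))
        # Keep only the beam_size best expansions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for cand in candidates[:beam_size]:
            (finished if cand[1][-1] == eos else beam).append(cand)
        if not beam:
            break
    finished.extend(beam)  # hypotheses that hit max_len without end-of-sequence
    return sorted(finished, key=lambda c: c[0], reverse=True)


if __name__ == "__main__":
    # Toy model: the same distribution over 3 labels at every step; label 0 ends the sequence.
    toy = lambda prefix: [math.log(0.5), math.log(0.3), math.log(0.2)]
    for score, labels in beam_search(toy, beam_size=2, eos=0, max_len=3):
        print(labels, round(math.exp(score), 3))
```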

Performance comparison
We want to study what performance we can obtain with each framework on a specific task. We restrict this comparison here to recurrent attention models.
The first task is the WMT 2017 German to English translation task. We use the same 20K byte-pair encoding subword units in all toolkits (Sennrich et al., 2015). We also use Adam (Kingma and Ba, 2014) in all cases. The learning rate scheduling is also similar. In RETURNN, we use a 6 layer bidirectional encoder, trained with pretraining and label smoothing. It has bidirectional LSTMs in every layer of the encoder, unlike Sockeye, which only has the first layer bidirectional. We use a variant of attention weight / fertility feedback (Tu et al., 2016), which is inverse in our case, so that we use a multiplication instead of a division, for better numerical stability. Our model was derived from the model presented by (Bahar et al., 2017; Peter et al., 2017) and (Bahdanau et al., 2014).
We report the best performing Sockeye model we trained, which has 1 bidirectional and 3 unidirectional encoder layers, 1 pre-attention target recurrent layer, and 1 post-attention decoder layer. We trained with a max sequence length of 75, and used the 'coverage' RNN attention type. For Sockeye, the final model is an average of the 4 best runs according to the development perplexity. The results are collected in Table 3. We obtain the best results with Sockeye using a Transformer network model (Vaswani et al., 2017) [...]

We compare RETURNN to other toolkits on the WMT 2017 English→German translation task in Table 4. We observe that our toolkit outperforms all other toolkits. The best result obtained by other toolkits is using Marian (25.5% BLEU). In comparison, RETURNN achieves 26.1%. We also compare RETURNN to the best performing single systems of WMT 2017. In comparison to the fine-tuned evaluation systems that also include back-translated data, our model performs worse by only 0.3 to 0.9 BLEU. We did not run experiments with back-translated data, which can potentially boost the performance by several BLEU points.
Table 4: [...] English→German. The baseline systems (upper half) are trained on the parallel data of the WMT 2017 English→German task. We downloaded the hypotheses from here⁴. The WMT 2017 system hypotheses (lower half) are generated using systems having additional back-translation (bt) data. These hypotheses are downloaded from here⁵.

We also have preliminary results with recurrent attention models for speech recognition on the Switchboard task, which we trained on the 300h train set. We report on both the Switchboard (SWB) and the CallHome (CH) parts of Hub5'00 and Hub5'01. We also compare to a conventional frame-wise trained hybrid deep bidirectional LSTM with 6 layers (Zeyer et al., 2017b), and a generalized full-sum sequence trained hybrid deep bidirectional LSTM with 5 layers (Zeyer et al., 2017a). The frame-wise trained hybrid model also uses focal loss (Lin et al., 2017). All the hybrid models use a phonetic lexicon and an external 4-gram language model which was trained on the transcripts of both the Switchboard and the Fisher corpus. The attention model uses neither an external language model nor a phonetic lexicon. Its output labels are byte-pair encoded subword units (Sennrich et al., 2015). It has a 6 layer bidirectional encoder, which also applies max-pooling in the time dimension, i.e. it reduces the length of the input sequence by a factor of 8. Pretraining as explained in Section 6 was applied. To our knowledge, this is the best reported result for an end-to-end system on Switchboard 300h without using a language model or the lexicon. For comparison, we also selected comparable results from the literature. Of these, the Baidu DeepSpeech CTC model is modeled on characters and does not use the lexicon, but it does use a language model. The results are collected in Table 5.
[...] (Saon et al., 2017). hybrid 2 is trained with lattice-free MMI (Hadian et al., 2018). CTC 3 is the Baidu 2014 DeepSpeech model (Hannun et al., 2014). Our attention model does not use any language model.

[...] minimization, following (Prabhavalkar et al., 2017; Edunov et al., 2017). The results are still preliminary but promising. We do the approximation by beam search with beam size 4. For a 4 layer encoder network model, with forced alignment cross entropy training, we get 30.3% BLEU, and when we use maximum expected BLEU training, we get 31.1% BLEU.
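As an illustration of the training criterion, the expected BLEU over the hypotheses found by beam search can be computed as sketched below. This is a hedged sketch, not RETURNN's implementation. It assumes that the model probabilities are renormalized over the n-best list, in the style of the n-best approximations in the cited work, and that a sentence-level BLEU score is already available for each hypothesis. The function name `expected_bleu_loss` is hypothetical.

```python
import math
from typing import Sequence


def expected_bleu_loss(log_probs: Sequence[float], bleu: Sequence[float]) -> float:
    """Negative expected BLEU over an n-best list.

    log_probs: model log probabilities of the beam hypotheses.
    bleu: sentence-level BLEU of each hypothesis (assumed to be given).
    Assumption: probabilities are renormalized over the n-best list.
    """
    # Softmax over the hypotheses, computed in a numerically stable way.
    m = max(log_probs)
    weights = [math.exp(lp - m) for lp in log_probs]
    z = sum(weights)
    # Minimizing the negative expectation maximizes the expected BLEU.
    return -sum(w / z * b for w, b in zip(weights, bleu))


if __name__ == "__main__":
    # Two hypotheses with model probabilities 0.4 and 0.1 and BLEU 0.5 and 1.0.
    print(expected_bleu_loss([math.log(0.4), math.log(0.1)], [0.5, 1.0]))
```

Shifting probability mass towards the hypothesis with the higher BLEU lowers this loss, which is the intended training signal.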

Pretraining
RETURNN supports very generic and flexible pretraining which iteratively starts with a small model and adds new layers in the process. A similar pretraining scheme for deep bidirectional LSTM acoustic models for speech was presented earlier (Zeyer et al., 2017b). Here, we only study a layer-wise construction of the deep bidirectional LSTM encoder network of an encoder-decoder-attention model for translation on the WMT 2017 German→English task. Experimental results are presented in Table 6. The observations clearly match our expectations: pretraining both greatly improves the overall performance and enables us to train deeper models. A minor benefit is the faster training speed of the initial pretrain epochs. In preliminary recurrent attention experiments for speech recognition, pretraining seems essential to get good performance.
Also, in all cases we use a learning rate scheduling scheme which lowers the learning rate if the cross-validation score does not improve enough. Without pretraining and with a 2 layer encoder in the same setting as above, we get 28.4% BLEU with a fixed learning rate, whereas with learning rate scheduling, we get 29.3% BLEU.
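The layer-wise construction can be sketched as a function that returns the encoder part of the network for a given pretraining step. This is an illustrative sketch only. The function name `encoder_for_pretrain_step`, the start at 2 layers, the growth by one layer per step, the layer names, the layer size, and the LSTM unit name are all assumptions and are not taken from the experiments in Table 6.

```python
from typing import Dict, List


def encoder_for_pretrain_step(
    step: int, final_num_layers: int = 6, start_num_layers: int = 2
) -> Dict[str, dict]:
    """Build a bidirectional LSTM encoder that grows by one layer per pretraining step.

    All names and sizes below are illustrative assumptions.
    """
    num_layers = min(start_num_layers + step, final_num_layers)
    net: Dict[str, dict] = {}
    sources: List[str] = ["src"]  # the first layer reads the source embedding
    for i in range(num_layers):
        for name, direction in (("fw", 1), ("bw", -1)):
            net[f"enc{i}_{name}"] = {
                "class": "rec",
                "unit": "nativelstm2",  # assumed unit name
                "n_out": 1000,
                "direction": direction,
                "from": list(sources),
            }
        # The next layer reads both directions of the current one.
        sources = [f"enc{i}_fw", f"enc{i}_bw"]
    # The decoder always attends to a layer called "encoder", whatever the depth.
    net["encoder"] = {"class": "copy", "from": sources}
    return net


if __name__ == "__main__":
    for step in (0, 2, 4):
        layers = encoder_for_pretrain_step(step)
        print(step, sorted(k for k in layers if k.endswith("_fw")))
```

Because the top layer is always exposed under the same name, the decoder definition stays unchanged while the encoder deepens, and parameters of layers that already exist can be carried over from one step to the next.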
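A learning rate scheduling rule of this kind can be sketched as follows. The sketch assumes a cross-validation score where lower is better (for example perplexity). The threshold of 1% relative improvement and the decay factor of 0.7 are placeholder values chosen for illustration, not the settings used in the experiments, and the function name `next_learning_rate` is hypothetical.

```python
from typing import List


def next_learning_rate(
    lr: float,
    cv_scores: List[float],
    min_rel_improvement: float = 0.01,  # assumed threshold
    decay: float = 0.7,  # assumed decay factor
) -> float:
    """Lower the learning rate if the last epoch did not improve the score enough.

    cv_scores holds one cross-validation score per finished epoch, lower is better.
    """
    if len(cv_scores) < 2:
        return lr  # nothing to compare against yet
    prev, cur = cv_scores[-2], cv_scores[-1]
    rel_improvement = (prev - cur) / max(abs(prev), 1e-12)
    if rel_improvement < min_rel_improvement:
        return lr * decay
    return lr


if __name__ == "__main__":
    lr = 0.001
    history: List[float] = []
    for score in [2.0, 1.5, 1.495, 1.3]:
        history.append(score)
        lr = next_learning_rate(lr, history)
        print(score, lr)
```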

RETURNN features
Besides the fast speed, and the many features such as pretraining, scheduled sampling (Bengio et al., 2015), label smoothing (Szegedy et al., 2016), and the ability to train state-of-the-art models, one of the greatest strengths of RETURNN is its flexibility. The definition of the recurrent dependencies and the whole model architecture are provided in a very explicit way via a config file. Thus, e.g. trying out a new kind of attention scheme, adding a new latent variable to the search space, or drastically changing the whole architecture is already supported and requires no further implementation in RETURNN. All of that can be expressed by the neural network definition in the config. A (simplified) example of a network definition is given in Listing 1.
Each layer in this definition does some computation, specified via the class attribute, and gets its input from other layers via the from attribute, or from the input data, in the case of the layer src. The output layer defines a whole subnetwork, which can make use of recurrent dependencies via a prev: prefix. Depending on whether training or decoding is done, the choice layer class would return the true labels or the predicted labels. In the case of scheduled sampling or max BLEU training, we can also use the predicted label during training. Depending on this configuration, during compilation of the computation graph, RETURNN figures out that certain calculations can be moved out of the recurrent loop. This automatic optimization also adds to the speedup. This flexibility and ease of trying out new architectures and models allow for a very efficient development / research feedback loop. Fast, consistent and robust feedback greatly helps productivity and quality. This is very different from other toolkits, which only support a predefined set of architectures.
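To illustrate the structure described above, the following is a minimal sketch of what such a network definition could look like as a Python dictionary. It is not the original Listing 1. The layer names, sizes, and the specific layer classes and options used for the attention mechanism are assumptions and should be checked against the RETURNN documentation. Only the class and from attributes, the src and output layers, the choice layer, and the prev: prefix are taken from the description in the text. The base: prefix for referring to layers outside the subnetwork is likewise an assumption. The helper `recurrent_inputs` is hypothetical and only inspects the dictionary.

```python
from typing import Dict, List

# Illustrative network definition; names, sizes and attention layers are assumptions.
network = {
    # Source embedding; it has no "from" and reads the input data.
    "src": {"class": "linear", "activation": None, "n_out": 620},
    # One bidirectional LSTM encoder layer (deeper encoders stack more of these).
    "enc0_fw": {"class": "rec", "unit": "nativelstm2", "n_out": 1000, "direction": 1, "from": ["src"]},
    "enc0_bw": {"class": "rec", "unit": "nativelstm2", "n_out": 1000, "direction": -1, "from": ["src"]},
    "encoder": {"class": "copy", "from": ["enc0_fw", "enc0_bw"]},
    "enc_ctx": {"class": "linear", "activation": None, "n_out": 1000, "from": ["encoder"]},
    # Decoder: a recurrent layer whose unit is a whole subnetwork.
    "output": {"class": "rec", "from": [], "unit": {
        # True labels in training, predicted labels in decoding.
        "output": {"class": "choice", "target": "classes", "beam_size": 12, "from": ["prob"]},
        "trg": {"class": "linear", "activation": None, "n_out": 620, "from": ["output"]},
        # Decoder state; depends on the previous step via "prev:".
        "s": {"class": "rnn_cell", "unit": "LSTMBlock", "n_out": 1000, "from": ["prev:trg", "prev:att"]},
        # Additive attention over the encoder output.
        "s_tr": {"class": "linear", "activation": None, "n_out": 1000, "from": ["s"]},
        "e_in": {"class": "combine", "kind": "add", "from": ["base:enc_ctx", "s_tr"]},
        "e_tanh": {"class": "activation", "activation": "tanh", "from": ["e_in"]},
        "e": {"class": "linear", "activation": None, "n_out": 1, "from": ["e_tanh"]},
        "a": {"class": "softmax_over_spatial", "from": ["e"]},
        "att": {"class": "generic_attention", "weights": "a", "base": "base:encoder"},
        "readout": {"class": "linear", "activation": "relu", "n_out": 1000, "from": ["s", "prev:trg", "att"]},
        "prob": {"class": "softmax", "target": "classes", "loss": "ce", "from": ["readout"]},
    }},
}


def recurrent_inputs(subnet: Dict[str, dict]) -> Dict[str, List[str]]:
    """Map each layer of a subnetwork to its inputs taken from the previous time step."""
    result: Dict[str, List[str]] = {}
    for name, layer in subnet.items():
        prev = [src for src in layer.get("from", []) if src.startswith("prev:")]
        if prev:
            result[name] = prev
    return result


if __name__ == "__main__":
    print(recurrent_inputs(network["output"]["unit"]))
```

In this sketch, only s and readout depend on the previous time step. A layer such as enc_ctx does not depend on the decoder state at all, which is the kind of computation that can be done once outside the recurrent loop.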

Conclusion
We have demonstrated many promising features of RETURNN and presented state-of-the-art systems in translation and speech recognition. We argue that it is a convenient testbed for research and applications. We introduced pretraining for recurrent attention models and showed that it brings advantages without any disadvantages. Maximum expected BLEU training seems to be promising.

Table 3: Comparison on German→English.

Table 5: Performance comparison on Switchboard, trained on 300h. hybrid 1 is the IBM 2017 ResNet model