CytonMT: an Efficient Neural Machine Translation Open-source Toolkit Implemented in C++

This paper presents an open-source neural machine translation toolkit named CytonMT. The toolkit is built from scratch using only C++ and NVIDIA’s GPU-accelerated libraries. It features training efficiency, code simplicity and translation quality. Benchmarks show that CytonMT accelerates training by 64.5% to 110.8% on neural networks of various sizes, and achieves competitive translation quality.

Using scripting languages and third-party GPU platforms is a double-edged sword. On one hand, it greatly reduces the workload of coding neural networks. On the other hand, it causes two problems:
• Running efficiency drops, and profiling and optimization become difficult, because direct access to GPUs is blocked by the language interpreters or the platforms. NMT systems typically require days or weeks to train, so training efficiency is a paramount concern. Slightly faster training can make the difference between plausible and impossible experiments (Klein et al., 2017).
• Researchers using these toolkits may be constrained by the platforms. Unexplored computations or operations may be disallowed or unnecessarily inefficient on a third-party platform, which lowers the chances of developing novel neural network techniques.

Toolkit          Language   Platform
[...]            Python     Theano
BPE-char         Python     Theano
Nematus          Python     Theano
OpenNMT          Lua        Torch
Seq2seq          Python     TensorFlow
ByteNet          Python     TensorFlow
ConvS2S          Lua        Torch
Tensor2Tensor    Python     TensorFlow
Marian           C++        -
CytonMT          C++        -

CytonMT is developed to address these issues, in the hope of providing the community with an attractive alternative. The toolkit is written in C++, the language officially supported by NVIDIA, the manufacturer of the most widely used GPU hardware. This gives the toolkit an efficiency advantage over other toolkits.
Implementing in C++ also gives CytonMT great flexibility and freedom in coding. Researchers who are interested in the actual calculations inside neural networks can trace the source code down to kernel functions, matrix operations or NVIDIA's APIs, and then modify them freely to test novel ideas.
The code simplicity of CytonMT is comparable to that of NMT toolkits implemented in scripting languages. This is owing to an open-source general-purpose neural network library in C++, named CytonLib, which is shipped as part of the source code. The library defines a simple and friendly pattern for users to build arbitrary network architectures at the cost of two lines of genuine C++ code per layer.
CytonMT achieves competitive translation quality, which is the main purpose of NMT toolkits. It implements the popular attention-based RNN encoder-decoder framework. Among the reported systems of the same architecture, it ranks among the top on the benchmarks of both the WMT14 and WMT17 English-to-German tasks.
The remainder of this paper presents the details of CytonMT in terms of method, implementation, benchmarks, and future work.

Method
The toolkit approaches the problem of machine translation using the attention-based RNN encoder-decoder proposed by Bahdanau et al. (2014) and Luong et al. (2015a). Figure 1 illustrates the architecture. The conditional probability of a translation given a source sentence is formulated with the following notation: $x$ is a source sentence; $y = (y_1, \ldots, y_m)$ is a translation; $H_s$ is a source-side top-layer hidden state; $H_t^{j}$ is a target-side top-layer hidden state; $H_o^{j}$ is a state generated by an attention model $F_{att}$; $W_o$ and $B_o$ are the weight and bias of an output embedding.
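Under these definitions, the formulation can be sketched as below. This is a reconstruction from the stated symbols rather than a verbatim copy of the equations. In particular, it assumes a plain softmax over the output embedding; any additional nonlinearity applied before the softmax is not shown.

```latex
\begin{align}
\log p(y \mid x) &= \sum_{j=1}^{m} \log p\left(y_j \mid H_o^{j}\right)
  = \sum_{j=1}^{m} \log \operatorname{softmax}_{y_j}\!\left(W_o H_o^{j} + B_o\right), \\
H_o^{j} &= F_{att}\left(H_s, H_t^{j}\right).
\end{align}
```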
The toolkit adopts the multiplicative attention model proposed by Luong et al. (2015a), because it is slightly more efficient than the additive variant proposed by Bahdanau et al. (2014). This efficiency difference is discussed in Britz et al. (2017) and Vaswani et al. (2017). Figure 2 illustrates the model, which is formulated with the following notation: $F_a$ is a scoring function for alignment; $W_a$ is a matrix that linearly maps target-side hidden states into a space comparable to the source side; $a_{st}^{ij}$ is an alignment coefficient; $C_s^{j}$ is a source-side context; $C_{st}^{j}$ is a context derived from both sides.
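To make the multiplicative attention concrete, the following is a minimal CPU sketch in the general form of Luong et al. (2015a): the score is the dot product of a source state with the mapped target state, the alignment coefficients are a softmax over source positions, and the source-side context is the weighted sum of source states. The type and function names (`Vec`, `Mat`, `AttentionResult`, `multiplicativeAttention`) are hypothetical and are not taken from CytonMT, whose implementation runs on the GPU. The final step that combines the context with the target state is omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // row-major: Mat[r][c]

// Hypothetical result type: alignment coefficients and source-side context.
struct AttentionResult {
    Vec weights;  // one coefficient per source position i
    Vec context;  // weighted sum of source states
};

// Multiplies a (rows x cols) matrix by a vector of length cols.
Vec matVec(const Mat& W, const Vec& v) {
    Vec out(W.size(), 0.0);
    for (std::size_t r = 0; r < W.size(); ++r) {
        assert(W[r].size() == v.size());
        for (std::size_t c = 0; c < v.size(); ++c) out[r] += W[r][c] * v[c];
    }
    return out;
}

// Numerically stable softmax: subtracts the maximum before exponentiating.
Vec softmax(const Vec& scores) {
    assert(!scores.empty());
    const double maxScore = *std::max_element(scores.begin(), scores.end());
    Vec out(scores.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        out[i] = std::exp(scores[i] - maxScore);
        sum += out[i];
    }
    for (double& p : out) p /= sum;
    return out;
}

// Hs: source states (one row per position), Wa: mapping matrix, ht: target state.
AttentionResult multiplicativeAttention(const Mat& Hs, const Mat& Wa, const Vec& ht) {
    assert(!Hs.empty());
    const Vec mapped = matVec(Wa, ht);  // target state mapped into the source-side space
    Vec scores(Hs.size(), 0.0);
    for (std::size_t i = 0; i < Hs.size(); ++i) {
        assert(Hs[i].size() == mapped.size());
        for (std::size_t k = 0; k < mapped.size(); ++k) scores[i] += Hs[i][k] * mapped[k];
    }
    AttentionResult result;
    result.weights = softmax(scores);
    result.context.assign(mapped.size(), 0.0);
    for (std::size_t i = 0; i < Hs.size(); ++i)
        for (std::size_t k = 0; k < mapped.size(); ++k)
            result.context[k] += result.weights[i] * Hs[i][k];
    return result;
}
```

The multiplicative score needs only one matrix-vector product per target step plus dot products, whereas the additive variant evaluates a small feed-forward network for every source-target pair, which is consistent with the efficiency argument above.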

Implementation
The toolkit consists of a general-purpose neural network library and a neural machine translation system built upon the library. The neural network library defines a class named Network to facilitate the construction of arbitrary neural networks. Users only need to inherit from the class, declare components as data members, and write two lines of code per component in an initialization function. For example, the complete code of the attention network formulated by Equations 3 to 7 is presented in Figure 3. This piece of code builds a neural network as follows:
• The class Variable stores numeric values and gradients. Components are connected together by passing pointers to Variable objects around.
• The data member layers collects all the components. The base class Network calls the functions forward, backward and calculateGradient of each component to perform the actual computation.
The code for the actual computation is organized in the functions forward, backward and calculateGradient of each type of component. Figure 4 presents some examples. Note that this code has been slightly simplified for illustration.
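Since the figures are not reproduced here, the pattern described above can be illustrated with a small, self-contained sketch. Only the names Network, Variable, layers, forward, backward and calculateGradient come from the text. The `Layer` base class, the `init` signatures, and the toy `ScaleLayer`, `TanhLayer` and `ToyNetwork` classes are hypothetical and should not be read as CytonLib's actual API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Holds numeric values and their gradients, as Variable does in the text.
struct Variable {
    std::vector<double> data, grad;
    void resize(std::size_t n) { data.assign(n, 0.0); grad.assign(n, 0.0); }
};

// Interface shared by all components; the three method names come from the text.
class Layer {
public:
    virtual ~Layer() = default;
    virtual void forward() = 0;
    virtual void backward() = 0;
    virtual void calculateGradient() {}  // overridden only by layers with parameters
};

// Hypothetical toy layer: y = scale * x with one learnable scalar.
class ScaleLayer : public Layer {
public:
    double scale = 1.0, scaleGrad = 0.0;
    Variable* init(Variable* input, double initialScale) {
        assert(input != nullptr);
        x = input;
        x->grad.resize(x->data.size(), 0.0);
        scale = initialScale;
        y.resize(x->data.size());
        return &y;
    }
    void forward() override {
        for (std::size_t i = 0; i < y.data.size(); ++i) y.data[i] = scale * x->data[i];
    }
    void backward() override {  // gradient with respect to the input
        for (std::size_t i = 0; i < y.grad.size(); ++i) x->grad[i] = scale * y.grad[i];
    }
    void calculateGradient() override {  // gradient with respect to the parameter
        scaleGrad = 0.0;
        for (std::size_t i = 0; i < y.grad.size(); ++i) scaleGrad += x->data[i] * y.grad[i];
    }
private:
    Variable* x = nullptr;
    Variable y;
};

// Hypothetical toy layer: element-wise tanh, no parameters.
class TanhLayer : public Layer {
public:
    Variable* init(Variable* input) {
        assert(input != nullptr);
        x = input;
        x->grad.resize(x->data.size(), 0.0);
        y.resize(x->data.size());
        return &y;
    }
    void forward() override {
        for (std::size_t i = 0; i < y.data.size(); ++i) y.data[i] = std::tanh(x->data[i]);
    }
    void backward() override {
        for (std::size_t i = 0; i < y.grad.size(); ++i)
            x->grad[i] = (1.0 - y.data[i] * y.data[i]) * y.grad[i];
    }
private:
    Variable* x = nullptr;
    Variable y;
};

// Base class that drives every registered component.
class Network {
public:
    virtual ~Network() = default;
    void forward() { for (Layer* l : layers) l->forward(); }
    void backward() {
        for (auto it = layers.rbegin(); it != layers.rend(); ++it) (*it)->backward();
    }
    void calculateGradient() { for (Layer* l : layers) l->calculateGradient(); }
protected:
    std::vector<Layer*> layers;
};

// A user-defined network: components are data members, two lines each in init.
class ToyNetwork : public Network {
public:
    ScaleLayer scale;
    TanhLayer activation;
    Variable* init(Variable* x) {
        Variable* tx = scale.init(x, 2.0);  // line 1: connect the component
        layers.push_back(&scale);           // line 2: register it
        tx = activation.init(tx);
        layers.push_back(&activation);
        return tx;
    }
};
```

In this sketch each backward call overwrites the input gradient, which is sufficient for a simple chain; a variable consumed by several components would need gradient accumulation instead.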

Settings
CytonMT is tested on the widely used benchmarks of the WMT14 and WMT17 English-to-German tasks (Bojar et al., 2017) (Table 2). Both datasets are processed and converted using byte-pair encoding (Gage, 1994; Schuster and Nakajima, 2012) with a shared source-target vocabulary of about 37000 tokens. The WMT14 corpora are processed by the scripts from Vaswani et al. (2017) (see footnote 13). CytonMT is run with the hyperparameter settings presented in Table 3 unless stated otherwise. The settings provide both fast training and competitive translation quality according to our experiments on a variety of translation tasks. Dropout is applied to the hidden states between the non-top recurrent layers $R_s$, $R_t$ and to the output $H_o$, following Wang et al. (2017). Label smoothing estimates the marginalized effect of label dropout during training, which makes models learn to be more unsure (Szegedy et al., 2016); this has been reported to improve BLEU scores (Vaswani et al., 2017). Length penalty is applied using the formula of Wu et al. (2016).
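As an illustration of the last two techniques, the sketch below implements the length penalty of Wu et al. (2016), lp(Y) = ((5 + |Y|) / 6)^α, and label smoothing in the sense of Szegedy et al. (2016), where the one-hot target is mixed with a uniform distribution. The function names and the values of α and ε in the usage below are placeholders, not the settings of Table 3, and the coverage penalty of Wu et al. (2016) is left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Length penalty of Wu et al. (2016): lp(Y) = ((5 + |Y|) / 6)^alpha.
double lengthPenalty(std::size_t length, double alpha) {
    return std::pow((5.0 + static_cast<double>(length)) / 6.0, alpha);
}

// Hypothesis score used to rank beam candidates: the log-probability
// divided by the length penalty (coverage penalty omitted).
double normalizedScore(double logProb, std::size_t length, double alpha) {
    return logProb / lengthPenalty(length, alpha);
}

// Label smoothing: (1 - epsilon) on the true label plus epsilon spread
// uniformly over the whole vocabulary.
std::vector<double> smoothedTarget(std::size_t vocabSize, std::size_t label, double epsilon) {
    assert(vocabSize > 0 && label < vocabSize);
    assert(epsilon >= 0.0 && epsilon <= 1.0);
    std::vector<double> q(vocabSize, epsilon / static_cast<double>(vocabSize));
    q[label] += 1.0 - epsilon;
    return q;
}

// Cross entropy between a target distribution and predicted probabilities.
double crossEntropy(const std::vector<double>& target, const std::vector<double>& probs) {
    assert(target.size() == probs.size());
    double loss = 0.0;
    for (std::size_t k = 0; k < target.size(); ++k) {
        if (target[k] > 0.0) loss -= target[k] * std::log(probs[k]);
    }
    return loss;
}
```

For example, `normalizedScore(logProb, length, 0.6)` can be used to compare hypotheses of different lengths, and `crossEntropy(smoothedTarget(vocabSize, label, 0.1), probs)` gives the smoothed training loss for one target token. With a smoothed target, a model that puts all its probability mass on the true label is penalized, which is what pushes it to be less certain.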

Comparison on Training Speed
Four baseline toolkits and CytonMT train models using the hyperparameter settings in Table 3. The number of layers and the size of embeddings and hidden states vary, as large networks are often used in real-world applications to achieve higher accuracy at the cost of more running time. Table 4 presents the training speed of the different toolkits, measured in source tokens per second. The results show that the training speed of CytonMT is much higher than that of the baselines. OpenNMT is the fastest baseline, and CytonMT trains 64.5% to 110.8% faster than it. Moreover, CytonMT shows a consistent tendency to speed up more on larger networks.

13 https://github.com/tensorflow/tensor2tensor
14 https://github.com/marian-nmt/marian-examples/tree/master/wmt2017-uedin

Table 5 compares the BLEU of CytonMT with the reported results from systems of the same architecture (attention-based RNN encoder-decoder). BLEU is calculated on cased, tokenized text to be comparable to previous work (Sutskever et al., 2014; Luong et al., 2015b; Wu et al., 2016; Zhou et al., 2016). The settings of CytonMT on WMT14 follow [...]

Data    System                      Open source   BLEU
WMT14   [...] (Klein, 2017)         √             18.25
        OpenNMT (Klein, 2017)       √             19.34
        RNNsearch-LV (Jean, 2015)   √             19.4
        Deep-Att (Zhou, 2016)                     20.6
        Luong-NMT (Luong, 2015)     √             20.9
        BPE-Char (Chung, 2016)      √             21.5
        Seq2seq (Britz, 2017)       √             22.19
        CytonMT                     √             22.67
        GNMT (Wu, 2015)                           24.61
WMT17   Nematus (Sennrich, 2017)    √             [...]

The entropy of the development set is monitored every 1/12 epoch on WMT14 and every 1/36 epoch on WMT17, that is, approximately every 400K sentence pairs. If the entropy has not decreased by max(0.01 × learning rate, 0.001) in 12 checks, the learning rate is decayed by a factor of 0.7 and training restarts from the previous best model. The whole training procedure terminates when no improvement is made between two consecutive decays of the learning rate. The actual training took 28 epochs on WMT14 and 12 epochs on WMT17. Table 5 shows that CytonMT achieves competitive BLEU points on both benchmarks.
On WMT14, it is outperformed only by Google's production system (Wu et al., 2016), which is much larger in scale and much more demanding on hardware. On WMT17, it achieves the same level of performance as Marian, which is high among the WMT17 entries for a single system. Note that the state-of-the-art scores on these benchmarks have recently been pushed forward by novel network architectures such as those of Gehring et al. (2017) and Vaswani et al. (2017), among others.
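The learning-rate schedule used in these experiments can be sketched as a small state machine. This is one reading of the rule, not CytonMT's code: it assumes that the 12 checks are counted consecutively since the last improvement, and that training stops when a full patience window after a decay brings no improvement. The names `DecaySchedule`, `Action` and `observe` are hypothetical. Reloading the best model on a decay is left to the caller.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>

// What the training loop should do after a development-set check.
enum class Action { Continue, DecayAndRestart, Stop };

class DecaySchedule {
public:
    explicit DecaySchedule(double initialRate) : rate_(initialRate) {
        assert(initialRate > 0.0);
    }

    double rate() const { return rate_; }
    double best() const { return best_; }

    // Feed one development-set entropy measurement.
    Action observe(double devEntropy) {
        // Required improvement: max(0.01 x learning rate, 0.001).
        const double margin = std::max(0.01 * rate_, 0.001);
        if (devEntropy < best_ - margin) {
            best_ = devEntropy;
            stale_ = 0;
            improvedSinceDecay_ = true;
            return Action::Continue;
        }
        if (++stale_ < kPatience) return Action::Continue;

        // Patience exhausted. Stop if the previous decay brought no improvement.
        if (decayed_ && !improvedSinceDecay_) return Action::Stop;

        rate_ *= kDecay;  // caller restarts from the previous best model
        stale_ = 0;
        decayed_ = true;
        improvedSinceDecay_ = false;
        return Action::DecayAndRestart;
    }

private:
    static constexpr int kPatience = 12;   // checks without sufficient improvement
    static constexpr double kDecay = 0.7;  // multiplicative decay factor
    double rate_;
    double best_ = std::numeric_limits<double>::infinity();
    int stale_ = 0;
    bool decayed_ = false;
    bool improvedSinceDecay_ = false;
};
```

A training loop would call `observe` at each check (every 1/12 or 1/36 epoch in the experiments above), reload the best checkpoint on `Action::DecayAndRestart`, and exit on `Action::Stop`.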

Conclusion
This paper introduces CytonMT, an open-source NMT toolkit built from scratch using only C++ and NVIDIA's GPU-accelerated libraries. CytonMT speeds up training by 64.5% to 110.8% over the fastest baseline, and achieves competitive BLEU points on the WMT14 and WMT17 corpora. The source code of CytonMT is simple because of CytonLib, an open-source general-purpose neural network library contained in the toolkit. Therefore, CytonMT is an attractive alternative for the research community. We open-source this toolkit in the hope of benefiting the community and promoting the field. We look forward to hearing feedback from the community.
Future work on CytonMT will continue in two directions. One direction is to further optimize the code for GPUs, such as by supporting multiple GPUs. The problem we have faced is that GPUs have advanced very quickly in the last few years. For example, the microarchitectures of NVIDIA GPUs evolved twice during the development of CytonMT, from Maxwell to Pascal, and then to Volta. Therefore, we have not explored cutting-edge GPU techniques, as the coding effort may become outdated quickly. Multi-GPU machines are common now, so we plan to support them.
The other direction is to support the latest NMT architectures, such as ConvS2S (Gehring et al., 2017) and Transformer (Vaswani et al., 2017). In these architectures, recurrent structures are replaced by convolution or attention structures. Their high performance indicates that the new structures suit the translation task better, so we also plan to support them in the future.