Joey NMT: A Minimalist NMT Toolkit for Novices

We present Joey NMT, a minimalist neural machine translation toolkit based on PyTorch that is specifically designed for novices. Joey NMT provides many popular NMT features in a small and simple code base, so that novices can easily and quickly learn to use it and adapt it to their needs. Despite its focus on simplicity, Joey NMT supports classic architectures (RNNs, transformers), fast beam search, weight tying, and more, and achieves performance comparable to more complex toolkits on standard benchmarks. We evaluate the accessibility of our toolkit in a user study where novices with general knowledge of PyTorch and NMT, as well as experts, work through a self-contained Joey NMT tutorial, showing that novices perform almost as well as experts in a subsequent code quiz. Joey NMT is available at https://github.com/joeynmt/joeynmt.


Introduction
Since the first successes of neural machine translation (NMT), various research groups and industry labs have developed open source toolkits specialized for NMT, based on new open source deep learning platforms. While toolkits like OpenNMT (Klein et al., 2018), XNMT (Neubig et al., 2018) and Neural Monkey (Helcl and Libovický, 2017) aim at readability and extensibility of their codebase, their target group is researchers with a solid background in machine translation and deep learning, and with experience in navigating, understanding and handling large code bases. However, none of the existing NMT tools has been designed primarily for readability or accessibility for novices, nor has the quality and accessibility of such code been studied empirically. At the same time, it is a considerable challenge for novices to understand how NMT is implemented, which features each toolkit implements exactly, and which toolkit to choose in order to code their own project as quickly and simply as possible.
We present an NMT toolkit especially designed for novices, providing clean, well documented, and minimalistic code, that is yet of comparable quality to more complex codebases on standard benchmarks. Our approach is to identify the core features of NMT that have not changed over the last years, and to invest in documentation, simplicity and quality of the code. These core features include standard network architectures (RNN, transformer, different attention mechanisms, input feeding, configurable encoder/decoder bridge), standard learning techniques (dropout, learning rate scheduling, weight tying, early stopping criteria), and visualization/monitoring tools.
We evaluate our codebase in several ways: Firstly, we show that Joey NMT's comment-to-code ratio is almost twice as high as that of other toolkits, which are roughly 9-10 times larger. Secondly, we present an evaluation on standard benchmarks (WMT17, IWSLT) where we show that the core architectures implemented in Joey NMT achieve performance comparable to more complex state-of-the-art toolkits. Lastly, we conduct a user study where we test the code understanding of novices, i.e. students with basic knowledge about NMT and PyTorch, against expert coders. While novices, after having worked through a self-contained Joey NMT tutorial, needed more time to answer each question in an in-depth code quiz, they achieved only marginally lower scores than the experts. To our knowledge, this is the first user study on the accessibility of NMT toolkits.

NMT Architectures
This section formalizes the Joey NMT implementation of autoregressive recurrent and fully-attentional models.
In the following, a source sentence of length $l_x$ is represented by a sequence of one-hot encoded vectors $x_1, x_2, \dots, x_{l_x}$, one for each word. Analogously, a target sequence of length $l_y$ is represented by a sequence of one-hot encoded vectors $y_1, y_2, \dots, y_{l_y}$.
Encoder. The encoder RNN transforms the input sequence $x_1, \dots, x_{l_x}$ into a sequence of vectors $h_1, \dots, h_{l_x}$ with the help of the embeddings matrix $E_{\mathrm{src}}$ and a recurrent computation of states $h_i = \mathrm{RNN}(E_{\mathrm{src}}\,x_i, h_{i-1})$. The RNN consists of either GRU or LSTM units. For a bidirectional RNN, hidden states from both directions are concatenated to form $h_i$. The initial encoder hidden state $h_0$ is a vector of zeros. Multiple layers can be stacked by using each resulting output sequence $h_1, \dots, h_{l_x}$ as the input to the next RNN layer.
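The recurrence above can be sketched in a few lines of plain Python. This is an illustration only: it uses a simple Elman-style tanh cell rather than the GRU/LSTM cells Joey NMT actually uses (via PyTorch modules), and the function names `rnn_step` and `bidirectional_encode` are hypothetical:

```python
import math

def rnn_step(x, h, W_x, W_h):
    # One Elman-style recurrence: h_i = tanh(W_x x_i + W_h h_{i-1}).
    # Joey NMT uses GRU/LSTM cells, but the recurrence pattern is the same.
    return [math.tanh(sum(wx * xj for wx, xj in zip(row_x, x)) +
                      sum(wh * hj for wh, hj in zip(row_h, h)))
            for row_x, row_h in zip(W_x, W_h)]

def bidirectional_encode(embedded, W_x, W_h, hidden_size):
    h0 = [0.0] * hidden_size          # initial state h_0 is a zero vector
    fwd, h = [], h0
    for x in embedded:                # left-to-right pass
        h = rnn_step(x, h, W_x, W_h)
        fwd.append(h)
    bwd, h = [], h0
    for x in reversed(embedded):      # right-to-left pass
        h = rnn_step(x, h, W_x, W_h)
        bwd.append(h)
    bwd.reverse()
    # concatenate both directions at each position to form h_i
    return [f + b for f, b in zip(fwd, bwd)]
```

For simplicity the sketch shares weights between directions; in practice each direction (and each stacked layer) has its own parameters.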
Decoder. The decoder uses input feeding (Luong et al., 2015), where the attentional vector $\tilde{s}_{t-1}$ is concatenated with the representation of the previous word as input to the RNN. Decoder states are computed as $s_t = \mathrm{RNN}([E_{\mathrm{trg}}\,y_{t-1}; \tilde{s}_{t-1}], s_{t-1})$, with the attentional vector $\tilde{s}_t = \tanh(W[s_t; c_t])$ for a learned matrix $W$, combining the decoder state with the context vector $c_t$. The initial decoder state is configurable to be either a non-linear transformation of the last encoder state ("bridge"), identical to the last encoder state ("last"), or a vector of zeros.
Attention. The context vector $c_t$ is computed with an attention mechanism scoring the previous decoder state $s_{t-1}$ and each encoder state $h_i$: $\alpha_{ti} = \frac{\exp(\mathrm{score}(s_{t-1}, h_i))}{\sum_k \exp(\mathrm{score}(s_{t-1}, h_k))}$ and $c_t = \sum_i \alpha_{ti}\,h_i$, where the scoring function is a multi-layer perceptron (Bahdanau et al., 2015) or a bilinear transformation (Luong et al., 2015).
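The bilinear variant of this computation can be sketched in plain Python. This is a hedged illustration (the helper names `bilinear_score` and `attend` are hypothetical, not Joey NMT API), showing the score, the softmax normalization, and the weighted sum that forms the context vector:

```python
import math

def softmax(scores):
    m = max(scores)                    # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def bilinear_score(s, h, W):
    # Luong-style bilinear attention: score(s, h) = s^T W h
    Wh = [sum(w * hj for w, hj in zip(row, h)) for row in W]
    return sum(si * whi for si, whi in zip(s, Wh))

def attend(s_prev, enc_states, W):
    scores = [bilinear_score(s_prev, h, W) for h in enc_states]
    alphas = softmax(scores)           # attention weights alpha_{ti}
    dim = len(enc_states[0])
    # context vector c_t: attention-weighted sum of encoder states
    context = [sum(a * h[j] for a, h in zip(alphas, enc_states))
               for j in range(dim)]
    return context, alphas
```

The MLP (Bahdanau) variant only changes `bilinear_score`; the normalization and weighted sum stay the same.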
Output. The output layer produces a vector $o_t = W_{\mathrm{out}}\,\tilde{s}_t$, which contains a score for each token in the target vocabulary. Through a softmax transformation, these scores can be interpreted as a probability distribution over the target vocabulary $V$ that defines an index over target tokens $v_j$.

Transformer
Joey NMT implements the Transformer from Vaswani et al. (2017), with code based on The Annotated Transformer blog (Rush, 2018).
Encoder. Given an input sequence $x_1, \dots, x_{l_x}$, we look up the word embedding for each input word as $E_{\mathrm{src}}\,x_i$, add a position encoding to it, and stack the resulting sequence of word embeddings to form the matrix $X \in \mathbb{R}^{l_x \times d}$, where $l_x$ is the sentence length and $d$ the dimensionality of the embeddings.
We define the following learnable parameters: $A, B \in \mathbb{R}^{d \times d_a}$ and $C \in \mathbb{R}^{d \times d_o}$, where $d_a$ is the dimensionality of the attention (inner product) space and $d_o$ the output dimensionality. Transforming the input matrix with these parameters yields new word representations $H = \underbrace{\mathrm{softmax}(X A B^\top X^\top)}_{\text{self-attention}}\,X C$, which have been updated by attending to all other source words. Joey NMT implements multi-headed attention, where this transformation is computed $k$ times, once for each head, with different parameters $A$, $B$, $C$.
After computing all $k$ heads $H_1, \dots, H_k$ in parallel, we concatenate them and apply layer normalization and a final feed-forward layer to obtain the encoder representations $H^{(\mathrm{enc})}$. Multiple of these layers can be stacked by setting $X = H^{(\mathrm{enc})}$ and repeating the computation.
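A single head of the transformation above can be sketched in plain Python, mirroring the formula $H = \mathrm{softmax}(X A B^\top X^\top)\,X C$ as written (i.e., without the $1/\sqrt{d_a}$ scaling of Vaswani et al., which a real implementation would include). The function name `self_attention_head` is hypothetical:

```python
import math

def matmul(P, Q):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)]
            for row in P]

def transpose(P):
    return [list(col) for col in zip(*P)]

def row_softmax(M):
    out = []
    for row in M:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

def self_attention_head(X, A, B, C):
    # H = softmax((XA)(XB)^T) (XC): every position attends to all others.
    queries = matmul(X, A)                     # l_x x d_a
    keys = matmul(X, B)                        # l_x x d_a
    values = matmul(X, C)                      # l_x x d_o
    scores = matmul(queries, transpose(keys))  # l_x x l_x
    return matmul(row_softmax(scores), values)
```

Multi-headed attention would run this $k$ times with different $A$, $B$, $C$ and concatenate the results.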
Decoder. The Transformer decoder operates in a similar way to the encoder, but takes the stacked target embeddings $Y \in \mathbb{R}^{l_y \times d}$ as input. For each target position, attention to future input words is inhibited by setting those attention scores to $-\infty$ before the softmax. After obtaining $H = H + Y$, and before the feed-forward layer, we compute multi-headed attention again, but now between the intermediate decoder representations $H$ and the final encoder representations $H^{(\mathrm{enc})}$, yielding $H^{(\mathrm{dec})}$. We predict target words with $H^{(\mathrm{dec})}\,W_{\mathrm{out}}$.
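The masking step can be made concrete with a small sketch: future positions are set to $-\infty$ before the row-wise softmax, so each target position only receives probability mass from itself and earlier positions. The helper name `causal_softmax` is hypothetical:

```python
import math

def causal_softmax(scores):
    # Mask future positions with -inf, then apply a row-wise softmax, so
    # that target position t can only attend to positions <= t.
    l = len(scores)
    masked = [[scores[i][j] if j <= i else -math.inf for j in range(l)]
              for i in range(l)]
    out = []
    for row in masked:
        m = max(row)
        exps = [math.exp(v - m) for v in row]  # exp(-inf) == 0.0
        z = sum(exps)
        out.append([e / z for e in exps])
    return out
```

With uniform scores, row $t$ of the result spreads its weight evenly over positions $0 \dots t$ and assigns exactly zero to all later positions.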

Features
In the spirit of minimalism, we follow the 80/20 principle (Pareto, 1896) and aim to achieve 80% of the translation quality with 20% of a common toolkit's code size. For this purpose we identified the most common features (the bare necessities) in recent works and implementations. 2 These include standard architectures (see §2.1), label smoothing, dropout in multiple places, various attention mechanisms, input feeding, a configurable encoder/decoder bridge, learning rate scheduling, weight tying, early stopping criteria, beam search decoding, an interactive translation mode, visualization/monitoring of learning progress and attention, checkpoint averaging, and more.
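These features are exposed through a single YAML configuration file. The sketch below is illustrative only (all names and values are hypothetical); the exact keys and defaults should be checked against the configuration examples and documentation shipped with the toolkit:

```yaml
name: "my_experiment"          # experiment name (illustrative values throughout)

data:
    src: "de"                  # source language suffix
    trg: "en"                  # target language suffix
    train: "data/train"        # expects data/train.de and data/train.en
    dev: "data/dev"
    test: "data/test"

training:
    optimizer: "adam"
    learning_rate: 0.0002
    batch_size: 80
    epochs: 30
    model_dir: "models/my_experiment"

model:
    tied_embeddings: true      # weight tying between embedding layers
    encoder:
        type: "recurrent"      # or "transformer"
        hidden_size: 512
        num_layers: 1
    decoder:
        type: "recurrent"
        hidden_size: 512
        num_layers: 1
```

Training, tuning and decoding then only require pointing the toolkit's entry points at such a file, without touching the code.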

Documentation
The code itself is documented with doc-strings and in-line comments (especially for tensor shapes), and modules are tested with unit tests. The documentation website 3 contains installation instructions, a walk-through tutorial for training, tuning and testing an NMT model on a toy task 4 , an overview of code modules, and a detailed API documentation. In addition, we provide thorough answers to frequently asked questions regarding usage, configuration, debugging, implementation details and code extensions, and recommend resources such as data collections, PyTorch tutorials and NMT background material.

Code Complexity
In order to facilitate fast code comprehension and navigation (Wiedenbeck et al., 1999), Joey NMT objects have at most one level of inheritance. Table 1 compares Joey NMT with OpenNMT-py and XNMT (selected for their extensibility and thoroughness of documentation) in terms of code statistics, i.e. lines of Python code, lines of comments and number of files. 5 OpenNMT-py and XNMT have roughly 9-10x more lines of code, spread across 4-5x more files than Joey NMT. These toolkits cover more than the essential features for NMT (see §2.2), in particular other generation or classification tasks such as image captioning and language modeling. However, Joey NMT's comment-to-code ratio is almost twice as high, which we hope will give code readers better guidance in understanding and extending the code.
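Statistics of this kind can be approximated with a short script. This is not the tool used for Table 1, just a rough sketch: it counts only `#`-prefixed lines as comments and ignores docstrings and inline comments, so real ratios will differ:

```python
import os

def comment_code_stats(root):
    # Count comment lines vs. code lines across all .py files under root.
    code, comments, files = 0, 0, 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".py"):
                continue
            files += 1
            with open(os.path.join(dirpath, name), encoding="utf-8") as f:
                for line in f:
                    stripped = line.strip()
                    if not stripped:
                        continue          # skip blank lines entirely
                    if stripped.startswith("#"):
                        comments += 1
                    else:
                        code += 1
    return {"files": files, "code": code, "comments": comments,
            "ratio": comments / code if code else 0.0}
```

Running such a script over two repositories gives a quick, if crude, comparison of their comment-to-code ratios.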

Benchmarks
Our goal is to achieve performance comparable to other NMT toolkits, so that novices can start off with reliable benchmarks that are trusted by the community. This will allow them to build on Joey NMT for their research, should they want to do so. We expect novices to have limited resources available for training, i.e., not more than one GPU for a week, and therefore we focus on benchmarks that are within this scope. Pretrained models, data preparation scripts and configuration files for the following benchmarks will be made available at https://github.com/joeynmt/joeynmt.

WMT17. We use the settings of Hieber et al. (2018), using the exact same data, pre-processing, and evaluation with WMT17-compatible SacreBLEU scores (Post, 2018). 6 We consider the setting where toolkits are used out-of-the-box to train a Groundhog-like model (1-layer LSTMs, MLP attention), the 'best found' setting where Hieber et al. train each model using the best settings that they could find, and the Transformer base setting. 7 Table 2 shows that Joey NMT performs very well compared against other shallow, deep and Transformer models, despite its simple code base. 8

IWSLT14. This is a popular benchmark because of its relatively small size and therefore fast training time. We use the data, pre-processing, and word-based vocabulary of Wiseman and Rush (2016) and evaluate with SacreBLEU. 9 Table 3 shows that Joey NMT performs well here, with both its recurrent and its Transformer model. We also include BPE results for future reference.

User Study
The target group for Joey NMT are novices who will use NMT in a seminar project, a thesis, or an internship. Common tasks are to re-implement a paper, extend standard models by a small novel element, or to apply them to a new task. In order to evaluate how well novices understand Joey NMT, we conducted a user study comparing the code comprehension of novices and experts.

Study Design
Participants. The novice group consists of eight undergraduate students with a Computational Linguistics major who have all passed introductory courses in Python and Machine Learning, three of them also a course on Neural Networks. None of them had practical experience with training or implementing NMT models, nor with PyTorch, but two reported a theoretical understanding of NMT. They attended a 20h crash course introducing NMT and PyTorch basics. 10 Note that we did not teach Joey NMT explicitly in class; the students independently completed the Joey NMT tutorial.
As a control group (the "experts"), six graduate students with NMT as the topic of their thesis or research project participated in the study. In contrast to the novices, this group had a solid background in Deep Learning and NMT and practical experience with NMT. All of them had previously worked with NMT in PyTorch.
Conditions. Participation in the study was voluntary and not graded. Participants were not allowed to work in groups and had a maximum of 3h to complete the quiz. They had previously installed Joey NMT locally 11 and could browse the code with the tools of their choice (IDE or text editor). They were instructed to explore the Joey NMT code with the help of the quiz, were informed about the purpose of the study, and agreed to the use of their data in this study. Both groups of participants had to learn about Joey NMT in a self-guided manner, using the same tutorial, code, and documentation. The quiz was executed on the university's internal e-learning platform. Participants could jump between questions, review their answers before finally submitting all of them, and take breaks (without stopping the timer). Answers to the questions were published after all students had completed the test.
Question design. The questions are not designed to test the participants' prior knowledge of the topic, but to guide their exploration of the code. The questions are either free text, multiple choice or binary choice. There are three blocks of questions: 12

1. Usage of Joey NMT: nine questions on how to interpret logs, check whether models were saved, interpret attention matrices, pre-/post-process, and validate whether the model is doing what it is built for.
2. Configuring Joey NMT: four questions that make the users configure Joey NMT in such a way that it works for custom situations, e.g. with custom data, with a constant learning rate, or creating a model of a desired size.
Every question is awarded one point if answered correctly. Some questions require manual grading; most have a single correct answer. We record overall completion time and time per question. 13

Analysis
Total duration and score. Experts took on average 77 min to complete the quiz, novices 118 min, which is significantly slower (one-tailed t-test, p < 0.05). Experts achieved on average 82% of the total points, novices 66%. According to the t-test, the difference in total scores between groups is significant at p < 0.05. An ANOVA reveals that there is a significant difference in total duration and scores within the novice group, but not within the expert group.
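The test statistic behind such a comparison can be sketched in plain Python. The numbers below are illustrative, not the study's actual data, and only the Welch t statistic is computed; a full test would compare it against the critical value for the Welch-Satterthwaite degrees of freedom (e.g. via a statistics library):

```python
import math

def welch_t(sample_a, sample_b):
    # Welch's two-sample t statistic (robust to unequal variances):
    # t = (mean_a - mean_b) / sqrt(var_a/n_a + var_b/n_b)
    def mean(xs):
        return sum(xs) / len(xs)
    def var(xs):                       # unbiased sample variance
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    na, nb = len(sample_a), len(sample_b)
    se = math.sqrt(var(sample_a) / na + var(sample_b) / nb)
    return (mean(sample_a) - mean(sample_b)) / se
```

A large positive t for (novice durations, expert durations) indicates that novices took substantially longer on average, which the one-tailed test then assesses for significance.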
Per-question analysis. No question was incorrectly answered by everyone. Three questions (#6, #11, #18) were correctly answered by everyone; these appeared to be the easiest to answer and did not require deep understanding of the code. In addition, seven questions (#1, #13, #15, #21, #22, #28, #29) were correctly answered by all experts, but not by all novices; here the experts' NMT experience was useful for working with hyperparameters and peculiarities like special tokens. However, for only one question, regarding the differences in data processing between training and validation (#16), was the difference between average expert and novice score significant (at p < 0.05). Six questions (#9, #18, #21, #25, #31) show a significantly longer average duration for novices than experts. These questions concerned post-processing, initialization, batching, end conditions for training termination, and plotting, and required detailed code inspection.
LME. In order to analyze the dependence of scores and duration on particular questions and individual users, we performed a linear mixed effects (LME) analysis using the R library lme4 (Bates et al., 2015). Participants and questions are treated as random effects (categorical), the level of expertise as fixed effect (binary). Duration and score per question are response variables. 14 For both response variables the variability is higher depending on the question than on the user (6x higher for score, 2x higher for time). The intercepts of the fixed effects show that novices score on average 0.14 points less while taking 2.47 min longer on each question than experts. The impact of the fixed effect is significant at p < 0.05.

Findings
First of all, we observe that the design of the questions was engaging enough for the students because all participants invested at least 1h to complete the quiz voluntarily. The experts also reported having gained new insights into the code through the quiz. We found that there are significant differences between both groups: Most prominently, the novices needed more time to answer each question, but still succeeded in answering the majority of questions correctly. There are larger variances within the group of novices, because they had to develop individual strategies to explore the code and use the available resources (documentation, code search, IDE), while experts could in many cases rely on prior knowledge.

Conclusion
We presented Joey NMT, a toolkit for sequence-to-sequence learning designed for NMT novices. It implements the most common NMT features and achieves performance comparable to more complex toolkits, while being minimalist in its design and code structure. In comparison to other toolkits, it is smaller in size but more extensively documented. A user study on code accessibility confirmed that the code is comprehensibly written and structured. We hope that Joey NMT will ease the burden for novices to get started with NMT, and can serve as a basis for teaching.