Findings of the Second Workshop on Neural Machine Translation and Generation

This document describes the findings of the Second Workshop on Neural Machine Translation and Generation, held in concert with the annual conference of the Association for Computational Linguistics (ACL 2018). First, we summarize the research trends of papers presented in the proceedings, and note that there is particular interest in linguistic structure, domain adaptation, data augmentation, handling inadequate resources, and analysis of models. Second, we describe the results of the workshop’s shared task on efficient neural machine translation, where participants were tasked with creating MT systems that are both accurate and efficient.


Introduction
Neural sequence-to-sequence models (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015) are now a workhorse behind a wide variety of natural language processing tasks such as machine translation, generation, summarization, and simplification. The 2nd Workshop on Neural Machine Translation and Generation (WNMT 2018) provided a forum for research in applications of neural models to machine translation and other language generation tasks, including summarization (Rush et al., 2015), NLG from structured data (Wen et al., 2015), and dialog response generation (Vinyals and Le, 2015), among others. Overall, the workshop was held with two goals. First, it aimed to synthesize the current state of knowledge in neural machine translation and generation: this year we continued to encourage submissions that not only advance the state of the art through algorithmic advances, but also analyze and understand the current state of the art, pointing to future research directions. Toward this goal, we received a number of high-quality research contributions on the topics of linguistic structure, domain adaptation, data augmentation, handling inadequate resources, and analysis of models, which are summarized in Section 2.
Second, it aimed to expand the research horizons in NMT: based on panel discussions from the first workshop, we organized a shared task on "Efficient NMT". The aim of this task was to focus not only on accuracy, but also on memory and computational efficiency, which are paramount concerns in practical deployment settings. The workshop provided a set of baselines for the task and elicited contributions to help push forward the Pareto frontier of both efficiency and accuracy. The results of the shared task are summarized in Section 3.

Summary of Research Contributions
We published a call for long papers, extended abstracts describing preliminary work, and cross-submissions of papers submitted to other venues, with the goal of encouraging discussion and interaction with researchers from related areas. We received a total of 25 submissions, of which 16 were accepted, for an acceptance rate of 64%: eleven long papers, three extended abstracts, and two cross-submissions, all accepted after a process of double-blind reviewing.
Most of the papers addressed machine translation, but one paper examined abstractive summarization (Fan et al., 2018) and one examined automatic post-editing of translations (Unanue et al., 2018).
The workshop proceedings cover a wide range of phenomena relevant to sequence-to-sequence model research, with the contributions concentrated on the following topics:

Linguistic structure: How can we incorporate linguistic structure in neural MT or generation models? Contributions examined the effect of considering semantic role structure (Marcheggiani et al., 2018), latent structure (Bastings et al., 2018), and structured self-attention (Bisk and Tran, 2018).

Domain adaptation: Contributions examined regularization methods for adaptation (Khayrallah et al., 2018) and "extreme adaptation" to individual speakers (Michel and Neubig, 2018).

Data augmentation: A number of the contributed papers examined ways to augment data for more efficient training, including methods for considering multiple back-translations, iterative back-translation (Hoang et al., 2018b), bidirectional multilingual training (Niu et al., 2018), and document-level adaptation (Kothur et al., 2018).

Inadequate resources: Several contributions involved settings in which resources are insufficient, such as investigations of the impact of noise (Khayrallah and Koehn, 2018), missing data in multi-source settings (Nishimura et al., 2018), and one-shot learning (Pham et al., 2018).

Model analysis: There were also many methods that analyzed modeling and design decisions, including investigations of individual neuron contributions (Bau et al., 2018), parameter sharing, controlling output characteristics (Fan et al., 2018), and shared attention (Unanue et al., 2018).


Shared Task
Many shared tasks, such as those run by the Conference on Machine Translation (Bojar et al., 2017), aim to improve the state of the art for MT with respect to accuracy: finding the most accurate MT system regardless of computational cost. However, in production settings, the efficiency of the implementation is also extremely important. The shared task for WNMT (inspired by the "small NMT" task at the Workshop on Asian Translation (Nakazawa et al., 2017)) therefore focused on creating NMT systems that are not only accurate, but also efficient. Efficiency encompasses several concepts, including memory efficiency and computational efficiency. This task concerns itself with both, and we cover the details of the evaluation below.

Evaluation Measures
The first step of the evaluation was deciding what we wanted to measure. In the case of the shared task, we used metrics measuring several different aspects of system quality. These were measured both for systems run on CPU and for systems run on GPU.

Computational Efficiency Measures:
We measured the amount of time it takes to translate the entirety of the test set on CPU or GPU. Model-loading time was measured by having the model translate an empty file, then subtracting this from the total time to translate the test set file.
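This subtraction protocol can be sketched in Python (a minimal illustration; `translate` stands in for invoking a submitted system, and the function names are ours, not part of the actual task infrastructure):

```python
import time

def timed_run(translate, path):
    """Wall-clock time for one full run of the system over `path`."""
    start = time.perf_counter()
    translate(path)
    return time.perf_counter() - start

def translation_time(translate, test_set, empty_file):
    """Translate an empty file to capture model-loading overhead,
    then subtract it from the time taken on the full test set."""
    load_time = timed_run(translate, empty_file)   # model load only
    total_time = timed_run(translate, test_set)    # load + translation
    return total_time - load_time
```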

Memory Efficiency Measures:
We measured: (1) the size on disk of the model, (2) the number of parameters in the model, and (3) the peak consumption of the host memory and GPU memory.
These metrics were measured by having participants submit a container for the virtualization environment Docker, then measuring the usage of computation time and memory from outside the container. All evaluations were performed on dedicated instances on Amazon Web Services, specifically of type m5.large for CPU evaluation and p3.2xlarge (with an NVIDIA Tesla V100 GPU) for GPU evaluation.
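An outside-the-container peak-memory measurement of this kind can be sketched as follows, assuming a human-readable memory field (such as the `MEM USAGE` column of `docker stats`) is polled repeatedly while the container translates; the parsing and polling helpers are illustrative, not the actual evaluation scripts:

```python
import re

# Binary unit multipliers as used in Docker's human-readable output.
_UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30}

def parse_mem_usage(field):
    """Parse a memory field such as '512.3MiB' into bytes."""
    m = re.match(r"([\d.]+)\s*([KMG]?i?B)", field)
    value, unit = float(m.group(1)), m.group(2)
    return int(value * _UNITS[unit])

def peak_memory(samples):
    """Peak memory over repeated polls taken while the container runs."""
    return max(parse_mem_usage(s) for s in samples)
```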

Data
The data used was from the WMT 2014 English-German task (Bojar et al., 2014), using the preprocessed corpus provided by the Stanford NLP Group.

Baseline Systems
Two baseline systems were prepared:

Echo: Just send the input back to the output.

Submitted Systems
Four teams, Team Amun, Team Marian, Team OpenNMT, and Team NICT, submitted to the shared task, and we summarize each below. Before stepping into the details of each system, we first note general trends that all or many systems followed. The first general trend was a fast C++ decoder: Teams Amun, Marian, and NICT used the Amun or Marian decoders included in the Marian toolkit (https://marian-nmt.github.io), and Team OpenNMT used the C++ decoder for OpenNMT. The second trend was the use of data augmentation techniques allowing the systems to train on data other than the true references. Teams Amun, Marian, and OpenNMT all performed model distillation (Kim and Rush, 2016), where a larger teacher model is used to train a smaller student model, while Team NICT used back-translation, training the model on sampled translations from the target to the source. Finally, a common optimization was the use of lower-precision arithmetic: Teams Amun, Marian, and OpenNMT all used some variety of 16-bit, 8-bit, or integer calculation, along with the corresponding optimized CPU or GPU operations. These three improvements seem to be best practices for efficient NMT implementation.
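The distillation trend shared by three of the four teams can be sketched as follows (a minimal illustration of sequence-level knowledge distillation; `teacher_translate` stands in for decoding with the large teacher model):

```python
def distill_corpus(teacher_translate, source_sentences):
    """Sequence-level knowledge distillation (Kim and Rush, 2016):
    replace the human references with the teacher model's own outputs,
    giving the smaller student a simpler, more deterministic target
    distribution to fit."""
    return [(src, teacher_translate(src)) for src in source_sentences]
```

The student is then trained on these (source, teacher output) pairs exactly as it would be on the original parallel data.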

Team Amun
Team Amun's contribution (Hoang et al., 2018a) was based on the "Amun" decoder and consisted of a number of optimizations to improve translation speed on GPU. The first major unique contribution was a strategy of batching together computations from multiple hypotheses within beam search to exploit hardware parallelism. Another contribution was a methodology for creating a fused GPU kernel for the softmax calculation, which computes all of the operations within the softmax (e.g., max, exponentiation, and sum) in a single kernel. In the end they submitted two systems, Amun-FastGRU and Amun-MLSTM, which use GRU (Cho et al., 2014) and multiplicative LSTM (Krause et al., 2016) units, respectively.
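To make the fusion concrete, these are the sub-operations that are ordinarily launched as separate kernels but that the fused kernel performs in one launch (a NumPy sketch of the math only, not the GPU code):

```python
import numpy as np

def softmax(logits):
    """The softmax sub-operations that Team Amun's kernel fuses:
    max (for numerical stability), exponentiation, and sum."""
    m = logits.max(axis=-1, keepdims=True)    # 1) max
    e = np.exp(logits - m)                    # 2) exponentiation
    return e / e.sum(axis=-1, keepdims=True)  # 3) sum and normalize
```

Fusing these steps avoids writing the intermediate arrays to GPU memory between kernel launches.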

Team Marian
Team Marian's system (Junczys-Dowmunt et al., 2018) used the Marian C++ decoder and concentrated on new optimizations for the CPU. The team distilled a large self-attentional model into two types of "student" models: a smaller self-attentional model using average attention networks (Zhang et al., 2018), a new higher-speed version of the original Transformer model (Vaswani et al., 2017); and a standard RNN-based decoder. They also introduced an auto-tuning approach that chooses the most efficient of multiple matrix multiplication implementations in the current context, then uses that implementation going forward. This resulted in the Marian-TinyRNN system, which uses an RNN-based model, and the Marian-Trans-Small-AAN, Marian-Trans-Base-AAN, Marian-Trans-Big, and Marian-Trans-Big-int8 systems, which use different varieties and sizes of self-attentional models.
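The auto-tuning idea can be sketched as follows (illustrative Python; Marian does this in C++ for its matrix-product routines, and all names here are ours, not Marian's API):

```python
import time

# Cache mapping a context key (e.g. a matrix shape) to the winning
# implementation chosen on the first call.
_choice_cache = {}

def autotuned_call(impls, key, *args):
    """Time each candidate implementation once on a representative
    call, cache the fastest for this `key`, and reuse it thereafter."""
    if key not in _choice_cache:
        timings = []
        for impl in impls:
            start = time.perf_counter()
            impl(*args)                      # trial run
            timings.append((time.perf_counter() - start, impl))
        _choice_cache[key] = min(timings, key=lambda t: t[0])[1]
    return _choice_cache[key](*args)
```

The one-time tuning cost is amortized over the many identically shaped multiplications performed during decoding.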

Team OpenNMT
Team OpenNMT (Senellart et al., 2018) built a system based on the OpenNMT toolkit. The model was based on a large self-attentional teacher model distilled into a smaller, faster RNN-based model. The system also used a version of vocabulary selection (Shi and Knight, 2017), and a method of increasing the size of the encoder while decreasing the size of the decoder to improve the efficiency of beam search. They submitted two systems, OpenNMT-Small and OpenNMT-Tiny, which were two differently sized versions of this model.
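Vocabulary selection of this kind can be sketched as follows (a hypothetical helper; the alignment table and word lists are stand-ins for whatever resources the system actually uses):

```python
def select_vocab(source_batch, align_table, common_words):
    """Restrict the output softmax to frequent target words plus
    likely translations (from a precomputed alignment table) of the
    source words appearing in this batch."""
    vocab = set(common_words)
    for sentence in source_batch:
        for word in sentence.split():
            vocab.update(align_table.get(word, ()))
    return sorted(vocab)
```

The decoder then computes the softmax only over this reduced vocabulary, shrinking the output-layer matrix multiplication considerably.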

Team NICT
Team NICT's contribution to the shared task centered on using self-training as a way to improve NMT accuracy without changing the model architecture. Specifically, they used a method of randomly sampling pseudo-source sentences from a back-translation model and used these to augment the data set to increase coverage. They tested two basic architectures for the actual translation model, a recurrent neural network-based model trained using OpenNMT and a self-attentional model trained using Marian, finally submitting the self-attentional model trained using Marian as their sole entry, NICT.
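The sampling-based augmentation can be sketched as follows (a minimal illustration; `sample_source` stands in for drawing one translation from the reverse, target-to-source model):

```python
def augment_by_sampling(target_sentences, sample_source, k=2):
    """Self-training via sampled back-translation: pair each
    monolingual target sentence with k pseudo-source sentences
    sampled from a reverse model, and add the pairs to the
    training data to increase coverage."""
    pairs = []
    for tgt in target_sentences:
        for _ in range(k):
            pairs.append((sample_source(tgt), tgt))
    return pairs
```

Sampling (rather than taking the single best back-translation) yields more varied pseudo-sources for the same target side.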

Shared Task Results
A brief summary of the results of the shared task (for newstest2015) can be found in Figure 1, while full results tables for all of the systems can be found in Appendix A. From this figure we can glean a number of observations. First, and encouragingly, all of the submitted systems handily beat the baseline system in both speed and accuracy.
Second, observing the speed/accuracy curves, we can see that Team Marian's submissions tended to carve out the Pareto frontier, indicating that the large number of optimizations that went into creating the system paid off in aggregate. Interestingly, on GPU, RNN-based systems carved out the faster but less accurate part of the Pareto curve, while on CPU self-attentional models were largely found to be more effective. None of the submissions consisted of a Transformer-style model so small that it under-performed the RNN models, but a further examination of where the curves cross (if they do) would be an interesting direction for future shared tasks.
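The Pareto frontier referred to here can be computed from (time, BLEU) points as follows (a sketch; the values in the test below are made up, not the shared-task scores):

```python
def pareto_frontier(systems):
    """Keep the systems not dominated by any other system that is
    at least as fast and at least as accurate (and differs on at
    least one axis). `systems` is a list of (seconds, bleu) pairs."""
    frontier = []
    for t, b in systems:
        dominated = any(
            t2 <= t and b2 >= b and (t2, b2) != (t, b)
            for t2, b2 in systems
        )
        if not dominated:
            frontier.append((t, b))
    return sorted(frontier)
```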
Next, considering memory usage, we can see again that the submissions from the Marian team tend to be the most efficient. One exception is the extremely small memory system OpenNMT-Tiny, which achieves significantly lower translation accuracies, but fits in a mere 220MB of memory on the CPU.
In this first iteration of the task, we attempted to establish best practices and strong baselines upon which to build efficient test-time methods for NMT. One characteristic of this first iteration was that the basic model architectures used were relatively standard, with the valuable contributions lying in solid engineering work and best practices in neural network optimization, such as low-precision calculation and model distillation. With these contributions, we believe we now have very strong baselines upon which future iterations of the task can build, examining novel architectures or methods for further optimizing training speed. We will also examine other considerations, such as efficient adaptation to new training data, or the latency from receiving a sentence to translating it.

Conclusion
This paper summarized the results of the Second Workshop on Neural Machine Translation and Generation, where we saw a number of research advances, particularly in the area of efficiency in neural MT through submissions to the shared task. The workshop series will continue next year, and will continue to push forward the state of the art on these topics toward faster, more accurate, more flexible, and more widely applicable neural MT and generation systems.

A Full Shared Task Results
For completeness, in this section we add tables of the full shared task results. These include the full size of the image file for the translation system (Table 1), the comparison between compute time and evaluation scores on CPU (Table 2) and GPU (Table 3), and the comparison between memory and evaluation scores on CPU (Table 4) and GPU (Table 5).