Findings of the Third Workshop on Neural Generation and Translation

This document describes the findings of the Third Workshop on Neural Generation and Translation, held in conjunction with the Conference on Empirical Methods in Natural Language Processing (EMNLP 2019). First, we summarize the research trends of papers presented in the proceedings. Second, we describe the results of the two shared tasks: 1) efficient neural machine translation (NMT), where participants were tasked with creating NMT systems that are both accurate and efficient, and 2) document generation and translation (DGT), where participants were tasked with developing systems that generate summaries from structured data, potentially with assistance from text in another language.


Introduction
Neural sequence-to-sequence models (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015) are now a workhorse behind a wide variety of natural language processing tasks such as machine translation, generation, summarization and simplification. The 3rd Workshop on Neural Machine Translation and Generation (WNGT 2019) provided a forum for research in applications of neural models to machine translation and other language generation tasks (including summarization (Rush et al., 2015), NLG from structured data (Wen et al., 2015), dialog response generation (Vinyals and Le, 2015), among others). Overall, the workshop was held with two goals.
First, it aimed to synthesize the current state of knowledge in neural machine translation and generation: this year we continued to encourage submissions that not only advance the state of the art through algorithmic advances, but also analyze and understand the current state of the art, pointing to future research directions. Towards this goal, we received a number of high-quality research contributions on both workshop topics, as summarized in Section 2.
Second, the workshop aimed to expand the research horizons in NMT: we continued to organize the Efficient NMT task which encouraged participants to develop not only accurate but computationally efficient systems. In addition, we organized a new shared task on "Document-level Generation and Translation", which aims to push forward document-level generation technology and contrast the methods for different types of inputs. The results of the shared task are summarized in Sections 3 and 4.

Summary of Research Contributions
We published a call for long papers, extended abstracts for preliminary work, and cross-submissions of papers submitted to other venues. The goal was to encourage discussion and interaction with researchers from related areas.
We received a total of 68 submissions, from which we accepted 36. There were three cross-submissions, seven long abstracts and 26 full papers. There were also seven system submission papers. All research papers were reviewed twice through a double-blind review process that avoided conflicts of interest.
There were 22 papers with an application to generation of some kind and 14 on translation, a shift from previous workshops, where the focus was on machine translation. The caliber of the publications was very high, and the number of accepted papers more than doubled from last year (16 accepted papers from 25 submissions).

Shared Task: Document-level Generation and Translation
The first shared task at the workshop focused on document-level generation and translation. Many recent attempts at NLG have focused on sentence-level generation (Lebret et al., 2016; Gardent et al., 2017). However, real-world language generation applications tend to involve generation of much larger amounts of text, such as dialogues or multi-sentence summaries. The inputs to NLG systems also vary from structured data such as tables (Lebret et al., 2016) or graphs (Wang et al., 2018), to textual data (Nallapati et al., 2016). Because of such differences in data and domain, comparison between different methods has been nontrivial. This task aims to (1) push forward such document-level generation technology by providing a testbed, and (2) examine the differences between generation based on different types of inputs, including both structured data and translations in another language.
In particular, we provided the following six tracks, which focus on different input/output requirements:
• NLG (Data → En, Data → De): Generate document summaries in a target language given only structured data.
• MT (De ↔ En): Translate documents in the source language to the target language.
• MT+NLG (Data+En → De, Data+De → En): Generate document summaries given the structured data and the summaries in another language.

Evaluation Measures
We employ standard evaluation metrics for data-to-text NLG and MT along two axes:

Textual Accuracy Measures:
We used BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) as measures of textual accuracy compared to reference summaries.

Content Accuracy Measures:
We evaluate the fidelity of the generated content to the input data using relation generation (RG), content selection (CS), and content ordering (CO) metrics (Wiseman et al., 2017). The content accuracy measures were calculated using information extraction models trained on the respective target languages. Following Wiseman et al. (2017), we ensembled six information extraction models (three CNN-based, three LSTM-based) with different random seeds for each language.
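To make the content accuracy measures concrete, the following is a minimal sketch of how CS and CO could be computed once records have been extracted from a generated summary and from its reference. It is not the evaluation code used in the shared task: the `Record` tuple layout and all function names are hypothetical, the extraction step itself is omitted, and CO is assumed to be reported as one minus a length-normalized Damerau-Levenshtein distance (the optimal string alignment variant is used here for brevity).

```python
from typing import List, Tuple

# Hypothetical record layout: (entity, value, record type).
Record = Tuple[str, str, str]


def content_selection(pred: List[Record], gold: List[Record]) -> Tuple[float, float]:
    """Precision and recall of unique predicted records against unique gold records."""
    pred_set, gold_set = set(pred), set(gold)
    overlap = len(pred_set & gold_set)
    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(gold_set) if gold_set else 0.0
    return precision, recall


def osa_distance(a: List[Record], b: List[Record]) -> int:
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            # Deletion, insertion, substitution.
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Transposition of two adjacent records.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def content_ordering(pred: List[Record], gold: List[Record]) -> float:
    """Similarity in [0, 1]: one minus the length-normalized edit distance."""
    longest = max(len(pred), len(gold))
    if longest == 0:
        return 1.0
    return 1.0 - osa_distance(pred, gold) / longest
```

RG would be computed analogously, by checking extracted records against the input table rather than against the reference summary.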

Data
Due to the lack of a document-level parallel corpus which provides structured data for each instance, we took an approach of translating an existing NLG dataset. Specifically, we used a subset of the RotoWire dataset (Wiseman et al., 2017) and obtained professional German translations, which are sentence-aligned to the original English articles. The obtained parallel dataset is called the RotoWire English-German dataset, and consists of box score tables, an English article, and its German translation for each instance. Table 1 shows the statistics of the obtained dataset. We used the test split from this dataset to calculate the evaluation measures for all the tracks.
Systems which follow these resource constraints are marked constrained, otherwise unconstrained. Results are indicated by the initials (C/U).

Baseline Systems
Considering the difference in inputs for MT and NLG tracks, we prepared two baselines for respective tracks.

Submitted Systems
Five teams, Team EdiNLG, Team FIT-Monash, Team Microsoft, Team Naver Labs Europe, and Team SYSTRAN-AI, participated in the shared task. We note the common trends across many teams and discuss the systems of individual teams below. On the MT tracks, all the teams adopted a variant of the Transformer (Vaswani et al., 2017) as a sequence transduction model and trained on corpora with different data-augmentation methods. Trained systems were then fine-tuned on in-domain data including our RotoWire English-German dataset. The focus of data augmentation was two-fold: 1) acquiring in-domain data and 2) utilizing document boundaries from existing corpora. Most teams applied back-translation on various sources, including NewsCrawl and the original RotoWire dataset, for this purpose. The NLG tracks exhibited a similar trend in sequence model selection, except for Team EdiNLG, who employed an LSTM.

Team EdiNLG
Team EdiNLG built their NLG system upon (Puduppully et al., 2019) by extending it to further allow copying from the table in addition to generating from vocabulary and the content plan. Additionally, they included features indicating the win/loss team records and team rank in terms of points for each player. They trained the NLG model for both languages together, using a shared BPE vocabulary obtained from target game summaries and by prefixing the target text with the target language indicator.
For the MT and MT+NLG tracks, they mined in-domain data by extracting basketball-related texts from Newscrawl when one of the following conditions is met: 1) player names from the RotoWire English-German training set appear, 2) two NBA team names appear in the same document, or 3) "NBA" appears in titles. This resulted in 4.3 and 1.1 million monolingual sentences for English and German, respectively. The obtained sentences were then back-translated and added to the training corpora. They submitted their system EdiNLG in all six tracks.
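The three mining conditions can be read as a simple document filter. The sketch below is only an illustration under stated assumptions: the function name, its arguments, and the use of plain substring matching are all hypothetical, and the team's actual matching and preprocessing are not described here.

```python
from typing import Set


def is_basketball_document(title: str, body: str,
                           player_names: Set[str], team_names: Set[str]) -> bool:
    """Return True if a news document meets any of the three in-domain conditions."""
    # Condition 1: a known player name appears in the document.
    if any(name in body for name in player_names):
        return True
    # Condition 2: at least two distinct NBA team names appear in the same document.
    teams_found = {team for team in team_names if team in body}
    if len(teams_found) >= 2:
        return True
    # Condition 3: "NBA" appears in the title.
    return "NBA" in title
```

Sentences from documents that pass such a filter would then be back-translated to form synthetic parallel data.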

Team FIT-Monash
Team FIT-Monash built a document-level NMT system (Maruf et al., 2019) and participated in MT tracks. The document-level model was initialized with a pre-trained sentence-level NMT model on news domain parallel corpora. Two strategies for composing document-level context were proposed: flat and hierarchical attention. Flat attention was applied on all the sentences, while hierarchical attention was computed at sentence and word-level in a hierarchical manner. Sparse attention was applied at sentence-level in order to identify key sentences that are important for translating the current sentence.
To train a document-level model, the team focused on corpora that have document boundaries, including News Commentary, Rapid, and the RotoWire dataset. Notably, greedy decoding was employed due to computational cost. The submitted system is an ensemble of three runs indicated as FIT-Monash.

Team Microsoft
Team Microsoft (MS) developed a Transformer-based NLG system which consists of two sequence-to-sequence models. The two-step method was inspired by the approach of Puduppully et al. (2019), where the first model is a recurrent pointer network that selects encoded records, and the second model takes the selected content representation as input and generates summaries. The proposed model (MS-End-to-End) learned both models at the same time with a combined loss function. Additionally, they investigated the use of pre-trained language models for the NLG track. Specifically, they fine-tuned GPT-2 (Radford et al., 2019) on concatenated pairs of (template, target) summaries, constructing templates following Wiseman et al. (2017). The two sequences are concatenated around a special token which indicates "rewrite". At decoding time, they adopted nucleus sampling (Holtzman et al., 2019) to enhance the generation quality. Different thresholds for nucleus sampling were investigated, and two systems with different thresholds were submitted: MS-GPT-50 and MS-GPT-90, where the numbers refer to Top-p thresholds.
The English summaries generated by these systems were then translated with the MT system described below. This makes Team Microsoft's German NLG (Data → De) submission unconstrained, due to the use of parallel data beyond the RotoWire English-German dataset.
As for the MT model, a pre-trained system from Xia et al. (2019) was fine-tuned on the RotoWire English-German dataset, as well as on back-translated sentences from the original RotoWire dataset for the English-to-German track. Back-translation of sentences selected from Newscrawl according to their similarity to RotoWire data (Moore and Lewis, 2010) was attempted but did not lead to improvement. The resulting system is shown as MS in the MT track reports.
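For readers unfamiliar with nucleus (Top-p) sampling, the following minimal sketch shows the core idea on a toy next-token distribution: keep the smallest set of most probable tokens whose cumulative probability reaches the threshold, renormalize, and sample from that set. The function names and the toy vocabulary are hypothetical, and the sketch is independent of GPT-2 or any particular toolkit.

```python
import random
from typing import Dict, Optional


def nucleus_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most probable tokens until their mass reaches top_p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break
    return {token: p / mass for token, p in kept}


def nucleus_sample(probs: Dict[str, float], top_p: float,
                   rng: Optional[random.Random] = None) -> str:
    """Draw one token from the renormalized nucleus."""
    rng = rng or random.Random()
    nucleus = nucleus_filter(probs, top_p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy distribution (made-up numbers): a lower threshold keeps fewer candidates.
toy = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
small_nucleus = nucleus_filter(toy, 0.5)
large_nucleus = nucleus_filter(toy, 0.9)
```

A lower threshold restricts sampling to fewer high-probability tokens, while a higher threshold admits more of the tail, which is one way a single hyperparameter can change generation behavior.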

Team Naver Labs Europe
Team Naver Labs Europe (NLE) took the approach of transferring the model from MT to NLG. They first trained a sentence-level MT model by iteratively extending the training set from the WMT19 parallel data and RotoWire English-German dataset to back-translated Newscrawl data. The best sentence-level model was then fine-tuned at document-level, followed by fine-tuning on the RotoWire English-German dataset (constrained NLE) and additionally on the back-translated original RotoWire dataset (unconstrained NLE).
To fully leverage the MT model, input record values prefixed with special tokens for record types were sequentially fed in a specific order. Combined with the target summary, the pair of record representations and the target summaries formed data for a sequence-to-sequence model. They fine-tuned their document-level MT model on these NLG data which included the original RotoWire and RotoWire English-German dataset.
The team tackled the MT+NLG tracks by concatenating source language documents and the sequence of records as inputs. To encourage the model to use record information more, they randomly masked a certain portion of tokens in the source language documents.
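The input construction described above can be sketched as follows. This is an illustration only: the special-token format (`<TYPE>`), the `<mask>` symbol, the record ordering, and the masking rate are all assumptions, not details confirmed by the team's description.

```python
import random
from typing import List, Tuple


def linearize_records(records: List[Tuple[str, str]]) -> List[str]:
    """Flatten (record_type, value) pairs into tokens: <TYPE> value tokens ..."""
    tokens: List[str] = []
    for record_type, value in records:
        tokens.append(f"<{record_type}>")   # special token marking the record type
        tokens.extend(value.split())
    return tokens


def mask_tokens(tokens: List[str], mask_rate: float, rng: random.Random,
                mask_token: str = "<mask>") -> List[str]:
    """Replace each token with mask_token independently with probability mask_rate."""
    return [mask_token if rng.random() < mask_rate else tok for tok in tokens]


def build_mt_nlg_input(source_doc: str, records: List[Tuple[str, str]],
                       mask_rate: float, rng: random.Random) -> List[str]:
    """Concatenate the (partially masked) source document with the linearized records."""
    masked_doc = mask_tokens(source_doc.split(), mask_rate, rng)
    return masked_doc + linearize_records(records)
```

With `mask_rate=0.0` the source document is passed through unchanged; larger rates hide more of the source text, so the model has to rely more on the records to produce the summary.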

Team SYSTRAN-AI
Team SYSTRAN-AI developed their NLG system based on the Transformer (Vaswani et al., 2017). The model takes as input each record from the box score featurized into embeddings and decodes the summary. In addition, they introduced a content selection objective where the model learns to predict whether or not each record is used in the summary, comprising a sequence of binary classification decisions.
Furthermore, they performed data augmentation by synthesizing records whose numeric values were randomly changed in a way that does not change the win/loss relation and remains within a sane range. The synthesized records were used to generate a summary to obtain new (record, summary) pairs, which were added to the training data. To bias the model toward generating more records, they further fine-tuned their model on a subset of training examples which contain N (= 16) records in the summary. The submitted systems are SYSTRAN-AI and SYSTRAN-AI-Detok, which differ in tokenization.
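The record synthesis step can be illustrated with a small sketch that perturbs team scores while keeping the winner unchanged and the values within a plausible range. Everything specific here is an assumption made for illustration: the field names `home_pts` and `away_pts`, the perturbation size, the range bounds, and the rejection-sampling strategy.

```python
import random
from typing import Dict


def perturb_scores(game: Dict[str, int], max_delta: int, rng: random.Random,
                   low: int = 60, high: int = 160) -> Dict[str, int]:
    """Return a copy of `game` with both scores randomly shifted, keeping the winner."""
    home_won = game["home_pts"] > game["away_pts"]
    for _ in range(100):  # retry until all constraints hold
        home = game["home_pts"] + rng.randint(-max_delta, max_delta)
        away = game["away_pts"] + rng.randint(-max_delta, max_delta)
        in_range = low <= home <= high and low <= away <= high
        if in_range and home != away and (home > away) == home_won:
            return {**game, "home_pts": home, "away_pts": away}
    return dict(game)  # fall back to the unchanged record


# Example with made-up scores.
original = {"home_pts": 103, "away_pts": 98}
synthetic = perturb_scores(original, max_delta=10, rng=random.Random(0))
```

A full pipeline would perturb player-level statistics consistently with the team totals, which this sketch does not attempt.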

Results
We show the results for each track in Tables 2 through 7. In the NLG and MT+NLG tasks, we report BLEU and ROUGE (F1) for textual accuracy, and RG (P), CS (P, R), and CO (DLD) for content accuracy. For the MT tasks, we only report BLEU. We summarize the shared task results for each track below.
In the NLG (En) track, all the participants encouragingly submitted systems outperforming a strong baseline by Puduppully et al. (2019). We observed an apparent difference between the constrained and unconstrained settings. Team NLE's approach showed that pre-training of the document-level generation model on news corpora is effective even if the source input differs (German text vs. linearized records). Among constrained systems, it is worth noting that all the systems but Team EdiNLG used the Transformer, but the results did not show noticeable improvements compared to EdiNLG. It was also shown that generation using pre-trained language models is sensitive to how the sampling is performed; the results of MS-GPT-90 and MS-GPT-50 differ only in the nucleus sampling hyperparameter, which led to significant differences in every evaluation measure.
The NLG (De) track posed a greater challenge than its English counterpart due to the lack of training data. The scores generally dropped compared to the NLG (En) results. To alleviate the lack of German data, most teams developed systems under the unconstrained setting by utilizing MT resources and models. Notably, Team NLE achieved performance similar to the constrained system results on NLG (En). However, Team EdiNLG achieved similar performance under the constrained setting by fully leveraging the original RotoWire dataset through vocabulary sharing.
In MT tracks, we see the same trend that the system under unconstrained setting (NLE) outperformed all the systems under the constrained setting. The improvement observed in the unconstrained setting came from fine-tuning on the back-translated original RotoWire dataset, which offers purely in-domain parallel documents.
While the results are not directly comparable due to the different hyperparameters used in the systems, fine-tuning on in-domain parallel sentences was shown to be effective (FairSeq-19 vs. others). When incorporating document-level data, it was shown that document-level models (NLE, FIT-Monash, MS) perform better than sentence-level models (EdiNLG), even if a sentence-level model is trained on document-aware corpora. For the MT+NLG tracks, interestingly, no team found the input structured data useful, and all teams thus applied their MT models to the MT+NLG tracks. Compared to the baseline (FairSeq-19), fine-tuning on in-domain data resulted in better performance overall, as seen in the results of Team MS and NLE.
The key difference between Team MS and NLE is the existence of document-level fine-tuning, where Team NLE outperformed in terms of textual accuracy (BLEU and ROUGE) overall, in both target languages.

Shared Task: Efficient NMT
The second shared task at the workshop focused on efficient neural machine translation. Many MT shared tasks, such as the ones run by the Conference on Machine Translation (Bojar et al., 2017), aim to improve the state of the art for MT with respect to accuracy: finding the most accurate MT system regardless of computational cost. However, in production settings, the efficiency of the implementation is also extremely important. The efficiency shared task for WNGT (inspired by the "small NMT" task at the Workshop on Asian Translation (Nakazawa et al., 2017)) was focused on creating systems for NMT that are not only accurate, but also efficient. Efficiency can include a number of concepts, including memory efficiency and computational efficiency. This task concerns itself with both, and we cover the detail of the evaluation below.

Evaluation Measures
We used metrics to measure several different aspects connected to how good the system is. These were measured for systems that were run on CPU, and also systems that were run on GPU.

Computational Efficiency Measures:
We measured the amount of time it takes to translate the entirety of the test set on CPU or GPU. Time for loading models was measured by having the model translate an empty file, then subtracting this from the total time to translate the test set file.
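The time measurement amounts to a simple subtraction, sketched below. The helper names are hypothetical, and the commands passed to `wall_time` would be whatever invocation runs a submitted system; this is not the organizers' measurement script.

```python
import subprocess
import time
from typing import List


def wall_time(command: List[str]) -> float:
    """Run a command to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(command, check=True)
    return time.perf_counter() - start


def translation_time(total_seconds: float, empty_run_seconds: float) -> float:
    """Time attributable to translation: the full run minus the load-only run.

    `empty_run_seconds` is the time taken to "translate" an empty file, which
    approximates model loading and other start-up costs.
    """
    return max(0.0, total_seconds - empty_run_seconds)
```

For example, a run that takes 125 seconds on the test set and 5 seconds on an empty file would be credited with 120 seconds of translation time.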

Memory Efficiency Measures:
We measured: (1) the size on disk of the model, (2) the number of parameters in the model, and (3) the peak consumption of the host memory and GPU memory.
These metrics were measured by having participants submit a container for the virtualization environment Docker, then measuring from outside the container the usage of computation time and memory. All evaluations were performed on dedicated instances on Amazon Web Services, specifically of type m5.large for CPU evaluation, and p3.2xlarge (with an NVIDIA Tesla V100 GPU) for GPU evaluation.

Data
The data used was from the WMT 2014 English-German task (Bojar et al., 2014), using the preprocessed corpus provided by the Stanford NLP Group. Use of other data was prohibited.

Baseline Systems
Two baseline systems were prepared: Echo: Just send the input back to the output.

Submitted Systems
Two teams, Team Marian and Team Notre Dame, submitted to the shared task, and we summarize each below.

Team Marian
Team Marian's submission (Kim et al., 2019) was based on their submission to the shared task the previous year, consisting of Transformer models optimized in a number of ways (Junczys-Dowmunt et al., 2018). This year, they made a number of improvements. Improvements were made to teacher-student training by (1) creating more data for teacher-student training using backward, then forward translation, (2) using multiple teachers to generate better distilled data for training student models. In addition, there were modeling improvements made by (1) replacing simple averaging in the attention layer with an efficiently calculable "simple recurrent unit," (2) parameter tying between decoder layers, which reduces memory usage and improves cache locality on the CPU. Finally, a number of CPU-specific optimizations were performed, most notably including 8-bit matrix multiplication along with a flexible quantization scheme.
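To give a rough sense of what 8-bit arithmetic involves, the sketch below uses plain symmetric per-vector quantization: values are mapped to integers in [-127, 127] with a single scale factor, the dot product is accumulated in integers, and the result is rescaled. This is a generic textbook scheme chosen for illustration, not the quantization scheme used in the Marian submission, and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one symmetric scale factor."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def int8_dot(qa: List[int], scale_a: float, qb: List[int], scale_b: float) -> float:
    """Integer dot product followed by a single floating-point rescale."""
    accumulator = sum(x * y for x, y in zip(qa, qb))  # integer accumulation
    return accumulator * scale_a * scale_b


# Made-up vectors: the quantized result approximates the exact dot product.
a = [1.0, -0.5, 0.25]
b = [0.5, 0.5, -1.0]
qa, sa = quantize_int8(a)
qb, sb = quantize_int8(b)
approx = int8_dot(qa, sa, qb, sb)
exact = sum(x * y for x, y in zip(a, b))
```

The appeal of such schemes on CPU is that the inner loop operates on small integers, with the floating-point work reduced to one multiplication per output value.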

Team Notre Dame
Team Notre Dame's submission (Murray et al., 2019) focused mainly on memory efficiency. They did so by performing "Auto-sizing" of the transformer network, applying block-sparse regularization to remove columns and rows from the parameter matrices.
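As an illustration of how a group-sparsity regularizer can remove whole rows of a parameter matrix, the sketch below applies the proximal operator of the row-wise ℓ2,1 norm: each row's norm is shrunk by a threshold, and rows whose norm falls below the threshold become exactly zero and can be pruned. Whether this matches the regularizer and training procedure used in the team's submission is an assumption; the function names and the threshold value are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def prox_l21_rows(weights: Matrix, threshold: float) -> Matrix:
    """Shrink each row's L2 norm by `threshold`; rows below it are zeroed out."""
    result: Matrix = []
    for row in weights:
        norm = math.sqrt(sum(w * w for w in row))
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        result.append([w * scale for w in row])
    return result


def prune_zero_rows(weights: Matrix) -> Matrix:
    """Drop rows that are entirely zero, reducing the matrix size."""
    return [row for row in weights if any(w != 0.0 for w in row)]


# Made-up weights: the second row has a small norm and is removed.
W = [[3.0, 4.0], [0.3, 0.4]]
W_shrunk = prox_l21_rows(W, threshold=1.0)
W_pruned = prune_zero_rows(W_shrunk)
```

Applying the same operation to columns (for example, by transposing the matrix first) removes columns instead of rows.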

Results
A brief summary of the results of the shared task (for newstest2015) can be found in Figure 1, while full results tables for all of the systems can be found in Appendix A. From this figure we can glean a number of observations.
For the CPU systems, all submissions from the Marian team clearly push the Pareto frontier in terms of both time and memory. In addition, the Marian systems also demonstrated a good tradeoff between time/memory and accuracy.
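The notion of a Pareto frontier used here can be made precise with a short sketch: a system is on the frontier if no other system is at least as good on both accuracy and time and strictly better on one of them. The `System` type and the numbers below are made up for illustration and are not results from the shared task.

```python
from typing import List, NamedTuple


class System(NamedTuple):
    name: str
    bleu: float      # higher is better
    seconds: float   # lower is better


def dominates(a: System, b: System) -> bool:
    """True if `a` is at least as good as `b` on both axes and strictly better on one."""
    return (a.bleu >= b.bleu and a.seconds <= b.seconds
            and (a.bleu > b.bleu or a.seconds < b.seconds))


def pareto_frontier(systems: List[System]) -> List[System]:
    """Return the systems that are not dominated by any other system."""
    return [s for s in systems if not any(dominates(o, s) for o in systems)]


# Hypothetical systems with made-up scores.
candidates = [
    System("A", 27.0, 100.0),
    System("B", 26.0, 50.0),
    System("C", 25.0, 80.0),
    System("D", 27.0, 120.0),
]
frontier = pareto_frontier(candidates)
```

The same definition extends to memory by adding a third field and a corresponding comparison in `dominates`.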
For the GPU systems, all systems from the Marian team also outperformed other systems in terms of the speed-accuracy trade-off. However, the Marian systems had larger memory consumption than both Notre Dame systems, which specifically optimized for memory efficiency, and all previous systems. Interestingly, each GPU system by the Marian team uses almost the same amount of GPU memory, as shown in Table 12 and Figure 2(b). This may indicate that the internal framework of the Marian system first reserves a sufficiently large amount of GPU memory, then uses the acquired memory as needed by the translation processes.
On the other hand, we can see that the Notre Dame systems occupy only a minimal amount of GPU memory, as the systems use much smaller