Findings of the E2E NLG Challenge

This paper summarises the experimental setup and results of the first shared task on end-to-end (E2E) natural language generation (NLG) in spoken dialogue systems. Recent end-to-end generation systems are promising since they reduce the need for data annotation. However, they are currently limited to small, delexicalised datasets. The E2E NLG shared task aims to assess whether these novel approaches can generate better-quality output by learning from a dataset containing higher lexical richness, syntactic complexity and diverse discourse phenomena. We compare 62 systems submitted by 17 institutions, covering a wide range of approaches, including machine learning architectures – with the majority implementing sequence-to-sequence models (seq2seq) – as well as systems based on grammatical rules and templates.


Introduction
This paper summarises the first shared task on end-to-end (E2E) natural language generation (NLG) in spoken dialogue systems (SDSs). Shared tasks have become an established way of pushing research boundaries in the field of natural language processing, with NLG benchmarking tasks running since 2007 (Belz and Gatt, 2007). This task is novel in that it poses new challenges for recent end-to-end, data-driven NLG systems for SDSs which jointly learn sentence planning and surface realisation and do not require costly semantic alignment between meaning representations (MRs) and the corresponding natural language reference texts, e.g. (Dušek and Jurčíček, 2015; Wen et al., 2015b; Mei et al., 2016; Wen et al., 2016; Sharma et al., 2016; Dušek and Jurčíček, 2016a; Lampouras and Vlachos, 2016). 1

So far, end-to-end approaches to NLG are limited to small, delexicalised datasets, e.g. BAGEL (Mairesse et al., 2010), SF Hotels/Restaurants (Wen et al., 2015b), or RoboCup (Chen and Mooney, 2008), whereas the E2E shared task is based on a new crowdsourced dataset of 50k instances in the restaurant domain, which is about 10 times larger and also more complex than previous datasets. For the shared challenge, we received 62 system submissions by 17 institutions from 11 countries, with about 1/3 of these submissions coming from industry. We assess the submitted systems by comparing them to a challenging baseline using automatic as well as human evaluation. We consider this level of participation an unexpected success, which underlines the timeliness of this task. 2 While there are previous studies comparing a limited number of end-to-end NLG approaches (Novikova et al., 2017a; Wiseman et al., 2017; Gardent et al., 2017), this is the first research to evaluate novel end-to-end generation at scale and using human assessment.

Data Collection Procedure
In order to maximise the chances for data-driven end-to-end systems to produce high-quality output, we aim to provide training data in high quality and large quantity. To collect data in large enough quantity, we use crowdsourcing with automatic quality checks. 1 We use MRs consisting of an unordered set of attributes and their values and collect multiple corresponding natural language texts (references) - utterances consisting of one or several sentences. An example MR-reference pair is shown in Figure 1; Table 1 lists all the attributes in our domain. In contrast to previous work (Mairesse et al., 2010; Wen et al., 2015a; Dušek and Jurčíček, 2016), we use different modalities of meaning representation for data collection: textual/logical and pictorial MRs. The textual/logical MRs (see Figure 1) take the form of a sequence with attribute-value pairs provided in a random order. The pictorial MRs (see Figure 2) are semi-automatically generated pictures with a combination of icons corresponding to the appropriate attributes. The icons are located on a background showing a map of a city, thus allowing the representation of the meaning of the attributes area and near (cf. Table 1).

[Figure 1: Example of an MR-reference pair. Reference: "The Wrestlers offers competitive prices, but isn't highly rated by customers."]

1 Note that, as opposed to the "classical" definition of NLG (Reiter and Dale, 2000; Gatt and Krahmer, 2018), generation for dialogue systems does not involve content selection and its sentence planning stage may be less complex.
2 In comparison, the well-established Conference on Machine Translation (WMT'17, running since 2006) received submissions from 31 institutions to a total of 8 tasks (Bojar et al., 2017a).
In a pre-study (Novikova et al., 2016), we showed that pictorial MRs provide similar collection speed and utterance length, but are less likely to prime the crowd workers in their lexical choices. Utterances produced using pictorial MRs were considered to be more informative, natural and better phrased. However, while pictorial MRs provide more variety in the utterances, this also introduces noise. Therefore, we decided to use pictorial MRs to collect 20% of the dataset.
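As an illustration, a textual/logical MR can be parsed into an attribute-value mapping. The exact serialisation shown here (comma-separated attribute[value] pairs) is an assumption based on the description above; attribute names are illustrative.

```python
import re

def parse_mr(mr: str) -> dict:
    """Parse a textual/logical MR into an attribute -> value mapping.

    Assumes comma-separated attribute[value] pairs, as described above;
    the order of pairs is irrelevant since the attribute set is unordered.
    """
    return dict(re.findall(r"(\w+)\[([^\]]*)\]", mr))

mr = "name[The Wrestlers], priceRange[cheap], customerRating[low]"
print(parse_mr(mr))
# → {'name': 'The Wrestlers', 'priceRange': 'cheap', 'customerRating': 'low'}
```

Because the attribute-value pairs arrive in random order, a set-like representation such as this dictionary is a natural input format for generation systems.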
Our crowd workers were asked to verbalise all information from the MR; however, they were not

Data Statistics
The resulting dataset (Novikova et al., 2017b) contains over 50k references for 6k distinct MRs (cf. Table 2), which is 10 times bigger than previous sets in comparable domains (BAGEL, SF Hotels/Restaurants, RoboCup). The dataset contains more human references per MR (8.27 on average), which should make it more suitable for data-driven approaches. However, it is also more challenging as it uses a larger number of sentences in references (up to 6 compared to 1-2 in other sets) and more attributes in MRs.
For the E2E challenge, we split the data into training, development and test sets (in a roughly 82-9-9 ratio).

[Table 3: System architectures are coded with colours and symbols: ♥ seq2seq, ♦ other data-driven, ♣ rule-based, ♠ template-based. Unless noted otherwise, all data-driven systems use partial delexicalisation (with name and near attributes replaced by placeholders during generation); template- and rule-based systems delexicalise all attributes. In addition to word-overlap metrics (see Section 4.1), we show the average of all metrics' values normalised into the 0-1 range, and use this to sort the list. Any values higher than the baseline are marked in bold.]
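The partial delexicalisation used by the data-driven systems, with name and near attribute values replaced by placeholders during generation, can be sketched as follows. The placeholder tokens and the dictionary-based MR format are illustrative assumptions, not the systems' actual implementations.

```python
# Attributes whose values are swapped for placeholders (partial delexicalisation).
DELEX_ATTRS = ("name", "near")

def delexicalise(text: str, mr: dict) -> str:
    """Replace delexicalised attribute values with placeholder tokens."""
    for attr in DELEX_ATTRS:
        if attr in mr:
            text = text.replace(mr[attr], f"X-{attr}")
    return text

def relexicalise(text: str, mr: dict) -> str:
    """Fill the original attribute values back into generated output."""
    for attr in DELEX_ATTRS:
        if attr in mr:
            text = text.replace(f"X-{attr}", mr[attr])
    return text

mr = {"name": "The Wrestlers", "near": "Café Adriatic"}
ref = "The Wrestlers is located near Café Adriatic."
delex = delexicalise(ref, mr)
print(delex)                          # → X-name is located near X-near.
print(relexicalise(delex, mr) == ref) # → True
```

Training on the placeholder tokens lets a system generalise across restaurant names it has never seen, which is why delexicalisation is so common on small datasets.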

Systems in the Competition
The interest in the E2E Challenge has by far exceeded our expectations. We received a total of 62 submitted systems by 17 institutions (about 1/3 from industry). In accordance with ethical considerations for NLP shared tasks (Parra Escartín et al., 2017), we allowed researchers to withdraw or anonymise their results if their system performed in the lower 50% of submissions. Two groups from industry withdrew their submissions and one group asked to be anonymised after obtaining automatic evaluation results. We asked each of the remaining teams to identify 1-2 primary systems, which resulted in 20 systems by 14 groups. Each primary system is described in a short technical paper (available on the challenge website) and was evaluated both by automatic metrics and human judges (see Section 4). We compare the primary systems to a baseline based on the TGEN generator (Dušek and Jurčíček, 2016a). An overview of all primary systems is given in Table 3, including the main features of their architectures. A more detailed description and comparison of the systems will be given in a follow-up publication.

Word-overlap Metrics
Following previous shared tasks in related fields (Bojar et al., 2017b; Chen et al., 2015), we selected a range of metrics measuring word overlap between system output and references, including BLEU, NIST, METEOR, ROUGE-L, and CIDEr. The TGEN baseline proved hard to beat: no primary system was able to beat it in terms of all metrics; only SLUG comes very close. Several other systems beat TGEN in one of the metrics but not in others. 3 Overall, seq2seq-based systems show the best word-based metric values, followed by SHEFF1, a data-driven system based on imitation learning. Template-based and rule-based systems mostly score at the bottom of the list.

[Table 4: TrueSkill measurements of quality (left) and naturalness (right). Each entry shows the significance cluster number, the TrueSkill value, the range of ranks where the system falls in 95% of cases or more, and the system name. Significance clusters are separated by a dotted line. Systems are colour-coded by architecture as in Table 3.]
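As an illustration of the word-overlap metrics, the clipped (modified) n-gram precision at the core of BLEU can be computed as follows. This is a simplified sketch: full BLEU additionally combines several n-gram orders with a geometric mean and applies a brevity penalty.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as
    often as it appears in any single reference (the core of BLEU)."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

cand = "the wrestlers offers cheap food".split()
refs = ["the wrestlers offers competitive prices".split()]
print(modified_precision(cand, refs, 1))  # → 0.6 (3 of 5 unigrams match)
```

Having multiple references per MR, as in this dataset, directly benefits such metrics: a candidate n-gram is rewarded if it matches any of the 8.27 references available on average.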

Results of Human Evaluation
However, the human evaluation study provides a different picture. Rank-based Magnitude Estimation (RankME) was used for evaluation, where crowd workers compared outputs of 5 systems for the same MR and assigned scores on a continuous scale. We evaluated output naturalness and overall quality in separate tasks; for naturalness evaluation, the source MR was not shown to workers. We collected 4,239 5-way rankings for naturalness and 2,979 for quality, comparing 9.5 systems per MR on average. The final evaluation results were produced using the TrueSkill algorithm (Herbrich et al., 2006; Sakaguchi et al., 2014), with partial ordering into significance clusters computed using bootstrap resampling (Bojar et al., 2013, 2014; Sakaguchi et al., 2014). For both criteria, this resulted in 5 clusters of systems with significantly different performance and showed a clear winner: SHEFF2 for naturalness and SLUG for quality. The 2nd clusters are quite large for both criteria: they contain 13 and 11 systems, respectively, and both include the baseline TGEN system.
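The challenge used the TrueSkill algorithm for ranking; as a simplified illustration of how bootstrap resampling supports significance clustering, the following sketch computes 95% bootstrap confidence intervals for the mean scores of two hypothetical systems and checks whether the intervals overlap. All scores are invented for illustration and the method here is a plain mean-score bootstrap, not TrueSkill itself.

```python
import random

def bootstrap_ci(scores, n_boot=10000, alpha=0.05, seed=0):
    """95% bootstrap confidence interval for the mean of `scores`:
    resample with replacement n_boot times and take the empirical
    alpha/2 and 1-alpha/2 quantiles of the resampled means."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choice(scores) for _ in scores) / len(scores)
        for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Hypothetical RankME-style quality scores for two systems on the same MRs
sys_a = [78, 82, 75, 90, 85, 79, 88, 81, 77, 84]
sys_b = [60, 65, 58, 70, 62, 66, 59, 64, 61, 63]
ci_a, ci_b = bootstrap_ci(sys_a), bootstrap_ci(sys_b)
# Non-overlapping intervals suggest the two systems belong to
# different significance clusters; overlapping ones do not.
print(ci_a, ci_b, ci_a[0] > ci_b[1])
```

When many systems' intervals mutually overlap, they cannot be separated with confidence, which is how large clusters such as the 13-system 2nd cluster arise.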
The results indicate that seq2seq systems dominate in terms of naturalness of their outputs, while most systems of other architectures score lower. The bottom cluster is filled with template-based systems. The results for quality are, however, more mixed in terms of architectures, with none of them clearly prevailing. Here, seq2seq systems with reranking based on checking output correctness score high while seq2seq systems with no such mechanism occupy the bottom two clusters.
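A crude illustration of the correctness-based reranking mentioned above: among several candidate outputs, prefer the one that mentions the most MR slot values. Real systems use learned semantic classifiers or alignment-based checks; this string-matching stand-in, and the example MR, are illustrative assumptions.

```python
def rerank(candidates, mr):
    """Pick the candidate output that covers the most MR slot values.

    A naive substring check stands in for a learned semantic reranker;
    ties are broken by candidate order.
    """
    def coverage(text):
        return sum(1 for value in mr.values() if value.lower() in text.lower())
    return max(candidates, key=coverage)

mr = {"name": "The Wrestlers", "food": "Italian", "area": "riverside"}
candidates = [
    "The Wrestlers serves Italian food.",                      # misses area
    "The Wrestlers serves Italian food in the riverside area.",# covers all 3
]
print(rerank(candidates, mr))  # picks the candidate covering all three slots
```

Filtering candidates this way penalises outputs that drop or hallucinate slots, which matches the observation that seq2seq systems with such a mechanism score high on quality while those without it fall to the bottom clusters.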

Conclusion
This paper presents the first shared task on end-to-end NLG. The aim of this challenge was to assess the capabilities of recent end-to-end, fully data-driven NLG systems, which can be trained from pairs of input MRs and texts, without the need for fine-grained semantic alignments. We created a novel dataset for the challenge, which is an order of magnitude bigger than any previous publicly available dataset for task-oriented NLG. We received 62 system submissions by 17 participating institutions, with a wide range of architectures, from seq2seq-based models to simple templates.
We evaluated all the entries in terms of five different automatic metrics; 20 primary submissions (as identified by the 14 remaining participants) underwent crowdsourced human evaluation of naturalness and overall quality of their outputs.
We consider the SLUG system (Juraska et al., 2018), a seq2seq-based ensemble system with a reranker, the overall winner of the E2E NLG challenge: it scores best in human evaluations of quality, is placed in the 2nd-best cluster of systems in terms of naturalness, and reaches high automatic scores. While the SHEFF2 system, a vanilla seq2seq setup, won in terms of naturalness, it scores poorly on overall quality, placing in the last cluster. The TGEN baseline system turned out hard to beat: it ranked highest on average in word-overlap-based automatic metrics and placed in the 2nd cluster in both quality and naturalness.
The results in general show the seq2seq architecture as very capable, but requiring reranking to reach high-quality results. On the other hand, while rule-based approaches are not able to beat data-driven systems in terms of automatic metrics, they often perform comparably or better in human evaluations.
We are preparing a detailed analysis of the results and a release of all system outputs with user ratings on the challenge website. 4 We plan to use this data for experiments in automatic NLG output quality estimation (Specia et al., 2010; Dušek et al., 2017), where the large amount of data obtained in this challenge allows a wider range of experiments than previously possible.