E2E NLG Challenge: Neural Models vs. Templates

The E2E NLG Challenge is a shared task on generating restaurant descriptions from sets of key-value pairs. This paper describes the results of our participation in the challenge. We develop a simple, yet effective neural encoder-decoder model which produces fluent restaurant descriptions and outperforms a strong baseline. We further analyze the data provided by the organizers and conclude that the task can also be approached with a template-based model developed in just a few hours.


Introduction
Natural Language Generation (NLG) is the task of generating natural language utterances from structured data representations. The E2E NLG Challenge is a shared task which focuses on end-to-end data-driven NLG methods. These approaches attract a lot of attention because they jointly learn textual structure and surface realization patterns from non-aligned data, which significantly reduces the amount of human annotation effort needed for NLG corpus creation (Wen et al., 2015; Mei et al., 2016; Dušek and Jurcicek, 2016; Lampouras and Vlachos, 2016).
The contribution of our submission to the challenge can be summarized as follows: (1) we show how exploiting data properties allows us to design more accurate neural architectures; (2) we develop a simple template-based system which achieves performance comparable to neural approaches.

Task Definition
The organizers of the shared task provided a crowdsourced data set of 50k instances in the restaurant domain (Novikova et al., 2017b). Each training instance consists of a dialogue act-based meaning representation (MR) and up to 16 references in natural language ( Figure 1). The data was collected using pictorial representations as stimuli, with the intention of creating more natural, informative and diverse human references compared to the ones one might generate from textual inputs.
The task is to generate an utterance from a given MR which is both similar to human-generated reference texts and highly rated by humans. Similarity is assessed using standard evaluation metrics: BLEU (Papineni et al., 2002), NIST (Doddington, 2002), METEOR (Lavie and Agarwal, 2007), ROUGE-L (Lin, 2004), and CIDEr (Vedantam et al., 2015). However, the final assessment is done via human ratings obtained using a mixture of crowdsourcing and expert annotations. To facilitate a better assessment of the proposed approaches, the organizing team used TGen (Dušek and Jurcicek, 2016), one of the recent E2E data-driven systems, as a baseline. It is a sequence-to-sequence neural system with attention (Bahdanau et al., 2014). TGen uses beam search for decoding and incorporates a reranker over the top-k outputs which penalizes candidates that do not verbalize all attributes from the input MR. TGen also includes a delexicalization module which deals with sparsely occurring MR attributes (name, near) by mapping such values to placeholder tokens when preprocessing the input data, and substituting the placeholders with actual values as a post-processing step.

Our Approach
This section describes two different approaches we developed for the shared task.
The first one (Model-D, for "data-driven") is an encoder-decoder neural system which is similar to TGen, but uses a more efficient encoder module. The second approach is a simple template-based model (Model-T, for "template-based") which we developed based on the results of the data analysis.

Model-D
Model-D was motivated by two important properties of the E2E NLG Challenge data:
• a fixed number of unique MR attributes
• low diversity of the lexical instantiations of the MR attribute values
Each input MR contains a fixed number of unique attributes (between three and eight), which allows us to associate a positional id with each attribute and omit the corresponding attribute names (or keys) from the encoding procedure. This shortens the encoded sequence, presumably making the learning procedure easier for the encoder. It also unifies the lengths of input MRs, which allows us to use simpler and more efficient neural networks that are not sequential and process the input in one step (e.g. multilayer perceptron (MLP) networks).
One might argue that using an MLP would be complicated by the fact that neither the number of active (non-null value) input MR keys nor the number of tokens constituting the corresponding values is fixed. For example, the MR key price may have a one-token value of "low" or a lengthier "less than £10". However, realizations of the MR attribute values exhibit low variability: six out of eight keys have fewer than seven unique values, while the remaining two keys (name, near) denote named entities and thus are easy to delexicalize. This allows us to treat each value as a single token, even if it consists of multiple words (e.g. "more than £30", "Fast food"). Each predicted output is a textual description of a restaurant. As reported by Novikova et al. (2017b), the average number of words per reference is 20.1. We used the value of 50 as a cut-off threshold, filtering out training instances with long restaurant descriptions.
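The preprocessing steps described above can be sketched as follows. This is our own minimal reconstruction: the function names, placeholder format, and MR representation as a Python dict are assumptions, not the released challenge code.

```python
# Sketch of the preprocessing described above (function and placeholder
# names are ours, not from the actual system).

NAMED_ENTITY_KEYS = {"name", "near"}  # sparse keys mapped to placeholders

def delexicalize(mr, reference):
    """Replace named-entity values with placeholder tokens in both the
    MR and the reference text."""
    mr = dict(mr)
    for key in NAMED_ENTITY_KEYS:
        if key in mr:
            placeholder = "<{}>".format(key)
            reference = reference.replace(mr[key], placeholder)
            mr[key] = placeholder
    return mr, reference

def keep_instance(reference, cutoff=50):
    """Filter out training instances whose reference is too long."""
    return len(reference.split()) <= cutoff

# Multi-word values such as "less than £10" are kept as single symbols
# on the MR side, so the encoder sees one token per attribute value.
mr = {"name": "Wrestlers", "priceRange": "less than £10"}
ref = "Wrestlers offers meals for less than £10."
d_mr, d_ref = delexicalize(mr, ref)
```

The placeholder is substituted back with the actual value in a post-processing step, mirroring TGen's delexicalization module.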
The overall architecture of our model is shown in Figure 2. The system is an encoder-decoder model (Cho et al., 2014b; Sutskever et al., 2014) consisting of three main modules: an embedding matrix, one dense hidden layer as an encoder, and an RNN-based decoder with gated recurrent units (GRU) (Cho et al., 2014a).
Let us first describe the input specifications of the model. We will use the following MR instance as a running example:

name[Wrestlers] customerRating[high] familyFriendly[yes]
Considering the alphabetic ordering of the MR key names, we can assign positional ids to the keys as shown in Table 1. The remaining five keys are assigned dummy PAD values.
Given an instance of a (MR, text) pair, we decompose the MR into eight components (mr j in Figure 2), each corresponding to a value for a unique MR key, and add an end-of-sentence symbol (EOS) to denote the end of the encoded sequence. Each component is represented as a high-dimensional embedding vector. Each embedding vector is further mapped to a dense hidden representation via an affine transformation followed by a ReLU (Nair and Hinton, 2010) function. These hidden representations are further used by the decoder network, which in our case is a unidirectional GRU-based RNN with an attention module (Bahdanau et al., 2014). The decoder is initialized with an average of the encoder outputs.
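A toy numeric sketch of the encoder side may help make the data flow concrete. Dimensions, weights, and names here are illustrative only; the real model learns these parameters and would use a deep learning framework rather than plain Python.

```python
# Toy sketch of the encoder: embedding lookup, affine + ReLU per slot,
# and decoder initialization as the average of encoder outputs.
import random

EMB_DIM, HID_DIM = 4, 3
random.seed(0)

table = {}  # lazily populated embedding matrix stand-in

def embed(token):
    """Look up (or lazily create) an embedding vector for a token."""
    if token not in table:
        table[token] = [random.uniform(-0.1, 0.1) for _ in range(EMB_DIM)]
    return table[token]

def relu(x):
    return [max(0.0, v) for v in x]

def affine(vec, W, b):
    return [sum(w * v for w, v in zip(row, vec)) + bi
            for row, bi in zip(W, b)]

W = [[random.uniform(-0.1, 0.1) for _ in range(EMB_DIM)]
     for _ in range(HID_DIM)]
b = [0.0] * HID_DIM

def encode(value_sequence):
    """One affine + ReLU step per MR slot; no recurrence is needed."""
    return [relu(affine(embed(tok), W, b)) for tok in value_sequence]

def decoder_init(encoder_outputs):
    """Initial decoder state: the average of the encoder outputs."""
    n = len(encoder_outputs)
    return [sum(h[i] for h in encoder_outputs) / n for i in range(HID_DIM)]

outs = encode(["Wrestlers", "high", "yes", "PAD", "EOS"])
h0 = decoder_init(outs)
```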
The decoder generates a sequence of tokens, one token at a time, until it predicts the EOS symbol.

Model-T
Examining the training data, we noticed that many human references follow a common sentence pattern into which MR attribute values are inserted; here [X] denotes the value of the MR key X. This pattern became a central template of Model-T. Not all MR attribute verbalizations fit into this schema. For example, the key-value pair customerRating[3 out of 5] would be verbalized as ". . . has a 3 out of 5 customer rating", which is not the best phrasing one can come up with. A better way to describe it is ". . . has a customer rating of 3 out of 5". We incorporate such variations into Model-T with a set of simple rules which modify the general template depending on the specific value of an MR attribute.
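A hypothetical reconstruction of the template mechanics is sketched below. The exact wording of the real templates is not reproduced here; the component phrasings and the customerRating rule are our own illustrations of the approach.

```python
# Hypothetical sketch of Model-T: per-attribute template components,
# activated only when the attribute is present in the input MR, plus a
# value-specific rule for customerRating (template wording is ours).

COMPONENTS = {
    "eatType": "is a {eatType}",
    "food": "serving {food} food",
    "customerRating": None,  # handled by a value-specific rule below
    "near": "near {near}",
}

def rating_rule(value):
    """'3 out of 5' reads better as 'a customer rating of 3 out of 5'
    than as 'a 3 out of 5 customer rating'."""
    if "out of" in value:
        return "with a customer rating of " + value
    return "with a {} customer rating".format(value)

def realize(mr):
    parts = [mr["name"]]
    for key, template in COMPONENTS.items():
        if key not in mr:
            continue  # component stays inactive if the attribute is absent
        if key == "customerRating":
            parts.append(rating_rule(mr[key]))
        else:
            parts.append(template.format(**{key: mr[key]}))
    return " ".join(parts) + "."

text = realize({"name": "Wrestlers", "eatType": "pub",
                "customerRating": "3 out of 5"})
```

A deterministic system of this kind needs no training, which is how it could be developed in a few hours.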
As mentioned in Section 2.1, each input can have up to eight MR attributes. In order to account for this fact, we decomposed the general template into smaller components, each corresponding to a specific MR attribute mentioned in the input. We further developed a set of rules which activate each component depending on whether the corresponding MR attribute is part of the input. Finally, we also add a simple post-processing step to handle specific punctuation and article choices.

Table 2 shows the results of the metric evaluation of the systems. Since we were provided with only one TGen prediction file and a single performance score, we cannot compare score distributions, and statistical significance tests are not meaningful given the non-deterministic nature of neural approaches and their randomized training procedures (Reimers and Gurevych, 2017). In order to facilitate a fair comparison with other competing systems, we report the mean development score of Model-D (averaged across twenty runs with different random seeds) and the performance variance for each automatic metric. Model-T is a deterministic system, so it is sufficient to report the results of a single run.

Metric Evaluation
The results show that Model-D outperforms TGen as measured by all five metrics, although the performance variance is quite large. Model-T clearly scores below both TGen and Model-D. This is expected: Model-T is not data-driven, and hence the texts it generates might differ from the reference outputs.
Previous studies have shown that widely used automatic metrics (including those used in this competition) lack strong correlation with human judgments (Scott and Moore, 2007; Reiter and Belz, 2009; Novikova et al., 2017a). We decided to examine the predictions made by the compared systems on one hundred randomly sampled input instances, focusing on generic errors that make sense to look out for in many NLG scenarios. Table 3 shows the error types and the number of mistakes found in each of the prediction files. The error types should be self-explanatory (sample predictions are given in Appendix A.2).
As far as the (subjective) manual analysis goes, Model-T outputs descriptions with the best linguistic quality. Table 3 shows that the predictions of the template-based system contain no errors: this is because we incorporated our notion of grammaticality into the templates' definition, which allowed Model-T to avoid the errors found in the predictions of the other two approaches.
The majority of errors made by Model-D are either wrong verbalizations of the input MR values or punctuation mistakes. The latter are limited to missing a comma between clauses or not finishing a sentence with a full stop. An easy solution to this problem is adding a post-processing step which fixes punctuation mistakes before outputting the text.
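Such a post-processing step could look like the sketch below. The rules shown are our own illustrations of the two error classes mentioned above; a real fixer would need a larger pattern inventory.

```python
# Sketch of the suggested punctuation post-processing step (the
# specific rules are ours, chosen to match the two observed errors).
import re

def fix_punctuation(text):
    # Finish the description with a full stop if one is missing.
    text = text.rstrip()
    if not text.endswith("."):
        text += "."
    # Insert a missing comma before an 'and it is ...' clause
    # (one illustrative rule for the missing-comma error class).
    text = re.sub(r"(\w) and it is", r"\1, and it is", text)
    return text

out = fix_punctuation("Wrestlers serves French food and it is kid friendly")
```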
Crucially, Model-D often drops or modifies some MR attribute values. According to the organizers, 40% of the data by design contain either additional or omitted information on the output side (Novikova et al., 2017b): crowd workers were allowed to not lexicalize attribute values which they deemed unimportant. We decided to examine the training data and find out if the discrepancies of Model-D were learned from the data.

Training Data Analysis
The E2E NLG Challenge is based on noisy data, but the organizers provided multiple references per instance to account for this noise. In order to better understand the behaviour of Model-D and determine whether it took advantage of having multiple references per training instance, we randomly sampled a hundred training instances and manually checked their linguistic quality. Table 4 shows the most common errors we encountered.
Most mistakes come from ungrammatical constructions, e.g. incorrect phrase attachment decisions ("The price of the food is high and is located . . . "), incorrect usage of articles ("located in riverside"), repetitive constructions ("Cotto, an Indian coffee shop located in . . . , is an Indian coffee shop . . . "). Some restaurant descriptions follow a tweet-style narration pattern which is understandable, but ungrammatical ("The Golden Palace Italian riverside coffee shop price range moderate and customer rating 1 out of 5").
A considerable number of instances have restaurant descriptions which contain information that does not entirely follow from the given input MR. These are cases in which input content elements are modified or dropped, which goes in line with what we observed in the outputs of Model-D. Some instances (10%) contained descriptions which we marked as questionable due to pragmatic and/or stylistic considerations. For example, restaurants which have familyFriendly[no] as part of the input MR are often described by crowd workers as "adults-only" establishments, which has an undesirable connotation. Finally, it is necessary to mention that some crowd workers followed inconsistent spelling and punctuation rules when hyphenating compound modifiers ("family friendly restaurant", "the restaurant is family friendly") or capitalizing MR attributes ("Riverside", "Fast food"). Punctuation errors were mainly restricted to missing a full stop at the end of a restaurant description or failing to delimit sentence clauses with commas.
The results of the manual data analysis show that Model-D indeed generates texts that are similar to the restaurant descriptions in the provided data set. Unfortunately, our data-driven approach is not flexible enough to make use of multiple references; it cannot cancel out the noise present in some training instances. One way of alleviating this problem could be reformulating the loss function to inform the system about the existence of multiple ways of generating a good restaurant description. Given a training instance, Model-D would generate a corresponding candidate text which could be compared to all human references. Each comparison results in computing a certain cost; the gradients could then be computed on the minimal cost among all comparisons.
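The proposed multi-reference objective can be sketched as follows. The cost function here is a toy stand-in (a real system would use a differentiable loss), and the function names are ours.

```python
# Sketch of the proposed multi-reference objective: compare a candidate
# to every human reference and keep the minimal cost, so the gradient
# would flow only from the closest reference (toy cost function).

def word_overlap_cost(candidate, reference):
    """Toy cost: fraction of candidate tokens absent from the reference."""
    cand, ref = candidate.split(), set(reference.split())
    missing = sum(1 for tok in cand if tok not in ref)
    return missing / max(len(cand), 1)

def multi_reference_cost(candidate, references):
    """Minimal cost over all references for one training instance."""
    return min(word_overlap_cost(candidate, ref) for ref in references)

refs = ["Wrestlers is a family friendly pub .",
        "Wrestlers is a pub and it is family friendly ."]
cost = multi_reference_cost("Wrestlers is a family friendly pub .", refs)
```

Under this objective, a candidate matching any one reference incurs no penalty, so noisy references no longer pull the model away from a good output.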

Final Evaluation
For the final submission we chose Model-T's predictions: despite lower metric scores, they contained the most grammatical outputs and preserved all input information in the generated text. The results of the final evaluation on the test data are presented in Table 5. They were produced by the TrueSkill algorithm (Sakaguchi et al., 2014), which performs pairwise system comparisons and clusters the systems into groups. For completeness, we include the highest reported scores among all the participants (rightmost column). Note, however, that the numerical scores are not directly interpretable; what matters is a system's relative ranking in terms of its range and cluster, since systems within one cluster are considered tied.
Model-T was assigned to the second best cluster both in terms of quality and naturalness, despite the much lower metric scores. Retrospectively, this justifies our decision to choose Model-T instead of Model-D for the final submission. The E2E NLG Challenge focuses on end-to-end data-driven NLG methods, which is why systems like Model-T might not exactly fit into the task setup. Nevertheless, we view such a system as a necessary candidate for comparison, since the E2E NLG Challenge data was designed to learn models that produce "more natural, varied and less template-like system utterances" (Novikova et al., 2017b).

Conclusion
In this paper we have presented the results of our participation in the E2E NLG Challenge. We have developed two conceptually different approaches and analyzed their performance, both quantitatively and qualitatively. We have shown that the costs of developing complex data-driven models are sometimes not justified and one is better off approaching the problem with simpler techniques. We hope that our observations and conclusions shed some light on the limitations of modern NLG approaches and possible ways of overcoming them.