Sequence-to-Sequence Models for Data-to-Text Natural Language Generation: Word- vs. Character-based Processing and Output Diversity

We present a comparison of word-based and character-based sequence-to-sequence models for data-to-text natural language generation, which generate natural language descriptions for structured inputs. On the datasets of two recent generation challenges, our models achieve comparable or better automatic evaluation results than the best challenge submissions. Subsequent detailed statistical and human analyses shed light on the differences between the two input representations and the diversity of the generated texts. In a controlled experiment with synthetic training data generated from templates, we demonstrate the ability of neural models to learn novel combinations of the templates and thereby generalize beyond the linguistic structures they were trained on.


Introduction
Natural language generation (NLG) is an actively researched task, which according to Gatt and Krahmer (2018) can be divided into text-to-text generation, such as machine translation (Koehn, 2017), text summarization (See et al., 2017), or open-domain conversation response generation (Vinyals and Le, 2015) on the one hand, and data-to-text generation on the other hand. Here, we focus on the latter, the task of generating textual descriptions for structured data. Data-to-text generation comprises the generation of system responses based on dialog acts in task-oriented dialog systems (Wen et al., 2015b), sports game reports and weather forecasts (Angeli et al., 2010), and database entry descriptions (Gardent et al., 2017a). In this paper, we focus on sentence planning and surface realization. We build on data-to-text datasets of two recent shared tasks for end-to-end NLG, namely the E2E challenge (Novikova et al., 2017b) and the WebNLG challenge (Gardent et al., 2017b). Example input-text pairs for both datasets are shown in Figure 1.
Neural sequence-to-sequence (Seq2Seq) models (Graves, 2013; Sutskever et al., 2014) have shown promising results for this task, especially in combination with an attention mechanism (Luong et al., 2015). Several recent NLG approaches (Dušek and Jurcícek, 2016; Mei et al., 2016; Kiddon et al., 2016; Agarwal and Dymetman, 2017), as well as most systems in the E2E and WebNLG challenges, are based on this architecture. While most NLG models generate text word by word, promising results were also obtained by encoding the input and generating the output text character by character (Lipton et al., 2015; Goyal et al., 2016; Agarwal and Dymetman, 2017). Five out of 62 E2E challenge submissions operate on the character level. However, it is difficult to draw conclusions from the challenge results with respect to this difference, since the submitted systems also differ in other aspects and were evaluated on a single dataset only.
Besides adequacy and fluency, variation is an important aspect in NLG (Stent et al., 2005). In addition to comparing the linguistic and content-wise correctness of word- and character-based Seq2Seq models through automatic and human evaluation, we investigate the variety of their outputs. While template-based systems can assure perfect content and linguistic quality, they often suffer from low diversity. Conversely, neural models might generalize beyond a limited amount of training texts or templates, thereby producing more diverse outputs. To test this hypothesis, we train Seq2Seq models on template-generated texts with a controlled amount of variation and show that they not only reproduce the templates, but also generate novel structures resulting from template combinations.
WebNLG
input: cityServed(Abilene Regional Airport[Abilene]), isPartOf(Abilene[Texas])
reference 1: Abilene is in Texas and is served by the Abilene regional airport.
reference 2: Abilene, part of Texas, is served by the Abilene regional airport.
delexicalized input: city served(AGENT-1[BRIDGE-1]), is part of(BRIDGE-1[PATIENT-1])
delexicalized reference 1: BRIDGE-1 is in PATIENT-1 and is served by the AGENT-1.
Figure 1: Example input-reference pairs from the E2E and WebNLG development set.
In sum, we make the following contributions:
• We compare word- and character-based Seq2Seq models for NLG on two datasets.
• We conduct an extensive automatic and manual analysis of the generated texts and compare them to human performance.
• In an experiment with synthetic training data generated from templates, we demonstrate the ability of neural NLG models to learn template combinations and thereby generalize beyond the linguistic structures they were trained on.

Related Work
This section reviews relevant related work according to the two main aspects of this paper: different input and output representations for data-to-text NLG as well as measuring and controlling the variation in the generated outputs.

Input and Output Representations
While the first NLG systems relied on handwritten rules or templates that were filled with the input information (Cheyer and Guzzoni, 2006; Mirkovic et al., 2006), the availability of larger datasets has accelerated the progress in statistical methods to train NLG systems from data-text pairs in the last twenty years (Oh and Rudnicky, 2000; Mairesse and Young, 2014). Generating output via language models based on recurrent neural networks (RNNs) conditioned on the input (Sutskever et al., 2011) proved to be an effective method for end-to-end NLG (Wen et al., 2015a,b, 2016). The input can be represented in several ways: (1) in a discrete vector space via one-hot vectors (Wen et al., 2015a,b), or in a continuous space either (2) by encoding fixed-size input information in a feed-forward neural network (Zhou et al., 2017; Wiseman et al., 2017) or (3) by means of an encoder RNN, which processes variable-sized inputs sequentially, giving rise to the Seq2Seq architecture.
Character-based Seq2Seq models were first proposed for neural machine translation (Ling et al., 2015; Chung et al., 2016; Lee et al., 2017). Their main advantage over word-based models is that they can represent an unlimited word inventory with a small vocabulary. They can learn to copy any string from the input to the output, which is especially useful for data-to-text NLG, as information from the input such as the name of a restaurant or a database entity is often expected to appear verbatim in the generated text. Word-based models, in contrast, have to make use of delexicalization during pre- and postprocessing (Wen et al., 2015b; Dušek and Jurcícek, 2016) or have to apply dedicated copy mechanisms (Gu et al., 2016; See et al., 2017; Wiseman et al., 2017) to handle open vocabularies. The other side of the coin is that sequences are much longer in character-based processing, implying longer dependencies and more computation steps for encoding and decoding.
Subword-based representations (Sennrich et al., 2016; Wu et al., 2016) can offer a trade-off between word- and character-based processing and are a popular choice in NMT and summarization (See et al., 2017). Here, the vocabulary consists of subword units of different lengths, which are assigned by minimizing the entropy on the training set. We also experimented with such representations in preliminary experiments, but found them to perform much worse than word- or character-based representations. Our impression is that recurring entity names in the training data, which stem from multiple reference texts for the same input, lead to overfitting on the training vocabulary and to poor generalization to novel inputs. This is also reflected by the rather unsatisfying performance of subword-based approaches in the E2E¹ and WebNLG challenges (ADAPT system; Gardent et al., 2017b).

Output Diversity
Evaluation of data-to-text NLG has traditionally centered around semantic fidelity, grammaticality, and naturalness (Gatt and Krahmer, 2018; Oraby et al., 2018b). More recently, the controllability of the style of the outputs and their variation has moved into focus as well (Ficler and Goldberg, 2017; Herzig et al., 2017; Oraby et al., 2018b,a). Oraby et al. (2018b) showed that the n-gram entropy of the outputs of a neural NLG system is significantly lower than that of its training data. This can be seen as evidence that the NLG system extracts only a few dominant patterns from the training data, which it then generates over and over. Without explicit supervision signals, neural NLG models cannot distinguish linguistic or stylistic variation from noise. In the context of image caption generation, Devlin et al. (2015) found that Seq2Seq models exactly reproduce sentences from their training data for 60% of the test instances.
Several approaches have been proposed to control NLG outputs with respect to certain stylistic aspects, e.g., mimicking a specific persona or character, personality traits (Mairesse and Walker, 2008; Herzig et al., 2017; Oraby et al., 2018b,a), or various linguistic aspects such as formality, voice, and descriptiveness (Ficler and Goldberg, 2017; Bawden, 2017; Niu et al., 2017). All share the feature that the NLG model is conditioned on a representation of the desired aspect in addition to the usual semantic input representation. While this approach makes it possible to successfully control particular, clearly defined aspects of the generated texts, further research is needed to grant more flexible and comprehensive NLG output control.

Models
To encode variable-length inputs and generate variable-length texts, we implement a standard Seq2Seq model with Long Short-Term Memory (LSTM) cells (Hochreiter and Schmidhuber, 1997) and attention. Given a training dataset of input-text pairs $D = \{(x_1, \bar{y}_1), (x_2, \bar{y}_2), \ldots\}$, the model encodes an input sequence $x = \{x_1 \ldots x_n\}$ of symbols $x_i$ into a sequence of hidden states $\{h_1 \ldots h_n\}$ by applying a recurrent neural network (RNN) with LSTM cells that can store and forget sequence information:
$$h_i = \mathrm{LSTM}(h_{i-1}, X_{in} x_i)$$
The decoder generates the output sequence $y_1 \ldots y_m$ one symbol $y_t$ at a time by computing
$$p(y_t \mid y_1 \ldots y_{t-1}, x) = \mathrm{softmax}(W_{out} c_t)$$
The decoder output $c_t$, also referred to as context vector, summarizes the input information in each decoding step as a weighted sum of the encoder hidden states:
$$c_t = \sum_{i=1}^{n} \alpha_{ti} h_i$$
where the weights $\alpha_{ti}$ are determined by an attention mechanism (Luong et al., 2015):
$$\alpha_{ti} = \frac{\exp(s_t^\top W_a h_i)}{\sum_{j=1}^{n} \exp(s_t^\top W_a h_j)}$$
The decoder hidden states $s_t$ are computed recursively based on the previous output token and decoder output:
$$s_t = \mathrm{LSTM}(s_{t-1}, X_{out} y_{t-1} \bullet c_{t-1})$$
$s_0$ is initialized to the final encoder hidden state $h_n$; $h_0$ and $c_{-1}$ are initialized to 0; $\bullet$ denotes concatenation. The parameters of the models are the input and output embedding matrices $X_{in}$, $X_{out}$, the encoder and decoder LSTM parameters, the attention matrix $W_a$ and the output matrix $W_{out}$. They are optimized by minimizing the cross entropy of the generated texts $y_j$ with the given references $\bar{y}_j$ for each example in the training set.
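To make the attention step concrete, the following is a minimal sketch of how a context vector can be computed as an attention-weighted sum of encoder states. It assumes a bilinear ("general") score between the decoder state and each encoder state; the function names and the toy vectors are illustrative assumptions and are not taken from our implementation.

```python
import math


def dot(a, b):
    """Dot product of two equally long lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) with a vector."""
    return [dot(row, vector) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(s_t, encoder_states, W_a):
    """Return (weights, context) for a single decoding step.

    Assumed scoring function: score_i = s_t^T W_a h_i.
    The context is the weighted sum of the encoder states.
    """
    scores = [dot(s_t, matvec(W_a, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * h[k] for w, h in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context


# Toy example with three encoder states of dimension two (made-up values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_a = [[1.0, 0.0], [0.0, 1.0]]  # identity matrix, for illustration only
s_t = [2.0, 0.0]
weights, context = attention_context(s_t, encoder_states, W_a)
```

In this toy setting the first and third encoder states receive the same, larger weight because both align with the decoder state, so the context vector is dominated by them.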
Instead of forcing the decoder to decide on a single output symbol in each decoding step, we apply beam search to explore n-best partial hypotheses in parallel.
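The idea of keeping several partial hypotheses can be illustrated with a small, self-contained sketch. The scoring table `TOY`, the helper `toy_model` and the end marker `[EOS]` are hypothetical stand-ins for a trained decoder; the sketch omits refinements such as length normalization and is not the toolkit's implementation.

```python
import math

EOS = "[EOS]"  # hypothetical end-of-sequence marker


def beam_search(next_log_probs, beam_size, max_len):
    """Keep the beam_size best partial hypotheses in every step.

    next_log_probs(prefix) must return a dict that maps each possible
    next symbol to its log-probability given the prefix (a tuple).
    Returns all hypotheses sorted by cumulative log-probability.
    """
    beams = [((), 0.0)]  # pairs of (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for symbol, logp in next_log_probs(prefix).items():
                candidates.append((prefix + (symbol,), score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == EOS:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return sorted(finished, key=lambda c: c[1], reverse=True)


# Toy next-symbol distributions (made up): the locally best first symbol
# "a" leads to a worse complete sequence than the alternative "b".
TOY = {
    (): {"a": math.log(0.6), "b": math.log(0.4)},
    ("a",): {"x": math.log(0.5), "y": math.log(0.5)},
    ("b",): {"z": math.log(0.9), "x": math.log(0.1)},
}


def toy_model(prefix):
    """Return the toy distribution, or end the sequence if none is defined."""
    return TOY.get(prefix, {EOS: 0.0})


greedy_best = beam_search(toy_model, beam_size=1, max_len=5)[0][0]
beam_best = beam_search(toy_model, beam_size=2, max_len=5)[0][0]
```

With a beam of one the search commits to "a" and ends with probability 0.3, whereas a beam of two recovers the sequence starting with "b", which has probability 0.36.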
In the word-based model, each input symbol x t and output symbol y t denotes a token. In contrast, in the character-based model, each input and output symbol denotes a single character. Our models learn separate encoder and decoder embedding matrices.
We use two recently collected crowd-sourced data-to-text datasets since they are larger and offer more linguistic variety than previously available datasets (Novikova et al., 2017b;Gardent et al., 2017a). The E2E dataset (Novikova et al., 2017b) consists of 47K restaurant descriptions based on 5.7K distinct inputs of 3-8 attributes (name, area, near, eat type, food, price range, family friendly, rating), split into 4862 inputs for training, 547 for development and 630 for testing. The WebNLG dataset (Gardent et al., 2017a) contains 25K verbalizations of 9.6K inputs composed of 1-7 DBpedia triples from 15 categories such as athletes, comic characters, food, sport teams. It is divided into 6893 inputs for training, 872 for development and 1862 for testing. Both datasets have multiple verbalizations for each input. On average there are 8.3 (min. 1, max. 46) verbalizations per input in the E2E dataset and 2.63 (min. 1, max. 12) in the WebNLG dataset, respectively.
To preprocess both datasets, we lowercase all inputs and references and represent the inputs in the bracketed format as shown in Figure 1. For the word-based processing we additionally tokenize the texts with the NLTK tokenizer (Bird et al., 2009) and apply delexicalization, as also illustrated in Figure 1. For the E2E dataset we adopt the challenge's baseline delexicalization strategy (Dušek and Jurcícek, 2016), which replaces the values of the two open-class attributes name and near in the input and references by placeholders. For the WebNLG dataset, we adopt the delexicalization strategy of the TILBURG submissions to the challenge, since it performed well and does not require external information. They replaced the subject and object entities of the DBpedia triples in the input and text by numbered placeholders AGENT-N, PATIENT-N, BRIDGE-N, depending on whether they only appear as subject, object or in both roles in the input of an instance. Additionally, we split properties at the camel case in this dataset for both the word- and character-based models, as proposed by the ADAPT and MELBOURNE submissions. Table 1 displays statistics for both datasets and processing types.
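Two of the preprocessing steps described above can be sketched in a few lines: splitting property names at the camel case and replacing the values of the open-class E2E attributes by placeholders. The function names, the dictionary-based input format and the example restaurant are assumptions made for illustration; the challenge baseline's actual delexicalization code may handle more cases.

```python
import re


def split_camel_case(prop):
    """Split a property name at the camel case and lowercase it."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", prop).lower()


def delexicalize_e2e(attributes, reference):
    """Replace the values of 'name' and 'near' by placeholders.

    attributes: dict mapping attribute names to values (assumed format).
    reference:  reference text; it is lowercased before matching.
    Returns the delexicalized attribute dict and reference text.
    """
    placeholders = {"name": "NAME", "near": "NEAR"}
    delex_attributes = dict(attributes)
    text = reference.lower()
    for attribute, placeholder in placeholders.items():
        value = attributes.get(attribute)
        if value is None:
            continue
        delex_attributes[attribute] = placeholder
        # Verbatim string match; values that are paraphrased in the
        # reference would be missed by this simple sketch.
        text = text.replace(value.lower(), placeholder)
    return delex_attributes, text


# Illustrative example (input and reference are made up).
attrs = {"name": "The Eagle", "food": "English", "near": "Burger King"}
ref = "The Eagle serves English food near Burger King."
delex_attrs, delex_ref = delexicalize_e2e(attrs, ref)
```

At generation time, the inverse mapping from placeholders back to the original values would be applied to the model output.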

Experiments
Table 1: E2E and WebNLG training split statistics for word-based processing after delexicalization and character-based processing.
We conduct our experiments with the OpenNMT toolkit (Klein et al., 2017), which we extend to also perform character-based processing. We tuned the hyperparameters for each dataset and processing method to optimize the BLEU score on the development sets. The word-based model for the E2E dataset is trained by stochastic gradient descent (SGD) (Robbins and Monro, 1951) with an initial learning rate of 1.0. For all other models, we achieved better performance with the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 0.001. If there is no improvement in the development perplexity, or in any case after the eighth epoch, we halve the learning rate. Also, we clip all gradients to a maximum of five. We use a batch size of 64. To prevent overfitting, we drop out units in the context vectors with a probability of 0.3. We keep the model with the lowest development perplexity in 13 training epochs. The word-based E2E model has 64-dimensional word embeddings and a single encoder and decoder layer with 64 units each. All other models use 500-dimensional word or character embeddings and two layers in the encoder and decoder with 500 dimensions each. While a unidirectional encoder was sufficient for the word-based models, bidirectional encoders were beneficial for the character-based models on both datasets.
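The learning rate schedule can be summarized by a small function. This is a sketch of one possible reading: the function name and arguments are hypothetical, and whether the unconditional decay applies from the eighth or the ninth epoch onwards is an assumption, not something the toolkit's behaviour is claimed to match exactly.

```python
def next_learning_rate(lr, epoch, dev_ppl, best_dev_ppl,
                       decay_start_epoch=8, factor=0.5):
    """Return the learning rate to use after the given epoch.

    The rate is halved if the development perplexity did not improve
    over the best value so far, or unconditionally once the epoch
    number exceeds decay_start_epoch (assumed interpretation).
    """
    no_improvement = dev_ppl >= best_dev_ppl
    past_decay_start = epoch > decay_start_epoch
    if no_improvement or past_decay_start:
        return lr * factor
    return lr


# Illustrative calls with made-up perplexities.
kept = next_learning_rate(0.001, epoch=3, dev_ppl=5.0, best_dev_ppl=5.5)
halved = next_learning_rate(0.001, epoch=3, dev_ppl=5.6, best_dev_ppl=5.5)
late = next_learning_rate(1.0, epoch=9, dev_ppl=4.0, best_dev_ppl=4.5)
```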
We use a beam size of 15 for decoding with the word-based models, and found a smaller beam of five to yield better results for the character-based models. This is probably due to the much smaller vocabulary size of the character-based models.
For automatic evaluation, we report BLEU (Papineni et al., 2002), which measures the precision of the generated n-grams compared to the references, and recall-oriented ROUGE-L (Lin, 2004), which measures the longest common subsequence between the generated texts and the references. We compute these scores with the E2E challenge evaluation script². Tables 2 and 3 display the results on the E2E and WebNLG test sets for models of the respective challenges and our own models³. Since the performance of neural models can vary considerably due to random parameter initialization and randomized training procedures (Reimers and Gurevych, 2017), we train ten models with different random seeds for each setting and report the average (avg) and standard deviation (SD).
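To illustrate what ROUGE-L measures, the following sketch computes a sentence-level score from the longest common subsequence (LCS) of a generated text and a single reference. It is a simplified illustration rather than the evaluation script itself: whitespace tokenization, the single reference and the weighting parameter `beta=1.2` are assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.2):
    """Sentence-level ROUGE-L F-score for whitespace-tokenized strings.

    beta > 1 weights recall higher than precision (assumed value).
    """
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return ((1 + beta ** 2) * precision * recall
            / (recall + beta ** 2 * precision))


# Illustrative strings (made up).
same = rouge_l("abilene is in texas", "abilene is in texas")
partial = rouge_l("abilene is in texas", "abilene , part of texas")
```

Because the LCS only requires tokens to appear in the same order, not contiguously, the second pair still receives a non-zero score for the shared tokens "abilene" and "texas".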
On the E2E test set, our single best word- and character-based models reach comparable results to the best challenge submissions. The word-based models achieve significantly higher BLEU and ROUGE-L scores than the character-based models⁴. On the WebNLG test set, the BLEU score of our best word-based model outperforms the best challenge submission by a small margin. The character-based model achieves a significantly higher ROUGE-L score than the word-based model, whereas the BLEU score difference is not significant. In the following, we analyze our models in more detail.

Analysis of Within-Model Performance Differences
The large performance span of the character-based models on the E2E dataset is due to a single outlier model; the second worst model scores 64.5 BLEU points. The worst-scoring model had a lower accuracy of 91.8% on the development set, whereas all other models scored above 92.2%. To gain more insight into what might cause the large performance difference, we manually compared the generated texts for ten randomly selected inputs for each number of attributes (60 inputs in total) of the character-based models with the best and worst BLEU score. We found that the worst model makes many mistakes on inputs with three to five attributes, often adding, modifying or removing information, whereas the outputs are mostly correct for inputs with six attributes or more. For these, the outputs of the model with the lowest BLEU score are occasionally even better than those of the best model, which often omits information (mainly concerning the attribute family friendly). We conclude that the large performance difference might be caused by automatic evaluation measures punishing additions more severely than omissions.
³ For an exact comparison, we recomputed the WebNLG challenge results with the E2E evaluation script. They are usually 1-2 points below the scores reported by Gardent et al. (2017b).
⁴ All tests for significance in this paper are conducted with Wilcoxon rank sum tests with Bonferroni correction at a p-level of 0.05.
We also observe a large performance span for the WebNLG word-based models. Here, we have two models that score exceptionally well with 57.4/58.4 BLEU points, whereas the remaining eight models only obtain BLEU scores in a range of 43.8-48.1. Again, we observe that better models in terms of BLEU score obtain higher accuracies on the development set. We manually compared the outputs of ten randomly chosen inputs for each number of input triples (75 inputs in total) for the model with the highest and lowest BLEU score. In this case, we found that the large difference in the automatic evaluation measures seems justified: The low-scoring model often hallucinates information not present in the input and generally produces many ungrammatical texts, which is not the case for the best model.
Own results correspond to avg±SD of ten runs and single result of best models on the development set.

Automatic Evaluation of Human Texts
To gain an impression of the expressiveness of the automatic evaluation scores for NLG, we computed the average scores that the human references would obtain. [...] character-based models⁵. Strikingly, on the E2E development set, both model variants outperform the human texts significantly and by a wide margin with respect to both automatic evaluation measures. While the human BLEU score is significantly higher than those of both systems on the WebNLG development set, there is no statistically significant difference between human and system ROUGE-L scores. This further demonstrates the limited utility of BLEU and ROUGE-L scores to evaluate NLG outputs, which was previously suggested by weak correlations of such scores with human judgments (Scott and Moore, 2006; Reiter and Belz, 2009; Novikova et al., 2017a). Furthermore, the high scores on the E2E dataset imply that the models succeed in picking up patterns from the training data that transfer well to the similar development set, whereas human variation and creativity are punished by lexical overlap-based automatic evaluation scores.

Manual Error Analysis
Since the expressiveness of automatic evaluation measures for NLG is limited, as shown in the previous subsection, we performed a manual error analysis on inputs of each length. We define the input length as the number of input attributes for the E2E dataset, ranging from three to eight, and the number of input triples for the WebNLG dataset, ranging from one to seven. We randomly selected [...] One annotator (one of the authors of this paper) manually assessed the outputs of the models that obtained the best development set BLEU score, as summarized in Table 5.
⁵ For a fair comparison between human and model performance, we randomly removed one reference for each instance in the models' evaluation to ensure the same average number of references. We excluded 55 WebNLG instances that had only one reference.
Table 6: Linguistic diversity of development set references and generated texts as avg±SD. '% new' denotes the share of generated texts or sentences that do not appear in training references. Higher indicates more diversity for all measures.
Comparing the two datasets, we again observe that the WebNLG dataset is much more challenging than the E2E dataset, especially with respect to correctly verbalizing the content. This can be attributed to the increased diversity of the inputs and texts and to the limited availability of training data for this dataset (cf. Table 1). Moreover, spelling mistakes only appeared in WebNLG texts, mainly concerning omissions of accents or umlauts. This also indicates that the data are too scarce and too noisy for the models to learn the correct spelling of all words. Notably, we did not observe any non-words generated by the character-based models.
The most frequent content error in both datasets concerns omission of information. For the E2E dataset, the family friendly attribute is most frequently dropped by both model types, indicating that the verbalization of this boolean attribute is more difficult to learn than other attributes, whose values mostly appear verbatim in the text. Information modification of the word-based model is mainly due to confusing English with Italian food. Information addition and repetition only occur in the WebNLG dataset. The latter is an especially frequent problem of the character-based model, affecting more than a quarter of all texts.
In comparison, character-based models reproduce the content more faithfully on the E2E dataset while offering the same level of linguistic quality as word-based models, leading to more correct outputs overall. On the WebNLG dataset, the word-based model is more faithful to the inputs, probably because of the effective delexicalization strategy, whereas the character-based model errs less on the linguistic side. Overall, the word-based model yields more correct texts, stressing the importance of delexicalization and data normalization in low resource settings.

Automatic Evaluation of Output Diversity
While correctness is a necessity in NLG, in many settings it is not sufficient. Often, variation of the generated texts is crucial to avoid repetitive and unnatural outputs. Table 6 shows automatically computed statistics on the diversity of the generated texts of both models and human texts and on the overlap of the (generated) texts with the training set. We measure diversity by the number of unique sentences and words in all development set references and generated texts, as done e.g. by Devlin et al. (2015). Additionally, we report the Shannon text entropy as a measure of the amount of variation in the texts, following Oraby et al. (2018b). We compute the text entropy E for words (unigrams) and uni-, bi-, and trigrams as follows:
$$E = -\sum_{x \in V} \frac{f(x)}{total} \log_2 \frac{f(x)}{total}$$
where V is the set of all word types or uni-, bi- and trigrams, f denotes frequency and total is the token count or total number of uni-, bi- and trigrams in the texts, respectively.
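As an illustration of the entropy measure, the following sketch computes the Shannon entropy over the n-grams of a set of texts. The function names, whitespace tokenization, the base-2 logarithm and the choice to count n-grams within each text separately are assumptions made for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def text_entropy(texts, max_n=1):
    """Shannon entropy in bits over all n-grams with 1 <= n <= max_n.

    max_n=1 gives the word (unigram) entropy; max_n=3 pools
    uni-, bi- and trigrams into a single distribution.
    """
    counts = Counter()
    for text in texts:
        tokens = text.split()
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    total = sum(counts.values())
    return -sum((f / total) * math.log2(f / total)
                for f in counts.values())


# Toy corpora (made up): repeated texts carry less information.
uniform = text_entropy(["a b c d"])            # four equally frequent words
repetitive = text_entropy(["a b", "a b"])      # two word types, twice each
pooled = text_entropy(["a b c"], max_n=3)      # six distinct n-grams
```

A system that keeps generating the same few patterns concentrates the probability mass on few n-grams and therefore obtains a lower entropy than a more varied set of texts of the same size.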
To measure the extent to which the models generalize beyond plugging in restaurant or other entity names into templates extracted from the training data, we compute the results on the delexicalized outputs of the word-based models and delexicalize the character-based models' outputs. For the human scores, we generate n artificial prediction files, treating each n-th reference (42 for E2E, 8 for WebNLG) as reference, apply delexicalization, and average the scores for the n files.
Learned combinations of Template 1 and 2:
• NAME is a restaurant which serves English food in the moderate price range. It is located in the city centre area, near NEAR. It has a customer rating of 1 out of 5. It is not family friendly.

Generalizing from Templates
In search of empirical evidence that neural models are able to surpass the structures they were trained on, we train Seq2Seq models with synthetic training data created by templates. This enables us to control the variation in the training data and identify novel generations of the model (if any). We investigate two questions: (1) Do the neural NLG models indeed accurately learn the templates from the training data? (2) Do they learn to combine the training templates to produce more varied outputs than seen during training?
We generate synthetic training data based on two templates. Template 1 corresponds to UKP-TUDA's submission to the E2E challenge⁷, where the order of describing the input information is fixed. Specifically, the restaurant's customer rating is always mentioned before its location. For Template 2, we change the beginning of the template and switch the order of mentioning the rating and location of the restaurant as shown in Figure 2. Potential combinations of the two templates are to combine the beginning of Template 1 with the ordering of rating and area of Template 2 or vice versa. We generate a single reference text for all 2261 training inputs of the E2E dataset where the NAME and EATTYPE attributes are present, as these are the two obligatory attributes for the templates. We train word-based models on training data generated with Template 1, Template 2 and the concatenation of the training data from Template 1 and 2. To keep the amount of training data equal in all experiments, we duplicate the training corpus generated only with Template 1 or Template 2. The hyperparameters for the three models can be found in the appendix.
Table 7: Manual evaluation of generated texts for 10 random test instances of a word-based model trained with synthetic training data from two templates. c@n: avg. number of correct texts (with respect to content and language) among the top n hypotheses.
Table 7 shows our manual evaluation of the top 30 hypotheses for 10 random E2E test inputs generated by models trained with data synthesized from the two templates. As is evident from the first two rows, all models learned to generalize from the training data to produce correct texts for novel inputs consisting of unseen combinations of input attributes. It was verified in the manual evaluation that 100% of the texts generated by models trained on a single template adhered to this template. Yet, the picture is a bit different for the model trained on data generated by both templates.
While the top two hypotheses are equally distributed between adhering to Template 1 and Template 2, more than 5% among the lower-ranked hypotheses constitute a template combination such as the example shown in the bottom part of Figure 2. For 60% of the examined inputs, there was at least one such hypothesis resulting from template combination, of which two thirds were actually correct verbalizations of the input.
Since we found that the models frequently ranked correct hypotheses below hypotheses with content errors, we implemented a simple rule-based reranker based on verbatim matches of attribute values. The reranker assigns an error point to each omission and addition of an attribute value. As can be seen in the final row of Table 7, this simple reranker successfully places correct hypotheses higher up in the ranking, improving the practical usability of the generation model by now offering almost three correct variants for each input among the top five hypotheses on average.
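A reranker of this kind can be sketched as follows. The sketch assumes that each hypothesis is a delexicalized string, that the input is given as a list of attribute values, and that an inventory of all known attribute values is available for detecting additions; these data structures, the function names and the example values are illustrative assumptions rather than our exact implementation.

```python
def error_points(hypothesis, input_values, known_values):
    """Count omitted and added attribute values by verbatim matching.

    An omission is an input value that does not occur in the text.
    An addition is a known value that occurs in the text although
    it is not part of the input.
    """
    text = hypothesis.lower()
    expected = {value.lower() for value in input_values}
    omissions = sum(1 for value in expected if value not in text)
    additions = sum(
        1 for value in known_values
        if value.lower() not in expected and value.lower() in text
    )
    return omissions + additions


def rerank(hypotheses, input_values, known_values):
    """Sort hypotheses by error points.

    Python's sort is stable, so hypotheses with the same number of
    error points keep the order assigned by the model.
    """
    return sorted(
        hypotheses,
        key=lambda h: error_points(h, input_values, known_values),
    )


# Illustrative example (hypotheses and value inventory are made up).
input_values = ["NAME", "english", "city centre"]
known_values = ["english", "italian", "city centre", "riverside"]
hypotheses = [
    "NAME serves italian food in the city centre.",  # one omission, one addition
    "NAME serves english food.",                     # one omission
    "NAME serves english food in the city centre.",  # no errors
]
reranked = rerank(hypotheses, input_values, known_values)
```

Verbatim matching is a deliberately simple criterion: it cannot judge attributes whose values are paraphrased in the text, such as boolean attributes.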

Conclusion
We compared word-based and character-based Seq2Seq models for data-to-text NLG on two datasets and analyzed their output diversity. Our main findings are as follows: First, Seq2Seq models can learn to verbalize structured inputs in a decent way; their success depends on the extent of the domain and available (clean) training data.
Second, in a comparison with texts produced by humans, we saw that neural NLG models can even surpass human performance in terms of automatic evaluation measures. On the one hand, this unveils the ability of the models to extract general patterns from the training data that approximate many reference texts, but on the other hand also once more stresses the limited utility of such measures to evaluate NLG systems.
Third, in light of the multi-faceted analysis we performed, it is difficult to draw a general conclusion on whether word- or character-based processing is more useful for data-to-text generation. Both models yielded comparable results with respect to automatic evaluation measures. In the manual error analysis, the character-based model performed better on the E2E dataset, whereas the word-based model generated more correct outputs on the WebNLG dataset. Character-based models were found to have a significantly higher output diversity.
Finally, in a controlled experiment with word-based Seq2Seq models trained on data synthesized from templates, we showed the capability of such models to perfectly reproduce the templates they were trained on. More importantly, models trained on two templates could generalize beyond their training data and come up with novel texts. In future work, we would like to extend this line of research and train more model variants on a higher number of templates.

A Hyperparameters for Models Trained on Synthetic Training Data
For the models trained on template-generated data, we tune the hyperparameters to achieve 100% accuracy for their best hypotheses on template-generated references on the development set. All models have a single-layer LSTM with 64 hidden units in the encoder and decoder. We halve the learning rate starting from the eighth training epoch or if the perplexity on the validation set does not improve. The gradient norm is capped at two. The decoder uses the general attention mechanism. For decoding, we set the beam size to 30.
Table 8: Hyperparameters for the models trained on synthetic training data generated from Template 1 (T 1), Template 2 (T 2) and both (T 1+2).