End-to-End Content and Plan Selection for Data-to-Text Generation

Learning to generate fluent natural language from structured data with neural networks has become a common approach for NLG. This problem can be challenging when the form of the structured data varies between examples. This paper presents a survey of several extensions to sequence-to-sequence models to account for the latent content selection process, particularly variants of copy attention and coverage decoding. We further propose a training method based on diverse ensembling to encourage models to learn distinct sentence templates during training. An empirical evaluation of these techniques shows an increase in the quality of generated text across five automated metrics, as well as human evaluation.


Introduction
Recent developments in end-to-end learning with neural networks have enabled methods to generate textual output from complex structured inputs such as images and tables. These methods may also enable the creation of text-generation models that are conditioned on multiple key-value attribute pairs. The conditional generation of fluent text poses multiple challenges since a model has to select content appropriate for an utterance, develop a sentence layout that fits all selected information, and finally generate fluent language that incorporates the content. End-to-end methods have already been applied to increasingly complex data to simultaneously learn sentence planning and surface realization but were often restricted by the limited data availability (Wen et al., 2015; Mei et al., 2015; Dušek and Jurčíček, 2016; Lampouras and Vlachos, 2016). The recent creation of datasets such as the E2E NLG dataset (Novikova et al., 2017) provides an opportunity to further advance methods for text generation. In this work, we focus on the generation of language from meaning representations (MR), as shown in Figure 1. This task requires learning a semantic alignment from MR to utterance, wherein the MR can comprise a variable number of attributes.
Figure 1: An example of a meaning representation and utterance pair from the E2E NLG dataset. Each example comprises a set of key-value pairs and a natural language description.
MR: name: The Golden Palace, eatType: coffee shop, food: Fast food, priceRange: cheap, customer rating: 5 out of 5, area: riverside
Reference: A coffee shop located on the riverside called The Golden Palace, has a 5 out of 5 customer rating. Its price range are fairly cheap for its excellent Fast food.
Recently, end-to-end generation has been handled primarily by sequence-to-sequence (S2S) models (Sutskever et al., 2014; Bahdanau et al., 2014) that encode some information and decode it into a desired format. Extensions for summarization and other tasks have developed a mechanism to copy words from the input into a generated text (Vinyals et al., 2015; See et al., 2017).
We begin with a strong S2S model with a copy mechanism for the E2E NLG task and include methods that help to control the length of a generated text and how many inputs a model uses (Tu et al., 2016; Wu et al., 2016). Finally, we also present results of the Transformer architecture (Vaswani et al., 2017) as an alternative S2S variant. We show that these extensions lead to improved text generation and content selection.
We further propose a training approach based on the diverse ensembling technique (Guzman-Rivera et al., 2012). In this technique, multiple models are trained to partition the training data during the process of training the model itself, thus leading to models that follow distinct sentence templates. We show that this approach improves not only the quality of generated text but also the robustness of the training process to outliers in the training data.
Experiments are run on the E2E NLG challenge. We show that the application of this technique increases the quality of generated text across five different automated metrics (BLEU, NIST, METEOR, ROUGE, and CIDEr) over multiple strong S2S baseline models (Dušek and Jurčíček, 2016; Vaswani et al., 2017; Su et al., 2018; Freitag and Roy, 2018). Among 60 submissions to the challenge, our approach ranked first in METEOR, ROUGE, and CIDEr scores, third in BLEU, and sixth in NIST.

Related Work
Traditional approaches to natural language generation separate the generation of a sentence plan from the surface realization. First, an input is mapped into a format that represents the layout of the output sentence, for example, an adequate pre-defined template. Then, the surface realization transforms the intermediary structure into text (Stent et al., 2004). These representations often model the hierarchical structure of discourse relations (Walker et al., 2007). Early data-driven approaches used phrase-based language models for generation (Oh and Rudnicky, 2000; Mairesse and Young, 2014), or aimed to predict the best fitting cluster of semantically similar templates (Kondadadi et al., 2013). More recent work combines both steps by learning plan and realization jointly using end-to-end trained models (e.g. Wen et al., 2015). Several approaches have looked at generation from abstract meaning representations (AMR), and Peng et al. (2017) apply S2S models to the problem. However, Ferreira et al. (2017) show that S2S models are outperformed by phrase-based machine translation models on small datasets. To address this issue, Konstas et al. (2017) propose a semi-supervised training method that can utilize English sentences outside of the training set to train parts of the model. We address the issue by using copy-attention to enable the model to copy words from the source, which helps to generate out-of-vocabulary and rare words. We note that end-to-end trained models, including our approach, often do not explicitly model the sentence planning stage, and are thus not directly comparable to previous work on sentence planning. This is especially limiting for the generation of complex argument structures that rely on hierarchical structure.
For the task of text generation from simple key-value pairs, as in the E2E task, Juraska et al. (2018) describe a heuristic based on word overlap that provides unsupervised slot alignment between meaning representations and open slots in sentence plans. This method allows a model to operate with a smaller vocabulary and to be agnostic to actual values in the meaning representations. To account for syntactic structure in templates, Su et al. (2018) describe a hierarchical decoding strategy that generates different parts of speech at different steps, filling in slots between previously generated tokens. In contrast, our model uses copy-attention to fill in latent slots inside of learned templates. Juraska et al. (2018) also describe a data selection process in which they use heuristics to filter a dataset to the most natural sounding examples according to a set of rules. Our work aims at the unsupervised segmentation of data such that one model learns the most natural sounding sentence plans.

Background: Sequence-to-Sequence Generation
We start by introducing the standard text-to-text problem and discuss how to map structured data into a sequential form. Let $(x^{(0)}, y^{(0)}), \ldots, (x^{(N)}, y^{(N)}) \in (\mathcal{X}, \mathcal{Y})$ be a set of $N$ aligned source and target sequence pairs, with $(x^{(i)}, y^{(i)})$ denoting the $i$th element in $(\mathcal{X}, \mathcal{Y})$. Further, let $x = x_1, \ldots, x_m$ be the sequence of $m$ tokens in the source, and $y = y_1, \ldots, y_n$ the target sequence of length $n$. Let $\mathcal{V}$ be the vocabulary of possible tokens, and $[n]$ the list of integers up to $n$, $[1, \ldots, n]$.
S2S aims to learn a distribution parametrized by $\theta$ that maximizes the conditional probability $p_\theta(y \mid x)$. We assume that the target is generated from left to right, such that $p_\theta(y \mid x) = \prod_{t=1}^{n} p_\theta(y_t \mid y_{[t-1]}, x)$, and that $p_\theta(y_t \mid y_{[t-1]}, x)$ takes the form of an encoder-decoder architecture with attention. Training maximizes the log-likelihood of the observed training data.
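The left-to-right factorization means that the log-likelihood of a target sequence is the sum of its per-token log-probabilities. The following is a minimal sketch of that arithmetic; the function name and the per-step probabilities are made up for illustration and do not come from a trained model.

```python
import math

def sequence_log_likelihood(step_probs):
    """Sum of log p(y_t | y_[t-1], x) over all decoding steps t."""
    return sum(math.log(p) for p in step_probs)

# Hypothetical probabilities a model assigns to three gold target tokens.
step_probs = [0.5, 0.25, 0.8]
log_lik = sequence_log_likelihood(step_probs)

# Exponentiating the sum recovers the product of the per-step probabilities.
sequence_prob = math.exp(log_lik)
```

Working in log space avoids numerical underflow for long targets, which is why training objectives are usually stated as log-likelihoods.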
We evaluate the performance of both the LSTM (Hochreiter and Schmidhuber, 1997) and Transformer (Vaswani et al., 2017) architectures. We additionally experiment with two attention formulations. The first uses a dot-product between the hidden states of the encoder and decoder (Luong et al., 2015). The second uses a multi-layer perceptron with the hidden states as inputs (Bahdanau et al., 2014). We refer to them as dot and MLP respectively. Since dot attention does not require additional parameters, we hypothesize that it performs well in a limited data environment.
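The difference between the two attention formulations can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the implementation used in the experiments: the parameter names `W_dec`, `W_enc`, and `v` are hypothetical, and the MLP score is written in the common form v^T tanh(W_dec d + W_enc e).

```python
import math

def dot_score(dec_state, enc_state):
    """Dot attention: inner product of decoder and encoder states (no parameters)."""
    return sum(d * e for d, e in zip(dec_state, enc_state))

def mlp_score(dec_state, enc_state, W_dec, W_enc, v):
    """MLP attention: v^T tanh(W_dec d + W_enc e) with trainable W_dec, W_enc, v."""
    hidden = [
        math.tanh(sum(w * d for w, d in zip(row_d, dec_state))
                  + sum(w * e for w, e in zip(row_e, enc_state)))
        for row_d, row_e in zip(W_dec, W_enc)
    ]
    return sum(vi * hi for vi, hi in zip(v, hidden))

def softmax(scores):
    """Normalize raw scores into attention weights that sum to one."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: one decoder state attending over two encoder states.
decoder_state = [1.0, 0.0]
encoder_states = [[1.0, 0.0], [0.0, 1.0]]
dot_weights = softmax([dot_score(decoder_state, e) for e in encoder_states])

# Hypothetical MLP parameters chosen so the score is easy to check by hand.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
example_mlp_score = mlp_score(decoder_state, encoder_states[0], identity, zeros, [1.0, 1.0])
```

The dot variant has nothing to learn beyond the hidden states themselves, whereas the MLP variant adds two matrices and a vector, which is the parameter difference the hypothesis above refers to.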
In order to apply S2S models, a list of attributes in an MR has to be linearized into a sequence of tokens (Konstas et al., 2017; Ferreira et al., 2017). Not all attributes have to appear for all inputs, and each attribute might have multi-token values, such as area: city centre. We use special start and stop tokens for each possible attribute to mark value boundaries; for example, an attribute area: city centre becomes a start-of-area token, followed by the value tokens city centre, followed by an end-of-area token. These fragments are concatenated into a single sequence to represent the original MR as an input sequence to our models. In this approach, no values are delexicalized, in contrast to Juraska et al. (2018) and others who delexicalize a subset of attributes. An alternative approach by Freitag and Roy (2018) treats the attribute type as an additional feature and learns embeddings for words and types separately.
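The linearization step can be sketched as follows. The function name and the exact spelling of the boundary tokens (`__start_<attribute>__` and `__end_<attribute>__`) are assumptions made for this illustration; only the idea of attribute-specific start and stop tokens around lexicalized values is taken from the description above.

```python
def linearize_mr(mr):
    """Turn a list of (attribute, value) pairs into a flat token sequence.

    The boundary-token spelling is hypothetical; any attribute-specific
    start/stop symbols would serve the same purpose.
    """
    tokens = []
    for attribute, value in mr:
        key = attribute.replace(" ", "_")
        tokens.append(f"__start_{key}__")
        tokens.extend(value.split())  # multi-token values stay lexicalized
        tokens.append(f"__end_{key}__")
    return tokens

# Attributes that are absent from an MR are simply not emitted.
mr = [("name", "The Golden Palace"), ("area", "city centre")]
tokens = linearize_mr(mr)
```

Because the values are kept as words rather than replaced by placeholders, a copy mechanism can later point at them directly.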

Learning Content Selection
We extend the vanilla S2S system with methods developed for the related problem of text summarization. In particular, we implement a pointer-generator network similar to that introduced by Nallapati et al. (2016) and See et al. (2017), which can generate content by copying tokens from an input during the generation process.
Copy Model The copy model introduces a binary variable $z_t$ for each decoding step $t$ that acts as a switch between copying from the source and generating words. Following the procedure described by Gulcehre et al. (2016), we model the joint probability of $y_t$ and $z_t$ and marginalize over the switch:
$$p(y_t \mid y_{[t-1]}, x) = \sum_{z \in \{0, 1\}} p(y_t, z_t = z \mid y_{[t-1]}, x).$$
To calculate the switching probability $p(z_t \mid y_{[t-1]}, x)$, let $v \in \mathbb{R}^{d_{hid}}$ be a trainable parameter. The hidden state of the decoder $h_t$ is used to compute $p(z_t) = \sigma(h_t^\top v)$ and decompose the joint distribution into two parts:
$$p(y_t) = p(z_t = 1)\, p(y_t \mid z_t = 1) + p(z_t = 0)\, p(y_t \mid z_t = 0),$$
where every term is conditioned on $x$ and $y_{[t-1]}$. Here, $p(y_t \mid z_t = 0)$ is the distribution generated by the previously described S2S model, and $p(y_t \mid z_t = 1)$ is a distribution over $x$ that is computed using the same attention mechanism with separate parameters.
In our problem, all values in the MRs should occur in the generated text and are typically words that would not be generated by a language model. This allows us to use an assumption by Gulcehre et al. (2016) that every word that occurs in both source and target was copied, which avoids having to marginalize over $z$. Then, the log-likelihood of $y_t$ and $z_t$ is maximized during training. This approach has the further advantage that it can handle previously unseen input by learning to copy these words into the correct position.
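The switch between copying and generating, and the training-time assumption about which words were copied, can be sketched as follows. This is a toy illustration with hand-picked numbers: the function names are hypothetical, the distributions are plain dictionaries rather than network outputs, and reading the sigmoid output as the probability of copying (z_t = 1) is an assumption of the sketch.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def copy_switch_prob(h_t, v):
    """sigmoid(h_t . v), read here as p(z_t = 1), the probability of copying."""
    return sigmoid(sum(h * w for h, w in zip(h_t, v)))

def mix_distributions(p_switch, p_generate, p_copy):
    """p(y_t) = p(z_t=1) p(y_t | z_t=1) + p(z_t=0) p(y_t | z_t=0).

    p_generate maps vocabulary words to probabilities (z_t = 0);
    p_copy maps source tokens to probabilities (z_t = 1).
    """
    words = set(p_generate) | set(p_copy)
    return {
        w: p_switch * p_copy.get(w, 0.0) + (1.0 - p_switch) * p_generate.get(w, 0.0)
        for w in words
    }

def switch_labels(source_tokens, target_tokens):
    """Label a target word as copied (1) if it also occurs in the source, else 0."""
    source = set(source_tokens)
    return [1 if tok in source else 0 for tok in target_tokens]

p_switch = copy_switch_prob(h_t=[0.0, 0.0], v=[1.0, -1.0])  # sigmoid(0) = 0.5
p_y = mix_distributions(
    p_switch,
    p_generate={"is": 0.6, "a": 0.4},
    p_copy={"Eagle": 1.0},
)
labels = switch_labels(["Eagle", "British"], ["The", "Eagle", "is", "British"])
```

With the labels fixed in this way, the switch variable is observed during training, so no marginalization over it is needed.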
Coverage and Length Penalty We observed that text generated by vanilla S2S models, with and without the copy mechanism, commonly omits some of the values in the input. To mitigate this effect, we use two penalty terms during inference: a length penalty and a coverage penalty. In contrast to Tu et al. (2016), who introduced a coverage penalty term into the attention of an S2S model for neural machine translation, and See et al. (2017), who used the same idea for abstractive summarization, we apply the coverage penalty during inference only. We use the penalty term $cp$ defined by Wu et al. (2016) as
$$cp(x, y) = \beta \cdot \sum_{i=1}^{|x|} \log\left(\min\left(\sum_{t=1}^{|y|} a_i^t,\ 1\right)\right),$$
where $a_i^t$ denotes the attention weight on source token $x_i$ at decoding step $t$, and $\beta$ is a parameter to control the strength of the penalty. This penalty grows in magnitude when too many generated words attend to the same input, because the remaining inputs are then left with little attention. We typically do not want to repeat the name of the restaurant or the type of food it serves. Thus, we only want to attend to the restaurant name once, when we actually generate it. We also use the length penalty $lp$ by Wu et al. (2016), defined as
$$lp(y) = \frac{(5 + |y|)^{\alpha}}{(5 + 1)^{\alpha}},$$
where $\alpha$ is a tunable parameter that controls how much the likelihoods of longer generated texts are discounted. The penalties are used to re-rank beams during the inference procedure such that the full score function $s$ becomes
$$s(x, y) = \frac{\log p(y \mid x)}{lp(y)} + cp(x, y).$$
Figure 2: The multiple-choice loss for a single training example. L_i has the smallest loss and receives parameter updates.
A final inference-time restriction of our model is the blocking of repeated sentence beginnings. Automatic metrics do not punish a strong parallelism between sentences, but repeated sentence beginnings interrupt the flow of a text and make it look unnatural. We found that since each model follows a strict latent template during generation, the generated text would often begin every sentence with the same words. Therefore, we encourage syntactic variation by pruning beams during beam search that start two sentences with the same bigram. Paulus et al. (2017) use similar restrictions for summarization by blocking repeated trigrams across the entire generated text. Since automated evaluation does not punish repeated sentence beginnings, we only enable this restriction when generating text for the human evaluation.
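The re-ranking score with length and coverage penalties can be sketched as follows, assuming the formulation of Wu et al. (2016): lp(y) = ((5 + |y|) / 6)^alpha, cp(x, y) = beta * sum_i log(min(sum_t a_i^t, 1)), and s(x, y) = log p(y|x) / lp(y) + cp(x, y). The function names, the small epsilon that guards against log(0), and the toy attention matrices are assumptions of this sketch.

```python
import math

def length_penalty(length, alpha):
    """lp(y) = ((5 + |y|) / (5 + 1)) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha

def coverage_penalty(attention, beta, eps=1e-10):
    """cp(x, y) = beta * sum_i log(min(sum_t a_i^t, 1)).

    attention[t][i] is the attention weight on source token i at decoding
    step t. eps avoids log(0) for inputs that received no attention at all.
    """
    total = 0.0
    for i in range(len(attention[0])):
        covered = sum(step[i] for step in attention)
        total += math.log(min(max(covered, eps), 1.0))
    return beta * total

def beam_score(log_prob, attention, alpha, beta):
    """s(x, y) = log p(y|x) / lp(y) + cp(x, y); one attention row per output token."""
    return log_prob / length_penalty(len(attention), alpha) + coverage_penalty(attention, beta)

# Two hypothetical beams with equal likelihood and length over a two-token source.
balanced = [[1.0, 0.0], [0.0, 1.0]]   # each input is attended to once
repeated = [[0.9, 0.1], [0.9, 0.1]]   # the first input absorbs most attention
score_balanced = beam_score(-2.0, balanced, alpha=0.4, beta=0.1)
score_repeated = beam_score(-2.0, repeated, alpha=0.4, beta=0.1)
```

In this toy comparison the beam that spreads its attention over both inputs receives the higher score, which is the behaviour the coverage penalty is meant to encourage during re-ranking.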

Learning Latent Sentence Templates
Each generated text follows a latent sentence template to describe the attributes in its MR. The model has to associate each attribute with its location in a sentence template. However, S2S models can learn wrong associations between inputs and targets with limited data, which was also shown by Ferreira et al. (2017). Additionally, consider that we may see the following generated texts for similar inputs: "There is an expensive British Restaurant called the Eagle." and "The Eagle is an expensive, British Restaurant." Both incorporate the same information but have a different structure. A model that is trained on both styles simultaneously might struggle to generate a single output sentence. To address this issue and to learn a set of diverse generation styles, we train a mixture of models in which every sequence is still generated by a single model. The method aims to force each model to learn a distinct sentence template.
The mixture aims to split the training data between the models such that each model trains only on a subset of the data and can learn a different template structure. Thus, one model does not have to fit all the underlying template structures simultaneously. Moreover, it implicitly removes outlier training examples from all but one part of the mixture. Let $f_1, \ldots, f_K$ be the $K$ models in the mixture. These models can either be completely disjoint or share a subset of their parameters (e.g. the word embeddings, the encoder, or both encoder and decoder). Following Guzman-Rivera et al. (2012), we introduce an unobserved random variable $w \sim \mathrm{Cat}(1/K)$ that assigns a weight to each model for each input. Let $p_\theta(y \mid x, w)$ denote the probability of an output $y$ for an input $x$ with a given segmentation $w$. The likelihood for each point is defined as a mixture of the individual likelihoods,
$$p_\theta(y \mid x) = \sum_{k=1}^{K} p(w_k = 1, w_{\neg k} = 0)\, p_\theta(y \mid x, w_k = 1, w_{\neg k} = 0).$$
By constraining each entry of $w$ to assume either 0 or 1, the optimization problem over the whole dataset becomes a joint optimization of assignments of models to data points and parameters to models.
To maximize the target, Guzman-Rivera et al. (2012) propose the multiple-choice loss (MCL), which alternates between assigning each training point to the model that fits it best and retraining each model on its assigned data points. This process repeats until the point assignments converge. Related work by Kondadadi et al. (2013) has shown that models compute clusters of templates. Further work by Lee et al. (2016) reduces the computational overhead by introducing a stochastic MCL (sMCL) variant that does not require retraining. They compute the posterior $p(w \mid x, y)$ in the E-step by choosing the best model for an example, $\hat{k} = \arg\max_{k \in [K]} p_\theta(y \mid x, w_k = 1, w_{\neg k} = 0)$. Setting $w_{\hat{k}}$ to 1 and all other entries in $w$ to 0 achieves a hard segmentation for this point. After this assignment, only the model $\hat{k}$ with the minimal negative log-likelihood is updated in the M-step. A potential downside of this approach is the linear increase in complexity, since a forward pass has to be repeated for each model.
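The hard assignment and winner-only update of sMCL can be illustrated on a toy problem. In this sketch each "model" is a single float with a squared-error loss, standing in for a full S2S model and its negative log-likelihood; the function name, learning rate, data, and initial values are all hypothetical.

```python
def smcl_epoch(models, data, lr=0.1):
    """One pass of stochastic multiple-choice learning over toy 1-D data.

    Each model is a float m with loss (y - m) ** 2 on a data point y.
    """
    assignments = []
    for y in data:
        # Forward pass through all K models (the linear overhead noted above).
        losses = [(y - m) ** 2 for m in models]
        # E-step: hard assignment to the model with the smallest loss.
        best = min(range(len(models)), key=losses.__getitem__)
        # M-step: gradient step on the winning model only.
        models[best] += lr * 2.0 * (y - models[best])
        assignments.append(best)
    return assignments

# Two "templates" hidden in the data: values near 0 and values near 10.
data = [0.0, 10.0, 1.0, 9.0, 0.5, 9.5]
models = [2.0, 8.0]  # hypothetical initial parameters, K = 2
for _ in range(20):
    assignments = smcl_epoch(models, data)
```

After a few passes each toy model specializes on one cluster and never receives updates from the other, which mirrors how the mixture partitions training examples between sentence templates.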
We illustrate the process of a single forward pass in Figure 2, in which a model $f_i$ has the smallest loss $L_i$ and is thus updated. Figure 3 demonstrates an example with $K = 2$ in which the two models generate text according to two different sentence layouts. We find that averaging predictions of multiple models during inference, a technique commonly used with traditional ensembling approaches, does not lead to increased performance. We further confirm findings by Lee et al. (2017), who state that these models overestimate their confidence when generating text. Since it is our goal to train a model that learns the best [...]
For all LSTM-based S2S models, we use a two-layer bidirectional LSTM encoder, and hidden and embedding sizes of 750. During training, we apply dropout with probability 0.2 and train models with Adam (Kingma and Ba, 2014) and an initial learning rate of 0.002. We evaluate both mlp and dot attention types. The Transformer model has 4 layers with hidden and embedding sizes of 512. We use the learning rate schedule described by Vaswani et al. (2017), using Adam and a maximum learning rate of 0.1 after 2,000 warmup steps. The diverse ensembling technique is applied to all approaches, pre-training all models for 4 epochs and then activating the sMCL loss. All models are implemented in OpenNMT-py (Klein et al., 2017). The parameters were found by grid search starting from the parameters used in the TGEN model by Dušek and Jurčíček (2016). Unless stated otherwise, models do not block repeated sentence beginnings, since doing so results in worse performance in automated metrics. We show results on the multi-reference validation and the blind test sets for the five metrics BLEU (Papineni et al., 2002), NIST (Doddington, 2002), METEOR (Denkowski and Lavie, 2014), ROUGE (Lin, 2004), and CIDEr (Vedantam et al., 2015). Table 2 shows the results of different models on the validation set.
During inference, we set the length penalty parameter α to 0.4, the coverage penalty parameter β to 0.1, and use beam search with a beam size of 10. Our models outperform all shown baselines, which represent all published results on this dataset to date. Except for the copy-only condition, the data-efficient dot outperforms mlp. Both copy-attention and diverse ensembling increase performance, and combining the two methods yields the highest BLEU and NIST scores across all conditions. The Transformer performs similarly to the vanilla S2S models, with a lower BLEU but higher ROUGE score. Diverse ensembling also increases the performance with the Transformer model, leading to the highest ROUGE score across all model configurations. Table 3 shows generated text from different models. We can observe that the model without copy attention omits the rating, and without ensembling, the sentence structure repeats and thus looks unnatural. With ensembling, both models produce sensible output with different sentence layouts. We note that often, only the better of the two models in the ensemble produces output better than the baselines. We further analyze how many attributes are omitted by the systems in Section 7.3.

Results on the Validation Set
To analyze the effect of length and coverage penalties, we show the average relative change across all metrics for model (8) while varying α and β in Figure 4. Both penalties increase average performance slightly, with an average increase of the scores by up to 0.82%. We find that recall-based metrics increase while the precision-based metrics decrease when applying the penalty, which can be explained by an increase in the average length of the generated text by up to 2.4 words. Results for ensembling variations of model (8) are shown in Table 4. While increasing K can lead to better template representations, every individual model will be trained on fewer data points. This can result in an increased generalization error. Therefore, we evaluate updating the top 2 models during the M-step and setting K=3. While increasing K from 2 to 3 does not show a major increase in performance when updating only one model, the K=3 approach slightly outperforms the K=2 one with the top 2 updates.
Having the K models fit completely disjoint data sets and use disjoint sets of parameters could be too strong a separation. Therefore, we investigate the effect of sharing a subset of the parameters between individual models. Our results in rows (5)-(7) of Table 4 show only a minor improvement in recall-based metrics when sharing the word embeddings between models, but at the cost of much lower BLEU and NIST scores. Sharing more parameters further harms the model's performance.

Results on the Blind Test Set
We next report results of experiments on a held-out test set, conducted by the E2E NLG challenge organizers (Dušek et al., 2018), shown in Table 5. The results show the validity of the approach, as our systems outperform competing systems on three of the metrics, ranking first in ROUGE and CIDEr and sharing the first rank in METEOR. The first row of the table shows the results with blocked repeated sentence beginnings. While this modification leads to slightly reduced scores on the automated metrics, it makes the text look more natural, and we thus use this output in the human evaluation.
[Table 3 fragment. Model numbers refer to Table 2.]
(label missing) Wildwood is a coffee shop providing English food in the moderate price range. It is near Ranch. Its customer rating is 3 out of 5.
(8).1 Wildwood is a moderately priced English coffee shop near Ranch. It has a customer rating of 3 out of 5.
(8).2 Wildwood is an English coffee shop near Ranch. It has a moderate price range and a customer rating of 3 out of 5.
[Table 5 caption fragment: ... (2018) and Su et al. (2018) were not evaluated on this set.]
The human evaluation compared the output to 19 other systems. For a single meaning representation, crowd workers were asked to rank output from five systems at a time. Separate ranks were collected for the quality and naturalness of the generations. The ranks for quality aim to reflect the grammatical correctness, fluency, and adequacy of the texts with respect to the structured input. In order to gather ranks for the naturalness, generations were shown without the meaning representation and rated based on how likely an utterance could have been produced by a native speaker. The results were then analyzed using the TrueSkill algorithm by Sakaguchi et al. (2014). The algorithm produced 5 clusters of systems for both quality and naturalness. Within clusters, no statistically significant difference between systems can be found. In both evaluations, our main system was placed in the second best cluster. One difference between our system and the system ranked first in quality by Juraska et al. (2018) [...]
[...] et al. (2018) add an explicit penalty for each attribute that is not part of a generated text, whereas we aim to implicitly reduce this number with the coverage penalty. To investigate the effectiveness of the model extensions, we apply a heuristic that matches an input with exact word matches in the generated text. This provides a lower bound to the number of generated attributes since paraphrases are not captured. We omit the familyFriendly category from this figure since it does not work with this heuristic. In Figure 5 (a) we show the cumulative effect of model extensions on generated attributes across all categories. Copy attention and the coverage penalty have a major effect on this number, while the ensembling only slightly improves it. In Figure 5 (b), we show a breakdown of the generated attributes per category.
The base model struggles with area, price range, and customer rating. Price range and customer rating are frequently paraphrased, for example by stating that a restaurant with a 4 out of 5 rating has a good rating, while the area cannot be rephrased. While customer rating is one of the most prevalent attributes in the data set, the other two are less common. The full model improves across almost all of the categories but still has problems with the price range. The only category in which it performs worse is the name category, which could be a side effect of the particular split of the data that the model learned. Despite the decrease in mistakenly omitted attributes, the model still misses up to 20% of attributes. We hope to address this issue in future work by explicitly modeling the underlying slots and penalizing models when they ignore them.
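The exact-match heuristic used for this analysis can be sketched as follows. The function names and the tokenization are assumptions of this illustration, and the meaning representation is a hypothetical input constructed to fit one of the generated texts shown above; it is not taken from the dataset.

```python
import re

def tokenize(text):
    """Lowercase and split into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_phrase(tokens, phrase):
    """True if phrase occurs as a contiguous run of whole words in tokens."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

def matched_attributes(mr, text, skip=("familyFriendly",)):
    """Attributes whose value appears word-for-word in the generated text.

    Paraphrased values are missed, so the result is a lower bound on the
    attributes a text actually expresses.
    """
    tokens = tokenize(text)
    return [
        attribute
        for attribute, value in mr.items()
        if attribute not in skip and contains_phrase(tokens, tokenize(value))
    ]

# Hypothetical MR for illustration only.
mr = {
    "name": "Wildwood",
    "eatType": "coffee shop",
    "food": "English",
    "priceRange": "moderate",
    "customer rating": "3 out of 5",
    "near": "Ranch",
    "familyFriendly": "yes",
}
text = ("Wildwood is a moderately priced English coffee shop near Ranch. "
        "It has a customer rating of 3 out of 5.")
found = matched_attributes(mr, text)
```

In this example the price range is expressed as "moderately priced" and is therefore not counted, which illustrates why the heuristic only gives a lower bound for frequently paraphrased attributes.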

Conclusion
In this paper, we have presented three contributions toward end-to-end models for data-to-text problems. We surveyed existing S2S modeling methods and extensions to improve content selection in the NLG problem. We further showed that applying diverse ensembling to model different underlying generation styles in the data can lead to a more robust learning process for noisy data. Finally, an empirical evaluation of the investigated methods showed that they lead to improvements across multiple automatic evaluation metrics. In future work, we aim to extend the shown methods to address generation from more complex inputs and for challenging domains such as data-to-document generation.