Can Neural Generators for Dialogue Learn Sentence Planning and Discourse Structuring?

Responses in task-oriented dialogue systems often realize multiple propositions whose ultimate form depends on the use of sentence planning and discourse structuring operations. For example, a recommendation may consist of an explicitly evaluative utterance, e.g. Chanpen Thai is the best option, along with content related by the justification discourse relation, e.g. It has great food and service, that combines multiple propositions into a single phrase. While neural generation methods integrate sentence planning and surface realization in one end-to-end learning framework, previous work has not shown that neural generators can: (1) perform common sentence planning and discourse structuring operations; (2) make decisions as to whether to realize content in a single sentence or over multiple sentences; (3) generalize sentence planning and discourse relation operations beyond what was seen in training. We systematically create large training corpora that exhibit particular sentence planning operations and then test neural models to see what they learn. We compare models without explicit latent variables for sentence planning with ones that provide explicit supervision during training. We show that only the models with additional supervision can reproduce sentence planning and discourse operations and generalize to situations unseen in training.


Introduction
Neural natural language generation (NNLG) promises to simplify the process of producing high quality responses for conversational agents by relying on the neural architecture to automatically learn how to map an input meaning representation (MR) from the dialogue manager to an output utterance (Gašić et al., 2017;Sutskever et al., 2014).
For example, Table 1 shows sample training data for an NNLG with a MR for a restaurant named ZIZZI, along with three reference realizations, that should allow the NNLG to learn to realize the MR as either 1, 3, or 5 sentences.

Table 1: Three reference realizations of the same MR, in 1, 3, and 5 sentences.

1 Sent
Zizzi is moderately priced in riverside, also it isn't family friendly, also it's a pub, and it is an English place near Avalon.

3 Sents
Moderately priced Zizzi isn't kid friendly, it's in riverside and it is near Avalon. It is a pub. It is an English place.

5 Sents
Zizzi is moderately priced near Avalon. It is a pub. It's in riverside. It isn't family friendly. It is an English place.

In contrast, earlier models of statistical natural language generation (SNLG) for dialogue were based around the NLG architecture in Figure 1 (Rambow et al., 2001; Stent, 2002; Stent and Molina, 2009). Here the dialogue manager sends one or more dialogue acts and their arguments to the NLG engine, which then makes decisions about how to render the utterance using separate modules for content planning and structuring, sentence planning and surface realization (Reiter and Dale, 2000). The sentence planner's job includes:
• Sentence Scoping: deciding how to allocate the content to be expressed across different sentences;
• Aggregation: implementing strategies for removing redundancy and constructing compact sentences;
• Discourse Structuring: deciding how to express discourse relations that hold between content items, such as causality, contrast, or justification.
Sentence scoping (Table 1) affects the complexity of the sentences that compose an utterance, allowing the NLG to produce simpler sentences when desired, which may be easier for particular users to understand. Aggregation reduces redundancy, composing multiple content items into single sentences. Table 2 shows common aggregation operations (Cahill et al., 2001; Shaw, 1998). Discourse structuring is often critical in persuasive settings (Scott and de Souza, 1990; Moore and Paris, 1993), in order to express discourse relations that hold between content items. Table 3 shows how RECOMMEND dialogue acts can be included in the MR, and how content can be related with JUSTIFY and CONTRAST discourse relations.
Recent work in NNLG explicitly claims that training models end-to-end allows them to do both sentence planning and surface realization without the need for intermediate representations (Dusek and Jurcícek, 2016b;Lampouras and Vlachos, 2016;Mei et al., 2016;Wen et al., 2015;Nayak et al., 2017). To date, however, no-one has actually shown that an NNLG can faithfully produce outputs that exhibit the sentence planning and discourse operations in Tables 1, 2 and 3. Instead, NNLG evaluations focus on measuring the semantic correctness of the outputs and their fluency (Novikova et al., 2017;Nayak et al., 2017).
Here, we systematically perform a set of controlled experiments to test whether an NNLG can learn to do sentence planning operations. Section 2 describes our experimental setup and the NNLG architecture that allows us, during training, to vary the amount of supervision provided as to the sentence planning operations to be performed. Our training data is generated with the statistical generator PERSONAGE (Mairesse and Walker, 2011). To achieve sufficient control for some experiments, we exclusively use PERSONAGE training data, where we can specify exactly which sentence planning operations will be used and in what frequency. It is not possible to do this with crowdsourced data. While our expectation was that an NNLG can reproduce any sentence planning operation that appears frequently enough in the training data, the results in Sections 3, 4 and 5 show that explicit supervision improves the semantic accuracy of the NNLG, provides the capability to control variation in the output, and enables generalizing to unseen value combinations.

[Table 2 excerpt — example 6, Distributive: "The Mill is a coffee shop with a high rating with a high cost and it is an Italian restaurant near The Sorrento." becomes "The Mill is a coffee shop with a high rating and cost, also it is an Italian restaurant near The Sorrento."]

Model Architecture and Experimental Overview
Our experiments focus on sentence planning operations for: (1) sentence scoping, as in Table 1, where we experiment with controlling the number of sentences in the generated output; (2) distributive aggregation, as in Example 6 in Table 2, an aggregation operation that can compactly express a description when two attributes share the same value; and (3) discourse contrast, as in Example 8 in Table 3. Distributive aggregation requires learning a proxy for the semantic property of equality along with the standard mathematical distributive operation, while discourse contrast requires learning a proxy for semantic comparison, i.e. that some attribute values are evaluated as positive (inexpensive) while others are evaluated negatively (poor service), and that a successful contrast can only be produced when two attributes are on opposite poles (in either order), as defined in Figure 2.[1] Our goal is to test how well NNLG models can produce realizations of these sentence planning operations with varying levels of supervision, while simultaneously achieving high semantic fidelity. Figure 3 shows the general architecture, implemented in Tensorflow, based on TGen, an open-source sequence-to-sequence (seq2seq) neural generation framework (Abadi et al., 2015; Dusek and Jurcícek, 2016a).[2] The model uses seq2seq generation with attention (Bahdanau et al., 2014; Sutskever et al., 2014) with a sequence of LSTMs (Hochreiter and Schmidhuber, 1997) for encoding and decoding, along with beam search and an n-best reranker.
The input to the sequence-to-sequence model is a sequence of tokens x_t, t ∈ {0, . . . , n}, that represent the dialogue act and associated arguments. Each x_i is associated with an embedding vector w_i of some fixed length. Thus for each MR, TGen takes as input the dialogue acts representing system actions (recommend and inform acts) and the attributes and their values (for example, an attribute might be price range, and its value might be moderate), as shown in Table 1. The MRs (and resultant embeddings) are sorted internally by dialogue act tag and attribute name. For every MR in training, we have a matching reference text, which we delexicalize in pre-processing, then re-lexicalize in the generated outputs. The encoder reads all the input vectors and encodes the sequence into a vector h_n. At each time step t, it computes the hidden layer h_t from the input w_t and the hidden vector at the previous time step h_{t−1}, following: h_t = lstm(w_t, h_{t−1}). All experiments use a standard LSTM decoder. We test three different dialogue act and input vector representations, based on the level of supervision, as shown by the two input vectors in Figure 3: (1) models with no supervision, where the input vector simply consists of a set of inform or recommend tokens, each specifying an attribute and value pair; (2) models with a supervision token, where the input vector is supplemented with a new token (either period or distribute or contrast) to represent a latent variable that guides the NNLG to produce the correct type of sentence planning operation; and (3) models with semantic supervision, tested only on distributive aggregation, where the input vector is supplemented with specific instructions as to which attribute value to distribute over, e.g. low, average or high, in the DISTRIBUTE token.

[1] We also note that the evaluation of an attribute may come from the attribute itself, e.g. "kid friendly", or from its adjective, e.g. "excellent service".
[2] https://github.com/UFAL-DSG/tgen
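The flattening of an MR (plus an optional supervision token) into the encoder's input token sequence can be sketched as follows. The token spellings and the `mr_to_tokens` helper are illustrative assumptions for exposition, not TGen's actual input format:

```python
# Hypothetical sketch of how an MR and an optional supervision token can be
# flattened into a token sequence for the seq2seq encoder. Token spellings
# are illustrative only; TGen's real format differs.

def mr_to_tokens(dialogue_acts, supervision=None):
    """dialogue_acts: list of (act, attribute, value) triples.
    supervision: optional (token_name, value) pair, e.g. ("PERIOD", "3")."""
    # Sort internally by dialogue act tag and attribute name, as in the paper.
    triples = sorted(dialogue_acts, key=lambda t: (t[0], t[1]))
    tokens = []
    if supervision is not None:
        tokens.append("{}={}".format(*supervision))
    for act, attr, value in triples:
        tokens.append("{}({}={})".format(act, attr, value))
    return tokens

tokens = mr_to_tokens(
    [("inform", "name", "Zizzi"), ("inform", "price_range", "moderate")],
    supervision=("PERIOD", "2"))
```

Each resulting token would then be mapped to its embedding vector w_i before encoding.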
We describe the specific model variations for each experiment below.

Data Sets. One challenge is that NNLG models are highly sensitive to the distribution of phenomena in training data, and our previous work has shown that the outputs of NNLG models exhibit less stylistic variation than their training data (Oraby et al., 2018b). Moreover, even large corpora, such as the 50K E2E Generation Challenge corpus, may not contain particular stylistic variations. For example, out of 50K crowdsourced examples in the E2E corpus, there are 1,956 examples of contrast with the operator "but". There is only 1 instance of distributive aggregation, because attribute values are rarely lexicalized identically in E2E. To ensure that the training data contains enough examples of particular phenomena, our experiments combine crowdsourced E2E data[3] with automatically generated data from PERSONAGE (Mairesse and Walker, 2011).[4] This allows us to systematically create training data that exhibits particular sentence planning operations, or combinations of them. The E2E dataset consists of pairs of reference utterances and their meaning representations (MRs), where each utterance contains up to 8 unique attributes, and each MR has multiple references. We populate PERSONAGE with the syntax/meaning mappings that it needs to produce output for the E2E meaning representations, and then automatically produce a very large (204,955 utterance/MR pairs), systematically varied sentence planning corpus.[5]

Evaluation Metrics. It is well known that evaluation metrics used for translation, such as BLEU, are not well suited to evaluating generation outputs (Belz and Reiter, 2006; Liu et al., 2016; Novikova et al., 2017): they penalize stylistic variation, and don't account for the fact that different dialogue responses can be equally good and can vary due to contextual factors (Jordan, 2000; Krahmer et al., 2002). We also note that previous work on sentence planning has always assumed that sentence planning operations improve the quality of the output (Barzilay and Lapata, 2006; Shaw, 1998), while our primary focus here is to determine whether an NNLG can be trained to perform such operations while maintaining semantic fidelity. Moreover, due to the large size of our controlled training sets, we observe few problems with output quality and fluency.

[3] http://www.macs.hw.ac.uk/InteractionLab/E2E/
[4] Source code for PERSONAGE was provided by François Mairesse.
[5] We make available the sentence planning for NLG corpus at: nlds.soe.ucsc.edu/sentence-planning-NLG.
Thus we leave an evaluation of fluency and naturalness to future work, and focus here on evaluating the multiple targets of semantic accuracy and sentence planning accuracy. Because the MR is clearly defined, we define scripts (information extraction patterns) to measure the occurrence of the MR attributes and their values in the outputs. We then compute the Slot Error Rate (SER) using a variant of word error rate: SER = (D + H) / N, where D is the number of deletions (slots from the MR missing in the output), H is the number of hallucinations (slots realized in the output but absent from the MR), and N is the number of slots in the input MR.
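As a minimal sketch of this metric, slots can be compared as attribute–value pairs; treating missing slots as deletions and extra slots as hallucinations is an assumption here, and the exact categories counted by the extraction scripts may differ:

```python
def slot_error_rate(mr_slots, realized_slots):
    """Sketch of SER: deletions are MR slots missing from the output,
    hallucinations are realized slots absent from the MR; both are
    normalized by the number of slots in the input MR."""
    mr, realized = set(mr_slots), set(realized_slots)
    deletions = len(mr - realized)        # slots the generator dropped
    hallucinations = len(realized - mr)   # slots the generator invented
    return (deletions + hallucinations) / len(mr)

# A 4-slot MR where one slot value is altered: the altered slot counts
# once as a deletion and once as a hallucination under this sketch.
ser = slot_error_rate(
    {("price", "low"), ("rating", "high"),
     ("area", "riverside"), ("food", "English")},
    {("price", "low"), ("rating", "high"),
     ("area", "riverside"), ("food", "Italian")})
```

A perfect realization of every slot yields an SER of 0.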
We also define scripts for evaluating the accuracy of the sentence planner's operations. We check whether: (1) the output has the right number of sentences; (2) attributes with equal values are realized using distributive aggregation, and (3) discourse contrast is used when semantically appropriate. Descriptions of each experiment and the results are in Section 3, Section 4, and Section 5.
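Simplified stand-ins for two of these checks might look as follows; the regular expression for distributive aggregation is a hypothetical pattern covering only the price/rating phrasings discussed in this paper, not the actual evaluation script:

```python
import re

def period_count_correct(output_text, target_periods):
    """Check (1): does the output realize the requested number of
    sentences? Sentences are approximated by counting periods."""
    return output_text.count(".") == target_periods

# Hypothetical pattern for check (2): a shared value word followed by two
# attribute names joined by "and", e.g. "high rating and price".
DISTRIB_RE = re.compile(
    r"\b(low|average|high)\s+(cost|price|rating)\s+and\s+(cost|price|rating)\b")

def distributes(output_text):
    """Check (2): were attributes with equal values realized distributively?"""
    return bool(DISTRIB_RE.search(output_text.lower()))
```

For example, "high rating and price" passes the distribution check, while the redundant "high cost and high rating" does not.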

Sentence Scoping Experiment
To test whether it is possible to control basic sentence scoping with an NNLG, we experiment first with controlling the number of sentences in the generated output, as measured using the period operator (see Table 1). We experiment with two different models:
• No Supervision: no additional information in the MR (only attributes and their values);
• Period Count Supervision: an additional supervision token, PERIOD, specifying the number of periods (i.e. the number of sentences) to be used in the output realization.
For sentence scoping, we construct a training set of 64,442 output/MR pairs and a test set of 398 output/MR pairs, where the reference utterances for the outputs are generated from PERSONAGE. Table 4 shows the number of training instances for each MR size for each period count. The right frontier of the table shows that there are low frequencies of training instances where each proposition in the MR is realized in its own sentence (Period = number of MR attrs − 1). The lower left-hand side of the table shows that as the MRs get longer, there are lower frequencies of utterances with Period=1.

We start with the default TGen parameters and monitor the losses on Tensorboard on a subset of 3,000 validation instances from the 64K training set. The best settings use a batch size of 20, with a minimum of 5 epochs and a maximum of 20 (with early stopping based on validation loss). We generate outputs on the test set of 398 MRs.

Sentence Scoping Results. Table 5 shows the accuracy of both models in terms of the counts of the output utterances that realize the MR attributes in the specified number of sentences. In the case of NOSUP, we compare the number of sentences in the generated output to those in the corresponding test reference, and for PERIODCOUNT, we compare the number of sentences in the generated output to the number of sentences we explicitly encode in the MR. The table shows that the NOSUP setting fails to output the correct number of sentences in most cases (only 22% accuracy), but the PERIODCOUNT setting makes only 2 mistakes, demonstrating almost perfect control of the number of output sentences with the single-token supervision. We also show correlation levels with the gold-standard references (all correlations significant at p ≤ 0.01).

Generalization Test.
We carry out an additional experiment to test generalization of the PERIODCOUNT model, where we randomly select a set of 31 MRs from the test set, then create a test instance for each possible PERIOD count value, from 1 to N−1, where N is the number of attributes in that MR (i.e. PERIOD=1 means all attributes are realized in the same sentence, and PERIOD=N−1 means that each attribute is realized in its own sentence, except for the restaurant name, which is never realized in its own sentence). This yields 196 MR and reference pairs. This experiment results in an 84% accuracy (with a correlation of 0.802 with the test refs, p ≤ 0.01). When analyzing the mistakes, we observe that all of the scoping mistakes the model makes (31 in total) are cases where PERIOD=N−1. These cases correspond to the right frontier of Table 4, where there were fewer training instances. Thus, while the period supervision improves the model, it still fails on cases where there were few instances in training.

Complexity Experiment. We performed an additional sentence scoping experiment where we specified a target sentence complexity instead of a target number of sentences, since this may correspond more intuitively to a notion of reading level or sentence complexity, where the assumption is that longer sentences are more complex (Howcroft et al., 2017; Siddharthan et al., 2004). We used the same training and test data, but labeled each reference as either high, medium or low complexity. The number of attributes in the MR does not include the name attribute, since that is the subject of the review. A reference was labeled high when there are > 2 attributes per sentence, medium when the number of attributes per sentence is > 1.5 and ≤ 2, and low when there are ≤ 1.5 attributes per sentence.
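The complexity labeling rule amounts to simple thresholding of the attributes-per-sentence ratio; the function name below is illustrative:

```python
def complexity_label(num_attrs, num_sentences):
    """Label a reference by its attributes-per-sentence ratio, using the
    thresholds from the text; num_attrs excludes the name attribute
    (which is the subject of the review)."""
    ratio = num_attrs / num_sentences
    if ratio > 2:
        return "high"
    if ratio > 1.5:
        return "medium"
    return "low"
```

For example, 6 attributes realized in 2 sentences (3 per sentence) is labeled high, while 3 attributes in 2 sentences (1.5 per sentence) is labeled low.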
This experiment results in 89% accuracy. Most of the errors occur when the labeled complexity was medium. This is most likely because there is often only one sentence difference between the two complexity labels. This indicates that sentence scoping can be used to create references with either exactly the number of sentences requested or categories of sentence complexity.

Distributive Aggregation Experiment
Aggregation describes a set of sentence planning operations that combine multiple attributes into single sentences or phrases. We focus here on distributive aggregation, as defined in Figure 2 and illustrated in Row 6 of Table 2 (schematically, "X has Y, also it has Z" becomes, under DISTRIB, "X has Y and Z"). In an SNLG setting, the generator achieves this type of aggregation by operating on syntactic trees (Shaw, 1998; Scott and de Souza, 1990; Stent et al., 2004; Walker et al., 2002b). In an NNLG setting, we hope the model will induce the syntactic structure and the mathematical operation underlying it automatically, without explicit training supervision.
To prepare the training data, we limit the values for the PRICE and RATING attributes to LOW, AVERAGE, and HIGH. We reserve the combination {PRICE=HIGH, RATING=HIGH} for test, leaving two combinations of values where distribution is possible ({PRICE=LOW, RATING=LOW} and {PRICE=AVERAGE, RATING=AVERAGE}). We then use all three values in MRs where the price and rating are not the same, e.g. {PRICE=LOW, RATING=HIGH}. This ensures that the model does see the value HIGH in training, but never in a setting where distribution is possible. We always distribute when possible, so every MR where the values are the same uses distribution. All other opportunities for aggregation, in the same sentence or in other training sentences, use the other aggregation operations defined in PERSONAGE as specified in Table 6, with equal probability.

The aggregation training set contains 63,690 total instances, with 19,107 instances for each of the two combinations that can distribute, and 4,246 instances for each of the six combinations that can't distribute. The test set contains 408 MRs: 288 specify distribution over HIGH (which we note is not a setting seen in train, and explicitly tests the models' ability to generalize), 30 specify distribution over AVERAGE, 30 over LOW, and 60 are examples that do not require distribution (NONE). We test whether the model will learn the equality relation independent of the value (HIGH vs. LOW), and thus realize the aggregation with HIGH. The distributive aggregation experiment is based on three different models:
• No Supervision: no additional information in the MR (only attributes and their values);
• Binary Supervision: we add a supervision token, DISTRIBUTE, containing a binary 0 or 1 indicating whether or not the corresponding reference text contains an aggregation operation over the attributes price range and rating;
• Semantic Supervision: we add a supervision token, DISTRIBUTE, containing a string that is either none if there is no aggregation over price range and rating in the corresponding reference text, or a value of LOW, AVERAGE, or HIGH for aggregation.
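The three supervision schemes differ only in the DISTRIBUTE token added to the MR; schematically (the token spellings here are illustrative, not the exact training strings):

```python
def distribute_token(price_range, rating, semantic=False):
    """Build the DISTRIBUTE supervision token for one MR: distribution is
    possible only when price range and rating share the same value.
    Binary supervision emits 0/1; semantic supervision emits the shared
    value itself (or "none")."""
    same = price_range == rating
    if semantic:
        return "DISTRIBUTE=" + (price_range if same else "none")
    return "DISTRIBUTE=" + ("1" if same else "0")
```

Under this scheme, the held-out test combination {PRICE=HIGH, RATING=HIGH} yields DISTRIBUTE=1 (binary) or DISTRIBUTE=high (semantic), even though no distributable HIGH example is seen in training.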
As above, we start with the default TGen parameters and monitor the losses on Tensorboard on a subset of 3,000 validation instances from the 63,690-instance training set. The best settings use a batch size of 20, with a minimum of 5 epochs and a maximum of 20 epochs with early stopping.

Distributive Aggregation Results. Table 7 shows the accuracy of each model overall on all 4 values, as well as the accuracy specifically on HIGH, the only distribution value unseen in train. Model NOSUP has a low overall accuracy, and is completely unable to generalize to HIGH, which is unseen in training. It is frequently able to use the HIGH value, but is not able to distribute (generating output like high cost and cost). Model BINARY is by far the best performing model, with an almost perfect accuracy (it is able to distribute over LOW and AVERAGE perfectly), but makes some mistakes when trying to distribute over HIGH; specifically, while it is always able to distribute, it may use an incorrect value (LOW or AVERAGE). Whenever BINARY correctly distributes over HIGH, it interestingly always selects the attribute RATING before COST, realizing the output as high rating and price. Also, BINARY is consistent even when it incorrectly uses the value LOW instead of HIGH: it always selects the attribute price before rating. To our surprise, Model SEMANTIC does poorly, with 36% overall accuracy, and only 9% accuracy on HIGH.

[Table 8 excerpt: "Xname is a low customer rated coffee shop offering xcuisine food in the xlocation." / "Yes, it is child friendly, but the price range is more than $30."]

Discourse Contrast Experiment
Persuasive settings such as recommending restaurants, hotels or travel options often have a critical discourse structure (Scott and de Souza, 1990; Moore and Paris, 1993; Nakatsu, 2008). For example, a recommendation may consist of an explicitly evaluative utterance, e.g. Chanpen Thai is the best option, along with content related by the JUSTIFY discourse relation, e.g. It has great food and service, as in Table 3. Our experiments focus on discourse CONTRAST.
We developed a script to find contrastive sentences in the 40K E2E training set by searching for any instance of a contrast cue word, such as but, although, and even if. This identified 3,540 instances. While this data size is comparable to the 3–4K instances used in prior work (Wen et al., 2015; Nayak et al., 2017), we anticipated that it might not be enough data to properly test whether an NNLG can learn to produce discourse contrast. We were also interested in testing whether synthetic data would improve the ability of the NNLG to produce contrastive utterances while maintaining semantic fidelity. Thus we used PERSONAGE with its native database of New York City restaurants (NYC) to generate an additional 3,500 examples of one form of contrast using only the discourse marker but, which are most similar to the examples in the E2E data. Table 8 illustrates both PERSONAGE and E2E contrast examples. While PERSONAGE also contains JUSTIFICATIONS, which could possibly confuse the NNLG, it offers many more attributes that can be contrasted and thus more unique instances of contrast. We create 4 training datasets with contrast data in order to systematically test the effect of the combined training set. Table 9 provides an overview of the training sets, with their rationales below.

3K Training Set. This dataset consists of all instances of contrast in the E2E training data, i.e. 3,540 E2E references.

7K Training Set. We created a training set of 7K references by supplementing the E2E contrastive references with an equal number of PERSONAGE references.

11K Training Set. Since 7K is smaller than desirable for training an NNLG, we created several additional training sets with the aim of helping the model learn to correctly realize domain semantics while still being able to produce contrastive utterances. We thus added an additional 4K of crowdsourced E2E data that was not contrastive to our training data, for a total of 11,065 references. See Table 9.

21K Training Set.
We created an additional larger training set by adding more E2E data, again to test the effect of increasing the size of the training set on realization of domain semantics, without a significant decrease in our ability to produce contrastive utterances. We added an additional 14K E2E references, for a total of 21,065. See Table 9.
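The cue-word search used to mine the contrastive E2E references can be sketched as follows; the cue list shows only the three cues named above, while the actual script may include more:

```python
import re

# Contrast cue words named in the text; the real script's list may be longer.
CONTRAST_CUES = ["but", "although", "even if"]
CUE_RE = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRAST_CUES) + r")\b",
    re.IGNORECASE)

def find_contrastive(references):
    """Return the subset of reference texts containing a contrast cue word."""
    return [r for r in references if CUE_RE.search(r)]
```

Word boundaries (`\b`) keep the search from matching cue words embedded in longer words (e.g. "butter").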
We perform two experiments with the 21K training set. First we trained on the MR and reference exactly as we had done for the 7K and 11K training sets. The second experiment added a contrast token during training time with values of either 1 (contrast) or 0 (no contrast) to test if that would achieve better control of contrast.
Contrast Test Sets. To have a potential for contrast, there must be an attribute with a positive value and another attribute with a negative value in the same MR. We constructed 3 different test sets, two for E2E and one for NYC. We created a delexicalized version of the test set used in the E2E generation challenge. This resulted in a test set of 82 MRs, of which only 25 could support contrast (E2E TEST). In order to allow for a better test of contrast, we constructed an additional test set of 500 E2E MRs, all of which could support contrast (E2E CONTRAST TEST). For the NYC test, which provides many opportunities for contrast, we created a dataset of 785 MRs that were different than those seen in training (NYC TEST). At test time, in the 21K contrast token experiment, we utilize the contrast token as we did in training.

Contrast Results. Results are given for the E2E test set in Table 10, the E2E contrast test set in Table 11, and the NYC test set in Table 12. Training on only the contrast examples gets a correct contrast of .41 but a much higher slot error rate. Interestingly, the 11K dataset is much better than the 3K for contrast correct, suggesting a positive effect for the automatically generated contrast examples along with more E2E training data. The 21K set without the contrast token does not attempt contrast since the frequency of contrast data is low, but with the CONTRAST token, it attempts contrast every time it is possible (25/25 instances).
In Table 11, with only contrast data, we see similar trends, with the lowest slot error rate (.16) and highest correct contrast (.75) ratios for the experiment with token supervision on 21K. Again, we see much better performance from the 11K set than the 3K and 7K in terms of slot error and correct contrast, indicating that more training data (even if that data does not contain contrast) helps the model. As before, we see very low contrast attempts with 21K without supervision, with a huge increase in the number of contrast attempts when using token supervision (422/500). Table 12 also shows large performance improvements from the use of the CONTRAST token supervision for the NYC test set, again with improvements for the 21K CONTRAST in both slot error rate and correct contrast. Interestingly, while we get the highest correct contrast ratio of .85 with 21K CONTRAST, we actually see fewer contrast attempts, showing that the most explicitly supervised model becomes more selective when deciding when to contrast. When training on the 7K dataset, the neural model always produces a contrastive utterance for the NYC MRs (all the NYC data is contrastive). Although it never sees any NYC non-contrastive MRs, the additional E2E training data allows it to improve its ability to decide when to contrast (row 21K CONTRAST), as well as improving the slot error rate in the final experiment.
Related Work

Much of the previous work on sentence planning was done in the framework of statistical NLG, where each module was assumed to require training data that matched its representational requirements. Methods focused on training individual modules for content selection and linearization (Marcu, 1997; Lapata, 2003; Barzilay and Lapata, 2005), and on trainable sentence planning for discourse structure and aggregation operations (Stent and Molina, 2009; Walker et al., 2007; Paiva and Evans, 2004; Sauper and Barzilay, 2009; H. Cheng and Mellish, 2001). Previous work also explored statistical and hybrid methods for surface realization (Langkilde and Knight, 1998; Bangalore and Rambow, 2000; Oh and Rudnicky, 2002) and text-to-speech realization (Hitzeman et al., 1998; Bulyko and Ostendorf, 2001; Hirschberg, 1993).
Other work on NNLG also uses token supervision and modifications of the architecture in order to control stylistic aspects of the output in the context of text-to-text or paraphrase generation. Some types of stylistic variation correspond to sentence planning operations, e.g. to express a particular personality type (Oraby et al., 2018b; Mairesse and Walker, 2011; Oraby et al., 2018a), or to control sentiment and sentence theme (Ficler and Goldberg, 2017). Herzig et al. (2017) automatically label the personality of customer care agents and then control the personality during generation. Rao and Tetreault (2018) train a model to paraphrase from formal to informal style, and Niu and Bansal (2018) use a high-precision classifier and a blended language model to control utterance politeness.
Previous work on contrast has explored how the user model determines which values should be contrasted, since people may have differing opinions about whether an attribute value is positive or negative (e.g. family friendly) (Carenini and Moore, 1993;Walker et al., 2002a;White et al., 2010). To our knowledge, no-one has yet trained an NNLG to use a model of user preferences for content selection. Here, values are treated as inherently good or bad, e.g. service is ranked from great to terrible.

Discussion and Conclusion
This paper presents detailed, systematic experiments to test the ability of NNLG models to produce complex sentence planning operations for response generation. We create new training and test sets designed specifically for testing sentence planning operations for sentence scoping, aggregation and discourse contrast, and train novel models with increasing levels of supervision to examine how much information is required to control neural sentence planning. The results show that the models benefit from extra latent variable supervision, which improves the semantic accuracy of the NNLG, provides the capability to control variation in the output, and enables generalizing to unseen value combinations.
In future work we plan to test these methods in different domains, e.g. the WebNLG challenge or WikiBio dataset (Wiseman et al., 2018;Colin et al., 2016). We also plan to experiment with more complex sentence planning operations and test whether an NNLG system can be endowed with fine-tuned control, e.g. controlling multiple aggregation operations. Another possibility is that hierarchical input representations representing the sentence plan might improve performance or allow finer-grained control (Moore et al., 2004;Su and Chen, 2018;Bangalore and Rambow, 2000). It may be desirable to control which attributes are aggregated together, distributed or contrasted, and to allow more than two values to be contrasted.
Here, our main goal was to test the ability of different neural architectures to learn particular sentence planning operations that have been used in previous work in SNLG. Because we don't make claims about fluency or naturalness, we did not evaluate these with human judgements. Instead, we focused our evaluation on automatic assessment of semantic fidelity, and the extent to which the neural architecture could reproduce the desired sentence planning operations. In future work, we hope to quantify the extent to which human subjects prefer the outputs where the sentence planning operations have been applied.

Acknowledgments
This work was supported by NSF Cyberlearning EAGER grant IIS 1748056 and NSF Robust Intelligence IIS 1302668-002 as well as an Amazon Alexa Prize Gift 2017 and Grant 2018 awarded to the Natural Language and Dialogue Systems Lab at UCSC.