Constrained Decoding for Neural NLG from Compositional Representations in Task-Oriented Dialogue

Generating fluent natural language responses from structured semantic representations is a critical step in task-oriented conversational systems. Avenues like the E2E NLG Challenge have encouraged the development of neural approaches, particularly sequence-to-sequence (Seq2Seq) models for this problem. The semantic representations used, however, are often underspecified, which places a higher burden on the generation model for sentence planning, and also limits the extent to which generated responses can be controlled in a live system. In this paper, we (1) propose using tree-structured semantic representations, like those used in traditional rule-based NLG systems, for better discourse-level structuring and sentence-level planning; (2) introduce a challenging dataset using this representation for the weather domain; (3) introduce a constrained decoding approach for Seq2Seq models that leverages this representation to improve semantic correctness; and (4) demonstrate promising results on our dataset and the E2E dataset.


Introduction
Generating fluent natural language responses from structured semantic representations is a critical step in task-oriented conversational systems. With their end-to-end trainability, neural approaches to natural language generation (NNLG), particularly sequence-to-sequence (Seq2Seq) models, have been promoted with great fanfare in recent years (Wen et al., 2015, 2016; Mei et al., 2016; Kiddon et al., 2016; Dušek and Jurcicek, 2016), and avenues like the recent E2E NLG challenge (Dušek et al., 2018, 2019) have made available large datasets to promote the development of these models. Nevertheless, current NNLG models arguably remain inadequate for most real-world task-oriented dialogue systems, given their inability to (i) reliably perform common sentence planning and discourse structuring operations (Reed et al., 2018), (ii) generalize to complex inputs (Wiseman et al., 2017), and (iii) avoid generating texts with semantic errors including hallucinated content (Dušek et al., 2018, 2019).
In this paper, we explore the extent to which these issues can be addressed by incorporating lessons from pre-neural NLG systems into a neural framework. We begin by arguing in favor of enriching the input to neural generators to include discourse relations, long taken to be central in traditional NLG, and underscore the importance of exerting control over these relations when generating text, particularly when using user models to structure responses. In a closely related work, Reed et al. (2018) add control tokens (to indicate contrast and sentence structure) to a flat input meaning representation (MR), and show that these can be effectively used to control structure. However, their methods are only able to control the presence or absence of these relations, without more fine-grained control over their structure.
* Alphabetical by first name. † Work done while on leave from Ohio State University.
We thus go beyond their approach and propose using full tree structures as inputs, and generating tree-structured outputs as well. This allows us to define a novel method of constrained decoding for standard sequence-to-sequence models for generation, which helps ensure that the generated text contains all and only the specified content, as in classic approaches to surface realization.
On the E2E dataset, our experiments demonstrate much better control over CONTRAST relations than using Reed et al.'s method, and also show improved diversity and expressiveness over standard baselines. We also release a new dataset of responses in the weather domain, which includes the JUSTIFY, JOIN and CONTRAST relations, and where discourse-level structures come into play. On both E2E and weather datasets, we show that constrained decoding over our enriched inputs results in higher semantic correctness as well as better generalizability and data efficiency.
The rest of this paper is organized as follows: Section 2 describes the motivation for using compositional inputs organized around discourse relations. Section 3 explains our data collection approach and dataset (see footnote 2). Section 4 shows how to incorporate compositional inputs into NNLG and describes our constrained decoding algorithm. Section 5 presents our experimental setup and results.
2 Towards More Expressive Meaning Representations

Limitations of Flat MRs
In the E2E dataset, meaning representations (MRs) are a flat list of key-value pairs, where each key is a slot name that needs to be mentioned, and the value is the value of that slot (see Table 1). In Wen et al. (2015), MRs have a similar structure, and additionally contain information about the dialog act that needs to be conveyed (REQUEST, INFORM, etc.). These MRs are sufficient to capture basic semantic information, but fail to capture rhetorical (or discourse) relations, like CONTRAST, that have long been taken to be central to generating coherent discourse in traditional NLG (Mann and Thompson, 1988; Moore and Paris, 1993; Reiter and Dale, 2000; Stent et al., 2002). The two references in Table 1 illustrate this problem with the expressiveness of such flat MRs. Critical discourse information, like whether two attributes should be contrasted (or whether to justify a recommendation, etc.), is not captured by the MR. This poses a dual challenge. First, since the MR does not specify these discourse relations, crowdworkers creating the dataset in turn have no instructions on when to use them, and must thus use their own judgment in creating a natural-sounding response. While the E2E organizers tout the resulting response variations as a plus, Reed et al. (2018) find that current neural systems are unable to learn to express discourse relations effectively with this dataset, and explore ways of enriching input MRs to do so. Indeed, now that the E2E system outputs have been released, a search through outputs from all participating systems reveals only 43 outputs (0.4% out of 10080) containing contrastive tokens, on a test set containing about 300 contrastive samples.
Footnote 2: The datasets and implementations can be found at https://github.com/facebookresearch/TreeNLG.
Second, going beyond Reed et al., we argue that the controllability of these relations through MRs is desirable in live conversational systems, where external knowledge like user models may inform decisions around contrast, grouping, or justifications. While several studies have shown that controlling such discourse behaviors can be critical to user perceptions of quality and naturalness (Lemon et al., 2004; Carenini and Moore, 2006; Walker et al., 2007; White et al., 2010; Demberg et al., 2011), flat MRs provide no means to do so. This leaves it to the neural model to learn general trends in the data, such as contrasting a good attribute like a 5-star rating with a typically dispreferred attribute like not being family friendly or serving English food. However, sometimes people are interested in adult-oriented establishments, and some people may even like English food; for users with these preferences, text generated according to general trends will be incoherent. For example, for a user known to be seeking an adult-oriented locale, Ref. 1 in Table 1 would be incoherent, and less preferable than a non-contrastive alternative such as JJ's Pub is a highly-rated restaurant for adults near the Crowne Plaza Hotel.

Tree-Structured MRs
In order to overcome these challenges, we propose the use of structured meaning representations like those explored widely in (hybrid) rule-based NLG systems (Rambow et al., 2001; Reiter and Dale, 2000; Walker et al., 2007). Our representation consists of three parts:
1. Argument can be any entity or slot mentioned in a response, like the name of a restaurant or the date. Some arguments can be complex and contain sub-arguments (e.g. a date_time argument has subfields like week_day and month).
2. Dialog act is an atomic unit that could correspond linguistically to a single clause. A dialog act can contain one or more arguments that need to be expressed. Examples: INFORM, YES, RECOMMEND.
3. Discourse relation defines the relationships between dialog acts. A single discourse relation may contain multiple other dialog acts or discourse relations, allowing for potentially arbitrary degrees of nesting. Examples: JOIN, JUSTIFY, CONTRAST.
A meaning representation that uses this formulation can consist of an arbitrary number and combination of discourse relations and dialog acts, resulting in a nested tree-structured MR with much higher expressiveness and specificity. Table 1, seen earlier, shows an example of an MR structured in this way, as well as the corresponding "flat" MR and its reference in the E2E dataset.
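As a rough illustration of the three-part representation above, the following Python sketch models an MR as a recursive node type. The class name `Node`, the helper `depth`, and the example slot names are assumptions made for illustration; they are not taken from the released implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a tree-structured MR: a discourse relation, dialog act, or argument."""
    label: str                       # e.g. "CONTRAST", "INFORM", "name"
    value: Optional[str] = None      # only leaf arguments carry a value
    children: List["Node"] = field(default_factory=list)


def depth(node: Node) -> int:
    """Number of levels in the subtree rooted at `node`."""
    return 1 + max((depth(child) for child in node.children), default=0)


# Hypothetical MR: contrast a good rating with a venue that is not family friendly.
mr = Node("CONTRAST", children=[
    Node("INFORM", children=[Node("name", "JJ's Pub"), Node("rating", "5 out of 5")]),
    Node("INFORM", children=[Node("familyFriendly", "no")]),
])
```

Because discourse relations can contain other discourse relations, the same type covers arbitrarily nested MRs; a flat MR corresponds to the special case of a single dialog act with argument children.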
In addition to improved expressiveness, this representation results in more atomic definitions of dialog acts and arguments than in flat MRs. For example, consider the example in the weather domain from Table 2: The response contains multiple dialog acts, a contrast and several instances of ellipsis and grouping (i.e., temperatures are grouped and mentioned separately from wind condition). Additionally, some arguments, like date_time, occur multiple times in the response and correspond to different dialog acts, with several different values. A flat MR will struggle to represent (1) the correspondence of arguments to dialog acts; (2) what attributes to group and contrast; and (3) semantic equivalence of arguments like date_time1 and date_time2. On the other hand, our MRs ease discourse-level learning and encourage reuse of arguments across multiple dialog acts.

Dataset
With this representation in mind, we created an ontology of dialog acts, discourse relations, and arguments for the weather domain. Our motivation for choosing the weather domain, as explored in Liang et al. (2009), is that this domain offers significant complexity for NLG. Weather forecast summaries in particular can be very long, and require reasoning over several disjoint pieces of information. In this work, we focused on collecting a dataset that showcases the complexity of weather summaries over date/time ranges. Our weather dataset is also unique in that it was collected in a conversational setup (see below).
We collected our dataset in multiple stages:
1. Query collection. We asked crowdworkers to come up with sample queries in the weather domain, like What's the weather like tomorrow? and Do I need an umbrella tonight?
2. Query annotation. We then wrote rules to automatically parse these queries, and extract key pieces of information, like the location, date, and any attributes that the user specifically requested in the question.
3. MR generation. Our goal was to create MRs that are sufficiently expressive and straightforward to create automatically in a practical system. In the weather domain, it's conceivable that the NLG system has access to a weather API that provides it with detailed weather forecasts for the range requested by the user. To mimic this setting, we generated artificial weather forecasts for every user query based on the arguments (full argument set in Table 3) in the user query. We then created the tree-structured MR by applying a few different types of automatic rules, like adding CONTRAST to weather conditions that are in opposition. We add more details of our response generation method and the specific rules for MR creation in Appendix A and B.
4. Response generation and annotation. We presented these tree-structured MRs to trained annotators, and asked them to write responses that expressed the MRs. They were also given the user query and asked to make their responses natural given the query. They were allowed to elide information when arguments were repeated across dialog acts, and could choose the most appropriate surface forms for any arguments based on contextual clues (e.g. referring to a date as tomorrow, rather than April 24th, depending on the user's date). Finally, we asked them to label response spans corresponding to each argument, dialog act, and discourse relation in the MR.
5. Quality evaluation. Finally, we presented a different group of annotators with the annotated responses, and asked them to provide evaluations of fluency, correctness, naturalness, and annotation correctness.
Reference: It'll be sunny throughout this weekend. The high will be in the 60s, but expect temperatures to drop as low as 43 degrees by Sunday evening. There's also a chance of strong winds on Saturday morning.
Table 3: Ontology for the weather domain dataset that we collected. Arguments marked with * are nested arguments (see Table 4). [n] indicates arguments that have a corresponding not argument; [s] indicates arguments that have a corresponding summary.

Dataset statistics
Our final dataset has 33,493 examples. Each example comprises a user query, the synthetic user context (datetime and location), the tree-structured MR, the response, and a complete tree-structured annotation of the response. Table 6 contains an example from our dataset; as shown, the response annotation structure closely mirrors that of the MR itself. The MRs and responses in the dataset range from very simple (a single dialog act) to very complex (an MR with a depth and width of 4). A distribution of this complexity is shown in Table 5. The vocabulary size is 1485, and the max/average/min lengths of responses are 151/40.6/8. The dataset also poses several challenges in addition to syntactic and semantic complexity. As mentioned before, it has a rich set of referring expressions for dates and date ranges. It also contains user queries on which the written response was based, thus creating the opportunity for studies on improving naturalness or relevance with respect to the user query. These could be useful in particular for learning to express recommendations and justifications, as well as YES and NO dialog acts.
Our final training set contains 25,390 examples, with 11,879 unique MRs. (We consider two MRs to be identical if they have the same delexicalized tree structure; see Section 4.1.) The test set contains 3,121 examples, of which 1.1K (35%) have unique MRs that have never been seen in the training set.

Enriched E2E Dataset
We also used heuristic techniques to convert the E2E dataset to use tree-structured MRs. We used the output of Juraska et al.'s (2018) tagger to find a character within each slot in the flat MR, and automatically adjusted these to correspond to a token boundary if they didn't already. We then used the Viterbi segmentations from the model released by Wiseman et al. (2018) to get spans corresponding to each argument. Finally, we used the Berkeley neural parser (Kitaev and Klein, 2018) to identify spans coordinated by but, and added CONTRAST relations as parents of the coordinated arguments. We added JOIN based on sentence boundaries. An interesting direction for future research would be

Seq2Seq with Linearized Trees
In this work, we use a standard Seq2Seq model with attention (Bahdanau et al., 2014), implemented in the fairseq-py repository (Gehring et al., 2017). The encoder and decoder are both Long Short-Term Memory (LSTM)-based (Hochreiter and Schmidhuber, 1997) and the decoder uses beam search for generation. The input to the model is a linearized representation of the tree-structured MR, and the output is a linearized tree-structured representation of the annotated response (see Table 6). This means that in addition to predicting tokens for the surface realization of the response, the model must also predict non-terminals (dialog/discourse relations and arguments) to indicate the start or end of each span. One advantage of predicting a tree structure is that the model has supervision on the alignment between the MR and the response. Additionally, this predicted tree structure can be used to help verify the correctness of the predicted response; we leverage this for our constrained decoding approach described next. We also delexicalized tokens in the response that correspond to sparse entities, like names in the E2E dataset and temperatures in the weather dataset (see Appendix D).
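To make the linearized format concrete, here is a minimal Python sketch of converting between a tree and a bracketed token sequence. It assumes the token convention suggested by the constrained decoding description (an opening token such as `[name` and a closing token `]`); the nested-tuple tree type, the function names, and the `__name__` placeholder are illustrative assumptions rather than the released code.

```python
from typing import List, Tuple, Union

# A tree is (label, children); each child is either a sub-tree or a plain word.
Tree = Tuple[str, List[Union["Tree", str]]]


def linearize(tree: Tree) -> List[str]:
    """Emit '[label', then the children in order, then ']'."""
    label, children = tree
    tokens = ["[" + label]
    for child in children:
        tokens.extend([child] if isinstance(child, str) else linearize(child))
    tokens.append("]")
    return tokens


def parse(tokens: List[str]) -> Tree:
    """Inverse of linearize; raises ValueError on malformed sequences."""
    root: Tree = ("ROOT", [])
    stack = [root]
    for tok in tokens:
        if tok.startswith("[") and len(tok) > 1:   # opening non-terminal
            node: Tree = (tok[1:], [])
            stack[-1][1].append(node)
            stack.append(node)
        elif tok == "]":                           # closing bracket
            if len(stack) == 1:
                raise ValueError("unmatched closing bracket")
            stack.pop()
        else:                                      # surface word
            stack[-1][1].append(tok)
    if len(stack) != 1 or len(root[1]) != 1 or isinstance(root[1][0], str):
        raise ValueError("unbalanced or multi-rooted sequence")
    return root[1][0]


response = ("INFORM", [("name", ["__name__"]), "is", "rated", ("rating", ["5", "stars"])])
tokens = linearize(response)
```

Since words and non-terminals share one output vocabulary in this encoding, an unmodified Seq2Seq decoder can emit the annotated response token by token, and the predicted sequence can be parsed back into a tree for checking.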

Constrained Decoding
As described above, the output structure predicted by the model forms a tree that should correspond neatly to the input MR, barring some instances of ellipsis (as with the date_time argument in Table 6). Thus, the input MR can be seen as a constraint on the semantic correctness of the prediction; if the predicted structure doesn't match the MR, the prediction is incorrect and can be rejected. Figure 1 illustrates these ideas.
Figure 1 (caption fragment): (1) and (2) are valid outputs. (3) fails to meet tree constraints since the CONTRAST node is not present and the INFORM node has illegal children customerrating and pricerange.
Our beam search algorithm works as follows. First, the input tree is scanned to identify groups of two or more nodes that have the same value, so that ellipsis can be enabled by optionally allowing just one node in each group. Then, as the tree structure is incrementally decoded, non-terminals are checked against the input tree for validity. When an opening bracket token (e.g., [name) is generated, it is not accepted if it isn't a child of the current parent node in the input tree, or has already been generated in the current subtree, thereby preventing repetition and hallucination of arguments or acts. When a closing bracket token ] or an end-of-sentence (EOS) token is generated, it is accepted only if all children of the current parent are covered either directly or through ellipsis, thus ensuring that all children of every node are generated. After each timestep of the beam search, the scores of candidates that violate tree constraints are masked so that they do not proceed forward. By removing candidates that violate the constraints early in the beam search, we allow the decoder to explore more hypotheses.
Checking these constraints and tracking coverage requires an alignment between the output and input MRs. While the children of JOIN nodes are required to appear in order, child nodes of other discourse relations and dialog acts can appear in any order, and thus the corresponding input non-terminal is not always uniquely identifiable when an output non-terminal is opened. For this reason, a set of possible alignments is maintained. In particular, when accepting a non-terminal, all possible nodes in the input that it may correspond to are identified and a state is maintained for each possibility. Open states whose constraints are violated are removed from tracking, and a non-terminal is not accepted when no more open states are left. Though in principle the number of open states could grow large, empirically any alignment nondeterminism is quickly resolved.
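The validity checks and alignment tracking described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it checks a token sequence step by step rather than masking scores inside beam search, it omits ellipsis handling, and the tuple-based tree type and the names `step` and `accepts` are hypothetical.

```python
from typing import FrozenSet, List, Set, Tuple

Tree = Tuple[str, list]                          # (label, [child trees])
Frame = Tuple[Tuple[int, ...], FrozenSet[int]]   # (path to input node, covered child indices)
State = Tuple[Frame, ...]                        # stack of open nodes, outermost first


def node_at(tree: Tree, path: Tuple[int, ...]) -> Tree:
    for i in path:
        tree = tree[1][i]
    return tree


def step(tree: Tree, states: Set[State], token: str) -> Set[State]:
    """Advance every open alignment by one token; an empty result means reject."""
    virtual_root: Tree = ("ROOT", [tree])
    out: Set[State] = set()
    for state in states:
        path, covered = state[-1]
        node = node_at(virtual_root, path)
        if token.startswith("[") and len(token) > 1:
            for i, child in enumerate(node[1]):
                if child[0] != token[1:] or i in covered:
                    continue                     # not a child here, or already generated
                if node[0] == "JOIN" and i != len(covered):
                    continue                     # JOIN children must appear in input order
                out.add(state + ((path + (i,), frozenset()),))
        elif token == "]":
            if len(state) > 1 and len(covered) == len(node[1]):
                parent_path, parent_cov = state[-2]
                out.add(state[:-2] + ((parent_path, parent_cov | {path[-1]}),))
        else:
            out.add(state)                       # ordinary words never violate tree constraints
    return out


def accepts(tree: Tree, tokens: List[str]) -> bool:
    """True if the full sequence (with EOS implied at the end) satisfies the constraints."""
    states: Set[State] = {(((), frozenset()),)}
    for tok in tokens:
        states = step(tree, states, tok)
        if not states:
            return False
    return any(len(s) == 1 and s[0][1] == frozenset({0}) for s in states)


mr = ("CONTRAST", [("INFORM", [("name", []), ("rating", [])]),
                   ("INFORM", [("familyFriendly", [])])])
```

In an actual decoder, a candidate whose next token drives `step` to the empty set would have its score masked at that timestep. Keeping a set of states mirrors the alignment ambiguity noted above: after `[CONTRAST [INFORM`, both INFORM children remain possible until the first argument is opened.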
Note that although the algorithm ensures that the output tree structure is compatible with the input structure, it turns out that the model can still occasionally hallucinate content: since the neural model allows all possible token sequences in principle, it sometimes generates word sequences that express a hallucinated slot by simply skipping over the disallowed slot annotation (thereby bypassing the constraints), especially when given an unusual input. These cases are discussed further below.

Experiments
In this section, we first describe our baselines, metrics, and implementation details, followed by experimental results and analyses.

Experimental Setup
Baselines We consider a few Seq2Seq-based baselines in our experiments (we use the open fairseq implementation (Gehring et al., 2017) for all our experiments). All models use an LSTM-based encoder and decoder, with attention.

S2S-FLAT The input is a flat MR (for the E2E dataset, this is equivalent to the original form of the data; for weather, we remove all discourse relations and treat all dialog acts as a single large MR). The output is the raw delexicalized response.
S2S-TOKEN Following Reed et al. (2018), we add three tokens at the beginning of the flat input MR (same as S2S-FLAT) to indicate the number of contrasts, joins and sentences (dialog acts) to be generated. The output is the raw delexicalized response.
S2S-TREE Same architecture as S2S-FLAT, but the input and output for this model are the linearized tree-structured MR and the tree-structured response respectively.
S2S-CONSTR Our proposed model. It has the same architecture as S2S-TREE, but decoding during beam search is constrained, as described in Section 4.2.
Data preprocessing In the input MR, all arguments within each dialog act are ordered alphabetically, to ensure a consistent ordering across examples. We also use alignments between the reference and the MR to filter out information (arguments or dialog acts/discourse relations) that is not expressed in the reference; however, we ensure that any arguments that occur multiple times in the MR, but are elided in the reference for redundancy, are still preserved in the MR. This ensures that the model doesn't have to learn content selection, while still achieving our primary goal of discourse structure control. The inputs to S2S-FLAT and S2S-TOKEN are prepared by removing all dialog act and discourse information in the linearized MR, and numbering arguments according to the dialog act they belong to. Global order of dialog acts is preserved such that arguments of the first act occur before the arguments of the following acts, but arguments within a dialog act are ordered alphabetically.
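The flattening step for the S2S-FLAT and S2S-TOKEN inputs might look like the sketch below. The suffix-numbering scheme, the control-token spellings (`contrast_1` and so on), and the function name `flatten` are assumptions for illustration; the paper does not specify these surface details, and nested arguments are ignored here.

```python
from typing import List, Tuple

DISCOURSE = {"JOIN", "CONTRAST", "JUSTIFY"}


def flatten(tree) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Return (control tokens, numbered flat arguments) for a tree-structured MR.

    A tree is (label, children); a dialog act's children are (slot, value) pairs.
    """
    counts = {"CONTRAST": 0, "JOIN": 0}
    flat: List[Tuple[str, str]] = []
    n_acts = 0

    def visit(node) -> None:
        nonlocal n_acts
        label, children = node
        if label in DISCOURSE:
            if label in counts:
                counts[label] += 1
            for child in children:
                visit(child)
        else:                                   # a dialog act such as INFORM
            n_acts += 1
            for slot, value in sorted(children):   # alphabetical within the act
                flat.append((f"{slot}{n_acts}", value))

    visit(tree)
    control = [f"contrast_{counts['CONTRAST']}", f"join_{counts['JOIN']}", f"acts_{n_acts}"]
    return control, flat


mr = ("CONTRAST", [("INFORM", [("rating", "5"), ("name", "X")]),
                   ("INFORM", [("familyFriendly", "no")])])
control, flat = flatten(mr)
```

Under these assumptions, S2S-FLAT would consume only `flat`, while S2S-TOKEN would consume `control` followed by `flat`; in both cases which arguments fall under the CONTRAST is no longer recoverable from the input.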
Metrics We consider automatic and human evaluation metrics for our model. Automatic metrics are evaluated on the raw model predictions (which have delexicalized fields, like temp_low):
• Tree accuracy is a novel metric that we introduce for this problem. It measures whether the tree structure in the prediction matches that of the input MR exactly. We implemented our tree accuracy metric to account for grouping and ellipsis, and will release this implementation along with our dataset.
• BLEU-4 (Papineni et al., 2002) is a word-overlap metric commonly used for evaluating NLG systems.
Due to the limitations of automatic metrics for NLG (Novikova et al., 2017; Reiter, 2018), we also performed human evaluation studies by asking annotators to evaluate the quality of responses produced by different models. Annotators provided binary ratings on the following dimensions:
• Grammaticality: Measures fluency of the responses. Our evaluation guidelines included considerations for proper subject-verb agreement, word order, repetition, and grammatical completeness.
• Correctness: Measures semantic correctness of the responses. Our guidelines included considerations for sentence structure, contrast, hallucinations (incorrectly included attributes), and missing attributes. We asked annotators to evaluate model predictions against the reference (rather than the MR; see Appendix F).
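A simplified version of the tree accuracy metric described above can be sketched in Python. This sketch compares only the non-terminal skeletons of the linearized MR and prediction, treats sibling order as irrelevant everywhere (including under JOIN), and does not handle grouping or ellipsis, so it is an approximation of the idea rather than the released metric; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def skeleton(tokens: List[str]) -> Tuple:
    """Non-terminal structure of a linearized tree, with sibling order normalized."""
    stack = [["ROOT", []]]
    for tok in tokens:
        if tok.startswith("[") and len(tok) > 1:
            node = [tok[1:], []]
            stack[-1][1].append(node)
            stack.append(node)
        elif tok == "]":
            if len(stack) == 1:
                raise ValueError("unmatched closing bracket")
            stack.pop()
        # plain words and argument values are ignored
    if len(stack) != 1:
        raise ValueError("unclosed bracket")

    def freeze(node) -> Tuple:
        return (node[0], tuple(sorted(freeze(child) for child in node[1])))

    return freeze(stack[0])


def tree_accuracy(mrs: Sequence[List[str]], predictions: Sequence[List[str]]) -> float:
    """Fraction of predictions whose skeleton equals that of the input MR."""
    hits = 0
    for mr, pred in zip(mrs, predictions):
        try:
            hits += skeleton(mr) == skeleton(pred)
        except ValueError:
            pass                                 # malformed output counts as a miss
    return hits / len(mrs) if mrs else 0.0


mr = "[CONTRAST [INFORM [name x ] [rating 5 ] ] [INFORM [familyFriendly no ] ] ]".split()
good = "[CONTRAST [INFORM [rating 5 stars ] [name x ] ] but [INFORM [familyFriendly no ] ] ]".split()
bad = "[INFORM [name x ] [rating 5 ] [familyFriendly no ] ]".split()
```

Here `good` matches despite reordered arguments and extra surface words, while `bad` drops the CONTRAST node and is therefore counted as incorrect.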

Constrained Decoding Analysis
We trained each of the models described above on the weather dataset and the E2E dataset, and evaluated automatic metrics on the test set (see footnote 7). In the E2E test set, each flat MR has multiple references (and therefore multiple compositional MRs). When computing BLEU scores for the token, tree, and constrained models, we generated one hypothesis for each of the compositional MRs for a single flat MR, and chose the hypothesis with the highest score against all references for that flat MR. We then computed corpus BLEU using these hypotheses. While this isn't an entirely fair way to evaluate these models against the E2E systems, it serves as a sanity check to validate that generation models provided with more semantic information about the references can achieve better BLEU scores against them. For both E2E and weather, we also filtered out, from all model computations, any examples where S2S-CONSTR failed to generate a valid response (Section 5.3).
Footnote 7: We used the scripts provided at https://github.com/tuetschek/e2e-metrics by the E2E organizers for evaluating both the E2E and the weather models.
For human evaluation, we show an overall correctness measure Corr measured on the full test sets, as well as Disc, measured on a more challenging subset of the test set that we selected. For the E2E dataset, we chose examples that contained contrasts by identifying references with a but (230 total). For the weather dataset, we chose 400 examples where the MR has at least one CONTRAST or JUSTIFY. We also included test examples with argument type combinations previously unseen in the training set (313 total); we expect these to be challenging for all models, and in particular for the flat model, which has to infer the right discourse relation for new combinations of arguments.
Table 7 shows the results of this experiment. On both the E2E and weather datasets, S2S-CONSTR improves tree accuracy significantly (using McNemar's chi-squared test) over S2S-TREE. Human evaluation metrics also show that models that are aware of the tree-structured MR (S2S-TREE and S2S-CONSTR) perform significantly better on correctness measures than S2S-TOKEN, which is only aware of the presence or absence of discourse relations, and significantly better than S2S-FLAT, which has no awareness of the structure. The gap is larger on Disc: the flat model gets only 31% of the challenging cases correct on the E2E dataset, while the constrained model's accuracy is more than twice that. A similar gap is evident in the weather dataset. [...] 2019)) and diversity metrics (Section 5.4). We note that for the E2E dataset, the BLEU score increases observed with the tree-based models are not statistically significant compared to S2S-TOKEN. We think this may be partly because many discourse patterns are correlated with the flat MR structure in the E2E dataset (e.g. family-friendly and highly rated are frequently CONTRASTed).
By contrast, BLEU score increases are statistically significant for all models on our weather dataset. Also, S2S-CONSTR fails to generate any valid candidates for ~1.5% of the weather test examples. In most of these cases, the model stutters, i.e. produces degenerate output like "will be be be ...". We suspect that in these cases, the imposed decoding constraints cause the Seq2Seq decoder to get stuck in a pseudoterminal state.

Results
Grammaticality seems to drop slightly for the tree-based models on the weather dataset, but not on the E2E dataset. One hypothesis from this and the correctness numbers is that the flat models generate more generic (and therefore grammatical), but also incorrect, responses, compared to the tree-based models. We also note that there's a noticeable gap in the E2E dataset between tree accuracy and the correctness numbers from human evaluation. We analyzed 35 examples where our tree accuracy metric disagreed with human evaluation, and found 22 (63%) cases where the compositional MR was missing information in the reference, seemingly due to noise in our automatic annotation process (Section 3.2). We also identified 6 cases (17%) of annotator confusion (for example whether between £20-30 implies the same meaning as moderately priced), sometimes caused by noisy references that contained additional information.
The remaining examples all contained legitimate model errors, like content hallucination, or a wrong slot being produced despite a correct non-terminal. One future direction to get more reliable metrics would be to improve the automatic annotation process in Section 3.2 to eliminate noise and flag noisy references. Further experimentation is described in Appendix E.

Diversity Metrics
We report the diversity metrics used for evaluating E2E challenge submissions in Dušek et al. (2019) (# unique tokens, # unique trigrams, Shannon token entropy (Manning and Schütze, 1999, p.61ff.), conditional bigram entropy (Manning and Schütze, 1999, p.63ff.)). Table 8 shows these numbers, as compared against a few of the E2E participating systems, TGEN, SLUG, and ADAPT (Elder et al., 2018). All of the models with enriched semantic representations (S2S-TOKEN, S2S-TREE, and S2S-CONSTR) show higher diversity than neural baselines without diversity considerations. Combined with our improved BLEU scores, this seems to indicate that adding discourse relation information to input MRs can increase diversity, without incurring losses on automatic metrics (as is the case with the diversity-promoting ADAPT system).
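For reference, the four diversity metrics named above can be computed along the following lines. This is a minimal sketch using the standard definitions, with Shannon entropy $H = -\sum_w p(w)\log_2 p(w)$ and conditional bigram entropy $H(w_2 \mid w_1) = -\sum_{w_1,w_2} p(w_1,w_2)\log_2 p(w_2 \mid w_1)$; it assumes pre-tokenized outputs and may differ in tokenization and normalization details from the scripts used for the E2E challenge.

```python
import math
from collections import Counter
from typing import Dict, List


def diversity(texts: List[List[str]]) -> Dict[str, float]:
    """Corpus-level diversity statistics over tokenized system outputs."""
    unigrams, bigrams, trigrams = Counter(), Counter(), Counter()
    for toks in texts:
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
        trigrams.update(zip(toks, toks[1:], toks[2:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

    # Shannon entropy of the unigram distribution, in bits.
    token_entropy = -sum(c / n_uni * math.log2(c / n_uni) for c in unigrams.values())

    # Count how often each word occurs as the first element of a bigram,
    # so that p(w2 | w1) = count(w1, w2) / count(w1 as bigram start).
    first = Counter()
    for (w1, _), c in bigrams.items():
        first[w1] += c
    cond_entropy = -sum(c / n_bi * math.log2(c / first[w1]) for (w1, _), c in bigrams.items())

    return {
        "unique_tokens": len(unigrams),
        "unique_trigrams": len(trigrams),
        "token_entropy": token_entropy,
        "cond_bigram_entropy": cond_entropy,
    }


stats = diversity([["a", "b"], ["a", "c"]])
```

Higher values on all four statistics indicate more varied output; a system that always emits the same template would have low entropy regardless of its BLEU score.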

Data Efficiency and Generalizability
We measured tree accuracy on the full E2E and weather test sets by varying the number of training samples for S2S-TREE and S2S-CONSTR (Figure 2). S2S-CONSTR achieves more than 90% tree accuracy with just 2K samples and more than 95% with 5K samples on both datasets, suggesting that constrained decoding can help achieve superior performance with much less data. Meanwhile, we also investigated the extent to which tree-structured MRs could allow models to generalize to compositional semantics (Figure 3)

6 Related Work
Reed et al.'s (2018) approach to enriching the input, discussed earlier, is the most closely related work to ours. A more recent work by Moryossef et al. (2019) also focuses on exercising more control over input structures through sentence plans; however, their work doesn't touch on discourse relations or constrained decoding. Puduppully et al. (2018) build a modular end-to-end neural architecture that performs content planning in addition to realization, although they focus on generating text from structured tables, and don't consider discourse structure. Also related is Kiddon et al.'s (2016) neural checklist model, which tracks the coverage of an input list of ingredients when generating recipes.
Our constrained decoding approach goes beyond covering a simple list by enforcing constraints on ordering and grouping of tree structures, but theirs takes coverage into account during model training. A more direct inspiration for our approach is the way coverage has been traditionally tracked in grammar-based surface realization (Shieber, 1988;Kay, 1996;Carroll et al., 1999;Carroll and Oepen, 2005;Nakanishi et al., 2005;White, 2006;White and Rajkumar, 2009). Compared to our approach, grammar-based realizers can prevent hallucination entirely, though at the expense of developing an explicit grammar. Constrained decoding in MT (Post and Vilar, 2018, i.a.) has been used to enforce the use of specific words in the output, rather than constraints on tree structures. Also related are neural generators that take Abstract Meaning Representations (AMRs) as input (Konstas et al., 2017, i.a.) rather than flat inputs; these approaches, however, do not generate trees or use constrained decoding.

Conclusions
We show that using rich tree-structured meaning representations can improve expressiveness and semantic correctness in generation. We also propose a constrained decoding technique that leverages tree-structured MRs to exert precise control over the discourse structure and semantic correctness of the generated text. We release a challenging new dataset for the weather domain and an enriched E2E dataset that include tree-structured MRs. Our experiments show that constrained decoding, together with tree-structured MRs, can greatly improve semantic correctness as well as enhance data efficiency and generalizability.

A Weather Forecast Generation
For every example, we extracted the date range requested by the user, and generated artificial weather forecasts for that date range. We generated forecasts of different granularities (hourly or daily) depending on the date requested by the user. If the date requested was less than 24 hours after the "reference" date in the synthetic user context, we generated hourly forecasts; otherwise, we generated the required number of daily forecasts. To generate forecasts, we selected reasonable mean, standard deviation, min, and max values for temperature and cloud coverage, and used these to sample temperatures for every point in the date range. We also selected random sunrise and sunset times for each day present in the range. We picked values that seemed reasonable, but didn't try too hard to get precise values, since our focus was more on using the forecasts to create complex MRs. We then grouped these acts together using a JOIN discourse relation.
3. Contrast: We identified attributes that were in opposition ("cloudy" vs. "sunny") and added a parent CONTRAST discourse relation to any such dialog acts. We also contrasted related attributes wherever possible; e.g. the cloud coverage value "sunny" can be contrasted with both "cloudy" and the precipitation type "rain".
4. Yes/no questions: Whenever the user query was a boolean one ("Will it rain tomorrow"), we added YES or NO dialog acts as appropriate.
5. Justifications/Recommendations: Whenever the user query mentioned an attire or activity ("Should I wear a raincoat?"), we assumed that the MR should communicate a recommendation as well as a justification for it ("No, you don't need to wear one, it looks like it'll be sunny all day"). In these cases, we added a RECOMMEND dialog act, and an INFORM dialog act that provides the justification for the recommendation. We added a parent JUSTIFY discourse relation to these acts, treating the recommendation as the nucleus and the INFORM as the satellite of the justification.
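The CONTRAST and JUSTIFY rules above can be illustrated with a small sketch. The dictionary-based tree encoding, the function names, and the opposition table are assumptions made for illustration only; the table contains just the example pairs mentioned in the text.

```python
# Hypothetical opposition table, limited to the pairs named in the text.
OPPOSITES = {
    frozenset({"sunny", "cloudy"}),
    frozenset({"sunny", "rain"}),
}


def inform(attr, value):
    """Build an INFORM dialog act with a single argument."""
    return {"act": "INFORM", "args": {attr: value}}


def _values(act):
    return set(act["args"].values())


def add_contrast(acts):
    """Place the first pair of opposing acts under a CONTRAST relation."""
    for i, a in enumerate(acts):
        for b in acts[i + 1:]:
            opposed = any(
                frozenset({x, y}) in OPPOSITES
                for x in _values(a)
                for y in _values(b)
            )
            if opposed:
                rest = [c for c in acts if c is not a and c is not b]
                return rest + [{"relation": "CONTRAST", "children": [a, b]}]
    return acts  # no opposing attributes found


def justify(recommendation, justification):
    """Wrap a RECOMMEND act (nucleus) and an INFORM act (satellite)."""
    return {
        "relation": "JUSTIFY",
        "nucleus": recommendation,
        "satellite": justification,
    }
```

For instance, `add_contrast([inform("cloud_coverage", "sunny"), inform("precip_type", "rain")])` returns a single CONTRAST node whose children are the two INFORM acts.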

C Dataset Creation Quality
As mentioned in Section 3, we asked annotators to evaluate collected responses, and used these ratings to filter out noisy references and annotations from our final dataset. The ratings were made on a 1-5 scale and double annotated, and we filtered out 3,404 examples (out of a total of 37,162) that had a score below 3 on any of the four dimensions: fluency, correctness, naturalness, and annotation correctness.
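The filtering rule amounts to a simple threshold check, sketched below. The field names and the assumption that each example carries a single resolved rating per dimension are hypothetical; the text does not specify how the two annotations were combined.

```python
DIMENSIONS = ("fluency", "correctness", "naturalness", "annotation_correctness")
MIN_SCORE = 3  # examples scoring below this on any dimension are dropped


def keep(example):
    """True if every rated dimension meets the minimum score."""
    return all(example["ratings"][dim] >= MIN_SCORE for dim in DIMENSIONS)


def filter_examples(examples):
    """Return only the examples that pass the quality threshold."""
    return [ex for ex in examples if keep(ex)]
```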

D Data Preprocessing
Infinitely-valued arguments such as names of restaurants, dates, times, and locations such as cities and states are delexicalized (their values are replaced by placeholder tokens) in both the input and output of models. This follows the approach taken by several of the systems in the E2E challenge (Dušek and Jurčíček, 2016; Juraska et al., 2018; Dušek et al., 2019). The reasoning behind this is that the values of such arguments are often inserted verbatim in the response text, and therefore do not affect the final surface form realization. Replacing these arguments in both the input and output reduces the vocabulary size and prevents sparsity issues. (A copy mechanism, such as the one introduced in Vinyals et al. (2015), can be used to address this, though we did not explore this approach in this work.) The full list of arguments for which we performed delexicalization is:
1. Numerical arguments: temperature-related arguments, precipitation chance, day, month, year (for dates).
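Delexicalization of this kind can be sketched as follows. The placeholder format, the function names, and the argument names in the usage note are assumptions for illustration; the sketch also assumes that argument values begin and end with alphanumeric characters, so that word-boundary matching applies.

```python
import re


def delexicalize(mr_args, response):
    """Replace argument values with placeholders in both the MR and the response.

    Returns the delexicalized arguments, the delexicalized response, and the
    placeholder-to-value mapping needed to restore the values afterwards.
    """
    delex_args, mapping = {}, {}
    # Longer values go first so a short value cannot clobber part of a longer one.
    for name, value in sorted(mr_args.items(), key=lambda kv: -len(kv[1])):
        placeholder = f"__{name.upper()}__"
        delex_args[name] = placeholder
        mapping[placeholder] = value
        pattern = r"\b" + re.escape(value) + r"\b"
        response = re.sub(pattern, lambda _m, p=placeholder: p, response)
    return delex_args, response, mapping


def relexicalize(text, mapping):
    """Put the original values back into a generated response."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text
```

With `mr_args = {"temp_high": "75", "city": "Seattle"}`, the response "It will reach 75 in Seattle." becomes "It will reach __TEMP_HIGH__ in __CITY__.", and `relexicalize` restores the original sentence.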

E Additional Experiments
We also experimented with a reranked S2S-TREE in which the beam search candidates are reranked for tree accuracy. This yields a tree accuracy of 97.6% and 95.4% on E2E and weather, respectively. We also trained a Recurrent Neural Network Grammar (RNNG) to tag slots in the predictions of S2S-CONSTR in order to filter out hallucinations. The correctness on the filtered test sets rose from 85.89% to 87.44% for E2E, and from 91.82% to 93.84% for weather.
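Reranking beam candidates for tree accuracy could look roughly like the sketch below. It assumes, for illustration only, that MRs and outputs are whitespace-tokenized bracketed strings in which a token starting with "[" opens a non-terminal and "]" closes one, and that a candidate counts as tree-accurate when its bracket structure matches that of the input MR once terminal words are dropped. The function names are hypothetical.

```python
def skeleton(bracketed):
    """Keep only non-terminal openers and closing brackets, dropping terminals."""
    return [tok for tok in bracketed.split() if tok.startswith("[") or tok == "]"]


def tree_accurate(mr, candidate):
    """True if the candidate has the same non-terminal structure as the MR."""
    return skeleton(mr) == skeleton(candidate)


def rerank(mr, beam):
    """Move tree-accurate candidates ahead of the others.

    `beam` is a list of (text, score) pairs ordered best-first by the model.
    Python's sort is stable, so the model's ordering is preserved within the
    tree-accurate group and within the remaining group.
    """
    return sorted(beam, key=lambda cand: not tree_accurate(mr, cand[0]))
```

Unlike constrained decoding, this kind of reranking can only help when at least one candidate in the beam already has the right structure.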

F Human Evaluation of Models
When asking annotators to rate the models on correctness, we asked them to rate the response by comparing it against the reference, rather than against the MR. This adds the risk that annotators are confused by noisy references, but we found that it increased annotation speed and agreement rates significantly over evaluating against the MR directly. This is partly because our MRs are tree-structured and can be hard to read. We performed double-annotation with a resolution round.
Automatic rejection: When analyzing evaluation results, we found that it was fairly easy to miss the absence of a contrast or a justification in our weather dataset, especially since our dataset is so large. As a result, annotators were marking several incorrect cases as correct. To address this issue, we automatically marked as incorrect any examples where the MR had a CONTRAST but the response lacked any contrastive tokens, or where the MR had a JUSTIFY but the response lacked any clear markers of a justification. This eliminated noise from 2.8% of all responses.
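The automatic rejection rule can be sketched as a simple marker check. The marker lists, the function names, and the assumption that the MR is available as a string containing the relation names are all hypothetical; the actual token lists are not given in the text.

```python
import re

# Hypothetical marker lists, for illustration only.
CONTRAST_MARKERS = ("but", "however", "although", "though", "while")
JUSTIFY_MARKERS = ("because", "since", "so")


def has_marker(response, markers):
    """True if any marker occurs as a whole word in the response."""
    words = set(re.findall(r"[a-z']+", response.lower()))
    return any(marker in words for marker in markers)


def auto_reject(mr, response):
    """Mark a response incorrect if a required discourse marker is missing."""
    if "CONTRAST" in mr and not has_marker(response, CONTRAST_MARKERS):
        return True
    if "JUSTIFY" in mr and not has_marker(response, JUSTIFY_MARKERS):
        return True
    return False
```

A check of this kind only overrides ratings in one direction: it can mark a response as incorrect, but never as correct.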