Controllable Meaning Representation to Text Generation: Linearization and Data Augmentation Strategies

We study the degree to which neural sequence-to-sequence models exhibit fine-grained controllability when performing natural language generation from a meaning representation. Using two task-oriented dialogue generation benchmarks, we systematically compare the effects of four input linearization strategies on controllability and faithfulness. Additionally, we evaluate how a phrase-based data augmentation method can improve performance. We find that properly aligning input sequences during training leads to highly controllable generation, both when training from scratch and when fine-tuning a larger pre-trained model. Data augmentation further improves control on difficult, randomly generated utterance plans.


Introduction
In this work, we study the degree to which neural sequence-to-sequence (S2S) models exhibit fine-grained controllability when performing natural language generation (NLG) from a meaning representation (MR). In particular, we focus on an S2S approach that respects the realization ordering constraints of a given utterance plan; such a model can generate utterances whose phrases follow the order of the provided plan.
In non-neural NLG, fine-grained control for planning sentence structure has received extensive study under the names sentence planning or micro-planning (Reiter and Dale, 2000; Walker et al., 2001; Stone et al., 2003). Contemporary practice, however, eschews modeling at this granularity, instead preferring to train an S2S model to directly map an input MR to a natural language utterance, with the utterance plan determined implicitly by the model, which is learned from the training data (Dušek et al., 2020).
We argue that robust and fine-grained control in an S2S model is desirable because it enables neural implementations of various psycho-linguistic theories of discourse (e.g., Centering Theory (Grosz et al., 1995) or Accessibility Theory (Ariel, 2001)). This could, in turn, encourage the validation and/or refinement of additional psychologically plausible models of language production.

[Figure 1: An example MR with attribute-value pairs (1) genres = "role-playing", (2) genres = "hack-and-slash", (3) ESRB = "M (Mature)", (4) rating = "good", aligned to the reference utterance "What is it about M rated(3) hack-and-slash(2) RPGs(1) that makes you enjoy them?(4)"]
In this paper, we study controllability in the context of task-oriented dialogue generation (Mairesse et al., 2010;Wen et al., 2015), where the input to the NLG model is an MR consisting of a dialogue act (i.e. a communicative goal) such as to REQUEST EXPLANATION, and an unordered set of attribute-value pairs defining the semantics of the intended utterance (see Figure 1 for an example).
The NLG model is expected to produce an utterance that adequately and faithfully communicates the MR. In the S2S paradigm, the MR must be "linearized" (i.e. represented as a linear sequence of tokens corresponding to the dialogue act and attribute-value pairs) before being presented to the S2S encoder. We explore several linearization strategies and measure their effectiveness for controlling phrase order, as well as their effect on model faithfulness (i.e., the semantic correctness of generated utterances).
Of particular note, alignment training (i.e., at training time, linearizing the attribute-value pairs according to the order in which they are realized by their corresponding reference utterance) produces highly controllable S2S models. While we are not the first to observe this (cf. Nayak et al. (2017)), we study this behavior extensively. We refer to an ordered sequence of attribute-value pairs x_1, x_2, ..., x_n as an utterance plan, and evaluate models on their ability to follow such plans given by either another model, a human, or, most difficult of all, a random permutation.
Additionally, we experiment with a data augmentation method, where we create fragmentary MR/utterance pairs obtained from the constituent phrases of the original training data. We find that this data augmentation results in reduced semantic error rates and increases the ability of a model to follow an arbitrary utterance plan.
We summarize our contributions as follows.
(1) We show that alignment training produces highly controllable language generation models, especially when following a model-created utterance plan. (2) We demonstrate that phrase-based data augmentation improves the robustness of this control even on arbitrary and difficult-to-follow utterance plans. (3) We conclude with a human evaluation that shows that phrase-based data augmentation training can increase the robustness of control without hurting fluency.¹

Methods
In an MR-to-text task, we are given as input an MR µ ∈ M from which to generate an appropriate natural language utterance y ∈ Y, where µ consists of a dialogue act that characterizes the communicative goal of the utterance and an unordered, variably sized set of attribute-value pairs. Attributes are either binary or categorical variables (e.g., family-friendly: ["yes", "no"] or food: ["Chinese", "English", "French", ...]).² Let each attribute-value pair x and dialogue act a be tokens from a vocabulary V, and define the size of an MR, denoted |µ|, to be the number of attribute-value pairs x ∈ µ. A linearization strategy π : M → V* is a mapping of the dialogue act and attribute-value pairs in µ to a linear sequence of tokens.

¹ Code, outputs, augmented data, and other materials can be found here: https://github.com/kedz/cmr2text.
² There are also list-valued attributes, but we treat them as individual attribute-value pairs (i.e., in Figure 1, both genres = "role-playing" and genres = "hack-and-slash" are in µ).

Linearization Strategies
Because of the recurrence in the GRU and the position embeddings in the Transformer, it is usually the case that different linearization strategies, i.e., π(µ) ≠ π′(µ), will result in different model-internal representations and therefore different conditional probability distributions. These differences can be non-trivial, yielding changes in model behavior with respect to faithfulness and control.
We study four linearization strategies, (i) random, (ii) increasing-frequency, (iii) fixed-position, and (iv) alignment training, which we describe below. For visual examples of each strategy, see Figure 2. Note that linearization determines the order of the attribute-value pairs presented to the S2S encoder, and only in the case of alignment training does it correspond to the order in which the attribute-value pairs are realized in the utterance. When presenting a linearized MR to the model encoder, we always prepend and append distinguished start and stop tokens respectively.

Random (RND)
In the random linearization (RND), we randomly order the attribute-value pairs for a given MR. This strategy serves as a baseline for determining if linearization matters at all for faithfulness. RND is similar to token level noise used in denoising auto-encoders (Wang et al., 2019) and might even improve faithfulness. During training, we resample the ordering for each example at every epoch. We do not resample the validation set in order to obtain stable results for model selection.
Increasing Frequency (IF) In the increasing frequency linearization (IF), we order the attribute-value pairs by increasing frequency of occurrence in the training data (i.e., count(x_i) ≤ count(x_{i+1})). We hypothesize that placing frequently occurring items in a consistent location may make it easier for the model to realize those items correctly, possibly at the expense of rarer items.
Fixed Position (FP) We take consistency one step further and create a fixed ordering of all attributes, n.b. not attribute-values, ordering them in increasing frequency of occurrence on the training set (i.e. every instance has the same order of attributes in the encoder input). In this fixed position linearization (FP), attributes that are not present in an MR are explicitly represented with an "N/A" value. For list-valued slots, we determine the maximum length list in the training data and create that many repeated slots in the input sequence. This linearization is feasible for datasets with a modest number of unique attributes but may not easily scale to 10s, 100s, or larger attribute vocabularies.
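As a concrete illustration, the first three strategies might be sketched as follows. This is a minimal sketch with an invented MR format, made-up counts, and illustrative attribute names; the paper's actual data pipeline differs.

```python
import random

# A toy MR: dialogue act plus an unordered set of attribute-value pairs.
mr = {"act": "inform",
      "attrs": [("name", "Aromi"), ("area", "city centre"),
                ("eat_type", "coffee shop")]}

# Assumed training-set attribute-value counts, used by the IF strategy.
counts = {("name", "Aromi"): 50, ("area", "city centre"): 120,
          ("eat_type", "coffee shop"): 300}

# Assumed fixed attribute order (attributes, not values) for FP.
FIXED_ATTRS = ["name", "eat_type", "area", "family_friendly"]

def linearize_rnd(mr, rng=random):
    """RND: random order, resampled for each example at every epoch."""
    attrs = list(mr["attrs"])
    rng.shuffle(attrs)
    return [mr["act"]] + attrs

def linearize_if(mr):
    """IF: attribute-values sorted by increasing training frequency."""
    return [mr["act"]] + sorted(mr["attrs"], key=lambda av: counts.get(av, 0))

def linearize_fp(mr):
    """FP: every attribute in a fixed slot; absent attributes get "N/A"."""
    by_attr = dict(mr["attrs"])
    return [mr["act"]] + [(a, by_attr.get(a, "N/A")) for a in FIXED_ATTRS]
```

In an actual implementation each element would then be mapped to a vocabulary token and wrapped in the distinguished start/stop tokens before being fed to the encoder.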
Alignment Training (AT) In the alignment training linearization (AT), during training the order of attribute-value pairs x_1, x_2, ..., x_|µ| matches the order in which they are realized in the corresponding training utterance. This is feasible because in the majority of cases there is a one-to-one mapping of attribute-values and utterance subspans. We obtain this ordering using a manually constructed set of matching rules that identify which utterance subspans correspond to each attribute-value pair. For example, two different plans for the same MR yield two different realizations:

π(µ) = [inform, name=Aromi, eat_type=coffee shop, area=city centre]
y = Aromi is a coffee shop in the city centre.

π′(µ) = [inform, eat_type=coffee shop, name=Aromi, area=city centre]
y′ = There is a coffee shop called Aromi in the city centre.
Crucially, AT stands in contrast to the first three strategies (RND, IF, and FP), which have no correspondence between the order of attribute-value pairs in π(µ) and the order in which they are realized in the corresponding utterance y.
At test time, when there is no reference utterance, AT cannot specify a linearization. However, models trained with AT can generate an utterance from an arbitrary utterance plan x_1, x_2, ..., x_|µ| provided by an external source, such as an utterance planner model or a human reference. See Figure 3 for an example of how an AT-trained model might follow three different plans for the same MR.

Phrase-based Data Augmentation
We augment the training data with MR/utterance pairs taken from constituent phrases in the original training data. We parse all training utterances and enumerate all constituent phrases governed by NP, VP, ADJP, ADVP, PP, S, or SBAR non-terminals.³ We then apply the attribute-value matching rules used for AT (see §3.1) to obtain a corresponding MR, keeping the dialogue act of the original utterance. We discard phrases with no realized attributes. See Table 1 for augmented data statistics.

Because we reclassify the MR of phrases using the matching rules, the augmented data includes examples of how to invert binary attributes; e.g., from the phrase "is not on Mac," which denotes has mac release = "no," we obtain the phrase "on Mac," which denotes has mac release = "yes." When presenting the linearized MR of phrase examples to the model encoder, we prepend and append phrase-specific start and stop tokens respectively (e.g., start-NP and stop-NP) to discourage the model from ever producing an incomplete sentence when generating for a complete MR.
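The phrase-enumeration step can be sketched as follows, assuming parse trees are available as nested (label, children...) tuples; the real system uses a constituency parser, and the toy tree below is our own illustration.

```python
# Constituent labels whose phrases become augmented examples.
PHRASE_LABELS = {"NP", "VP", "ADJP", "ADVP", "PP", "S", "SBAR"}

def leaves(tree):
    """Collect the token leaves of a (label, children...) tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1:] for w in leaves(child)]

def phrases(tree):
    """Yield (label, tokens) for every constituent with a kept label."""
    if isinstance(tree, str):
        return
    label = tree[0]
    if label in PHRASE_LABELS:
        yield label, leaves(tree)
    for child in tree[1:]:
        yield from phrases(child)

# Toy parse of an E2E-style utterance.
tree = ("S",
        ("NP", "Aromi"),
        ("VP", "is",
         ("NP", "a", "coffee", "shop"),
         ("PP", "in", ("NP", "the", "city", "centre"))))
```

Each yielded phrase would then be re-matched against the rules to derive its fragment MR, and phrases matching no attribute would be discarded.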

Datasets
We run our experiments on two English-language, task-oriented dialogue datasets: the E2E Challenge corpus (Novikova et al., 2017) and the ViGGO corpus (Juraska et al., 2019). The ViGGO corpus (5,103 train/246 dev/359 test) contains 14 attribute types and nine dialogue acts. In addition to binary and categorical valued attributes, the corpus also features list-valued attributes (see the genres attribute in Figure 1), which can have a variable number of values, and an open-class specifier attribute (see §A.1 for details).

MR/Utterance Alignments
The original datasets do not have alignments between individual attribute-value pairs and the sub-spans of the utterances they occur in, which we need for the AT linearization strategy. We manually developed a list of heuristic pattern matching rules (e.g. not kid-friendly → family friendly = "no"). For ViGGO, we started from scratch, but for E2E we greatly expanded the rule-set created by Dušek et al. (2019). To ensure the correctness of the rules, we iteratively added new matching rules, ran them on the training and validation sets, and verified that they produced the same MR as was provided in the dataset. This process took one author roughly two weeks to produce approximately 25,000 and 1,500 rules for the E2E and ViGGO datasets respectively. Note that the large number of rules is obtained programmatically, i.e. creating template rules and inserting matching keywords or phrases (e.g., enumerating variants such as not kid-friendly, non kid-friendly, not family-friendly, etc.).
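The matching rules might look roughly like the following sketch. The patterns and attribute names here are illustrative stand-ins; the real rule set contains tens of thousands of programmatically generated rules.

```python
import re

# Illustrative rules: ordered so that negated forms match before positive
# ones (e.g., "not kid-friendly" must not fire the positive rule).
RULES = [
    (re.compile(r"\bnot\s+(kid|family)[- ]friendly\b"),
     ("family_friendly", "no")),
    (re.compile(r"\b(kid|family)[- ]friendly\b"),
     ("family_friendly", "yes")),
    (re.compile(r"\bcoffee shop\b"),
     ("eat_type", "coffee shop")),
]

def match_attributes(utterance):
    """Return (attribute, value, span) triples in realization order."""
    found = []
    for pattern, (attr, value) in RULES:
        m = pattern.search(utterance.lower())
        # keep only the first rule that fires for each attribute
        if m and not any(attr == a for a, _, _ in found):
            found.append((attr, value, m.span()))
    return sorted(found, key=lambda t: t[2][0])
```

The resulting spans give both the MR check (compare matched attribute-values against the provided MR) and the AT ordering (sort by span start).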
In cases where the matching rules produced different MRs than provided in the original dataset, we manually checked them. In many cases on the E2E dataset and several times on ViGGO, we found the rule to be correct and the MR to be incorrect for the given utterance. In those cases, we used the corrected MRs for training and validation. We do not modify the test sets in any way. Using the matching rules, we can determine alignments between the provided MR and the realized utterances.
For most cases, the attribute-values uniquely correspond to a non-overlapping subspan of the utterance. The rating attribute in the ViGGO dataset, however, could have multiple reasonable mappings to the utterance, so we treat it in practice like an addendum to the dialogue act, occurring directly after the dialogue act as part of a "header" section in any MR linearization strategy (see Figure 2 where rating = "N/A" occurs after the dialogue act regardless of choice of linearization strategy).

Generation Models
We examine the effects of linearization strategy and data augmentation on a bidirectional GRU with attention (biGRU) and Transformer-based S2S models. Hyperparameters were found using grid-search, selecting the model with best validation BLEU (Papineni et al., 2002) score. We performed a separate grid-search for each architecture-linearization strategy pairing in case there was no one best hyperparameter setting.
Additionally, we fine-tune BART (Lewis et al., 2020), a large pretrained Transformer-based S2S model. We stop fine-tuning after validation set cross-entropy stops decreasing.
Complete architecture specification, hyperparameter search space, and validation results for all three models can be found in Appendix A.
Decoding When decoding at test time, we use beam search with a beam size of eight. Beam candidates are ranked by length-normalized log-likelihood. Similar to Dušek et al. (2019) and Juraska et al. (2019), we rerank the beam output to maximize the F-measure of correctly generated attribute-values using the matching rules described in §3.1.
For models using the RND linearization, at test time, we sample five random MR orderings and generate beam candidates for each. Reranking is then performed on the union of beam candidates.

Utterance Planner Model
We experiment with three approaches to creating a test-time utterance plan for the AT-trained models. The first is a bigram language model (BGUP) over attribute-value sequences. Attribute-value bigram counts are estimated from the training data (using Lidstone smoothing (Chen and Goodman, 1996) with α = 10⁻⁶) according to the ordering determined by the matching rules (i.e., the AT ordering).
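A bigram planner of this kind might be sketched as follows; the class name and data format are our own, and the paper's implementation details may differ.

```python
import math
from collections import defaultdict

ALPHA = 1e-6  # Lidstone smoothing constant, as in the paper

class BigramPlanner:
    """Bigram model over attribute-value tokens in AT (realization) order."""

    def __init__(self, training_plans):
        self.bigrams = defaultdict(lambda: defaultdict(float))
        self.vocab = set()
        for plan in training_plans:
            seq = ["<s>"] + plan + ["</s>"]
            self.vocab.update(seq)
            for prev, cur in zip(seq, seq[1:]):
                self.bigrams[prev][cur] += 1

    def logprob(self, prev, cur):
        # Lidstone-smoothed conditional log-probability log p(cur | prev)
        num = self.bigrams[prev][cur] + ALPHA
        den = sum(self.bigrams[prev].values()) + ALPHA * len(self.vocab)
        return math.log(num / den)

    def score(self, plan):
        seq = ["<s>"] + plan + ["</s>"]
        return sum(self.logprob(p, c) for p, c in zip(seq, seq[1:]))
```

Training plans here are just attribute(-value) token sequences extracted by the matching rules; orderings common in the training data score higher.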
The second model is a biGRU based S2S model, which we refer to as the neural utterance planner (NUP). We train the NUP to map IF ordered attribute-values to the AT ordering. We grid-search model hyperparameters, selecting the model with highest average Kendall's τ (Kendall, 1938) on the validation set AT orderings. See Appendix B for hyperparameter/model specification details. Unlike the BGUP model, the NUP model also conditions on the dialogue act, so it can learn ordering preferences that differ across dialogue acts.
For both BGUP and NUP, we use beam search (with beam size 32) to generate candidate utterance plans. The beam search is constrained to only generate attribute-value pairs that are given in the supplied MR, and to avoid generating repeated attributes. The search is not allowed to terminate until all attribute-values in the MR are generated. Beam candidates are ranked by log likelihood.
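The constrained beam search over plans can be sketched as follows, assuming a scoring function logprob(prev, cur) such as a bigram planner would supply; the function and variable names are illustrative.

```python
def plan_beam_search(logprob, mr_attrs, beam_size=32):
    """Order the attribute-values in mr_attrs via constrained beam search:
    only attribute-values from the MR are generated, none is repeated, and
    the search cannot terminate until every attribute-value is placed."""
    beam = [(0.0, ["<s>"], frozenset())]  # (log-likelihood, prefix, used)
    for _ in range(len(mr_attrs)):
        candidates = []
        for score, prefix, used in beam:
            for av in mr_attrs:
                if av in used:  # no repeated attributes
                    continue
                s = score + logprob(prefix[-1], av)
                candidates.append((s, prefix + [av], used | {av}))
        beam = sorted(candidates, key=lambda c: -c[0])[:beam_size]
    # every surviving candidate is a complete plan; add the end-of-plan score
    finished = [(s + logprob(p[-1], "</s>"), p[1:]) for s, p, _ in beam]
    return max(finished, key=lambda f: f[0])[1]
```

Because expansion is restricted to unused attribute-values from the input MR, every finished hypothesis is a valid permutation of the MR, mirroring the constraints described above.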
The final ordering we propose is the ORACLE ordering, i.e. the utterance plan implied by the human-authored test-set reference utterances. This plan represents the model performance if it had a priori knowledge of the reference utterance plan. When a test example has multiple references, we select the most frequent ordering in the references, breaking ties according to BGUP log-likelihood.

Test-Set Evaluation
In our first experiment, we compare performance of the proposed models and linearization strategies on the E2E and ViGGO test sets. For the IF and AT+NUP models, we also include variants trained on the union of the original training data and phrase-augmented data (see §2.2), which we denote +P.
Evaluation Measures For automatic quality measures, we report BLEU and ROUGE-L (Lin, 2004) scores.⁴ Additionally, we use the matching rules to automatically annotate the attribute-value spans of the model-generated utterances, and then manually verify/correct them. With the attribute-value annotations in hand, we compute the number of missing, wrong, or added attribute-values for each model. From these counts, we compute the semantic error rate (SER) (Dušek et al., 2020), where

SER = (#missing + #wrong + #added) / #attributes.
On ViGGO, we do not include the rating attribute in this evaluation since we consider it part of the dialogue act. Additionally, for AT variants, we report the order accuracy (OA) as the percentage of generated utterances that correctly follow the provided utterance plan. Utterances with wrong or added attribute-values are counted as not following the utterance plan. Additional metrics and SER error breakdowns can be found in Appendix C. All models are trained five times with different random seeds; we report the mean of all five runs. We report statistical significance using Welch's t-test (Welch, 1947), comparing the score distribution of the five runs from the best linearization strategy against all other strategies at the 0.05 level.
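The per-utterance SER and OA computations can be sketched as follows; this is a simplification that assumes each attribute appears at most once, which does not hold for list-valued attributes.

```python
def ser_and_oa(realized, plan):
    """realized and plan are (attribute, value) lists; realized is in the
    order the attribute-values appear in the generated utterance."""
    plan_attrs = {a: v for a, v in plan}
    real_attrs = {a: v for a, v in realized}
    missing = sum(1 for a in plan_attrs if a not in real_attrs)
    added = sum(1 for a in real_attrs if a not in plan_attrs)
    wrong = sum(1 for a, v in plan_attrs.items()
                if a in real_attrs and real_attrs[a] != v)
    ser = (missing + wrong + added) / len(plan_attrs)
    # OA: the plan is followed exactly, with no wrong or added values
    oa = (missing == wrong == added == 0) and \
        [a for a, _ in realized] == [a for a, _ in plan]
    return ser, oa
```

Corpus-level SER would aggregate the counts over all test utterances before dividing, rather than averaging per-utterance rates.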
Baselines On the ViGGO dataset we compare to the Transformer baseline of Juraska et al. (2019), which used a beam search of size 10 and heuristic slot reranker (similar to our matching rules).
On the E2E dataset, we report the results of TGen+ (Dušek et al., 2019), an LSTM-based S2S model, which also uses beam search with a matching rule based reranker to select the most semantically correct utterance and is trained on a cleaned version of the corpus (similar to our approach).

Random Permutation Stress Test
Differences between an AT model following an utterance planner model and one following the human oracle are often small, so test-set results tell us little about the limits of controllability of such models, or how they behave in extreme conditions (i.e., on an arbitrary, random utterance plan not drawn from the training data distribution). To perform such an experiment, we generate random utterance plans (i.e., permutations of attribute-values) and have the AT models generate utterances for them, which we evaluate with respect to SER and OA (we lack ground-truth references with which to evaluate BLEU or ROUGE-L). We generate random permutations of size 3, 4, ..., 8 on the E2E dataset, since there are 8 unique attributes in the E2E dataset. For ViGGO, we generate permutations of size 3, 4, ..., 10 (96% of the ViGGO training examples fall within this range). For each size, we generate 100 random permutations, and all generated plans are given the INFORM dialogue act. In addition to running the AT models on these random permutations, we also compare them to the same models after using the NUP to reorder the permutations into an easier⁵ ordering. Example outputs can be found in Appendix D.
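Sampling the random plans might look like the following sketch; the function name and attr_values format are our own illustrative choices.

```python
import random

def random_plans(attr_values, size, n=100, act="inform", seed=0):
    """Sample n random utterance plans of the given size. attr_values maps
    each attribute to its possible values (an assumed toy format)."""
    rng = random.Random(seed)
    plans = []
    for _ in range(n):
        # sample `size` distinct attributes, then a value for each;
        # the sample order itself is the (random) utterance plan
        attrs = rng.sample(sorted(attr_values), size)
        plans.append((act, [(a, rng.choice(attr_values[a])) for a in attrs]))
    return plans
```

Each sampled plan is then linearized directly as the AT-model input, with no reordering unless the NUP is applied first.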

Human Evaluation
In our final experiment, we had human evaluators rank the 100 outputs of the size-5 random permutations for three BART models on both datasets: (i) the AT+P model with NUP, (ii) the AT+P model, and (iii) the AT model. The first model, which uses an utterance planner, is likely to be more natural since it doesn't have to follow the random order, so it serves as a ceiling. The second and third models will try to follow the random permutation ordering, and are more likely to produce unnatural transitions between awkward sequences of attribute-values. Differences between these models will allow us to understand how the phrase-augmented data affects the fluency of the models. The annotators were asked to rank outputs by their naturalness/fluency. Each set was annotated twice by different annotators so we can compute agreement. More details can be found in Appendix E.

Results
AT models accurately follow utterance plans. See Table 2 and Table 3 for results on E2E and ViGGO test sets respectively. The best non-ORACLE results are bolded for each model and results that are not different with statistical significance to the best results are underlined. We see that the AT+NUP strategy consistently receives the lowest semantic error rate and highest order accuracy, regardless of architecture or dataset, suggesting that alleviating the model's decoder of content planning is highly beneficial to avoiding errors. The Transformer AT model is able to consistently achieve virtually zero semantic error on E2E using either the bigram or neural planner model.
We also see that fine-tuned BART is able to learn to follow an utterance plan as well. When following the neural utterance planner, BART is highly competitive with the trained-from-scratch Transformer on E2E and surpasses it on ViGGO in terms of semantic error rate.
Generally, the AT models had a smaller variance in test-set evaluation measures over the five random initializations as compared to the other strategies. This is reflected in some unusual equivalency classes by statistical significance. For example, among the E2E biGRU models, the AT+NUP+P strategy achieves 0% semantic error and is significantly different from all other linearization strategies except the FP strategy, even though the absolute difference from FP is 6.54%. This is unusual because AT+NUP+P is significantly different from AT+NUP even though that absolute difference is only 0.26%. This happens because the variance in test-set results is higher for FP, making it harder to show significance with only five samples.
Transformer-based models are more faithful than biGRU on the RND, FP, and IF linearizations. On the ViGGO dataset, the BART and Transformer IF models achieve 1.86% and 7.50% semantic error rate respectively, while the biGRU IF model has a 19.20% semantic error rate. These trends hold for FP and RND, and on the E2E dataset as well. Because there is no sequential correspondence in the input, it is possible that the recurrence in the biGRU makes it difficult to ignore spurious input ordering effects. Additionally, we see that RND does offer some benefits of denoising; RND models have a lower semantic error rate than IF models in 3 of 6 cases and FP models in 5 of 6 cases.
Model-based plans are easier to follow than human reference plans. On E2E, there is very little difference in semantic error rate when following either the bigram-based utterance planner, BGUP, or the neural utterance planner, NUP. This is also true of the ViGGO BART models. In the small-data (i.e., ViGGO) setting, biGRU and Transformer models achieve better semantic error rate when following the neural utterance planner. In most cases, neural utterance planner models have slightly higher BLEU and ROUGE-L than the bigram utterance planner, suggesting the neural planner produces utterance plans closer to the reference orderings. Both planner models yield slightly lower semantic error rate than following the ORACLE ordering.

Random Permutation Stress Test Results of the random permutation experiment are shown in Table 4. Overall, all models have an easier time following the neural utterance planner's reordering of the random permutations. Phrase training also generally improves semantic error rate. All models perform quite well on the E2E permutations: with phrase training, all E2E models achieve less than 0.6% semantic error rate following random utterance plans. Starker differences emerge on the ViGGO dataset. The biGRU+NUP+P model achieves an 8.98% semantic error rate and only correctly follows the given order 64.5% of the time, a large decrease in performance compared to the ViGGO test set.
Human Evaluation Results of the human evaluation are shown in Table 5. We show the number of times each system was ranked 1 (most natural), 2, or 3 (least natural) and the average rank overall.
Overall, we see that BART with the neural utterance planner and phrase-augmentation training is preferred on both datasets, suggesting that the utterance planner is producing natural orderings of the attribute-values, and the model can generate reasonable output for them. On the E2E dataset, we also see small differences between the AT+P and AT models, suggesting that when following an arbitrary ordering, the phrase-augmented model is about as natural as the non-phrase-trained model. This is encouraging, as the phrase-trained model has lower semantic error rates. On the ViGGO dataset, we do find that the phrase-trained model is less natural, suggesting that in the small-data setting, phrase training may hurt fluency when trying to follow a difficult utterance plan.
For agreement, we compute the average Kendall's τ between each pair of annotators for each dataset. On E2E, we have τ = .853, and on ViGGO, τ = .932, suggesting very strong agreement.
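Pairwise Kendall's τ between two annotators' rankings can be computed as follows (a sketch assuming rankings without ties):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings given as item -> rank dicts."""
    items = sorted(rank_a)
    # +1 for each concordant pair (same relative order), -1 for discordant
    concordance = sum(
        1 if (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y]) > 0 else -1
        for x, y in combinations(items, 2))
    n = len(items)
    return concordance / (n * (n - 1) / 2)
```

With ties (e.g., two systems ranked equally), a tie-corrected variant such as τ-b would be needed instead.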

Discussion
One consistently worrying sign throughout the first two experiments is that the automatic metrics are not good indicators of semantic correctness. For example, the ROUGE-L score of the E2E AT ORACLE models is about 8 points higher than that of the AT+NUP models, but the AT+NUP models make fewer semantic errors. Other similar examples can be found where the automatic metric would suggest picking the more error-prone model over another. As generating fluent text becomes a less difficult problem, these shallow n-gram overlap methods will cease to suffice as distinguishing criteria.
The second experiments also reveal limitations in the controllable model's ability to follow arbitrary orderings. The biGRU and Transformer models in the small-data ViGGO setting are not able to generalize effectively on non-training distribution utterance plans. BART performance is much better here, but is still hovering around 2% semantic error rate and only roughly 88% of outputs conform to the intended utterance plan. Thankfully, if an exact ordering is not required, using the neural utterance planner to propose an order leads to more semantically correct outputs.

Limitations
While we are able to achieve very low test-set SER for both corpora, we should caution that this required extensive manual development of matching rules to produce MR/utterance alignments, which in turn resulted in significant cleaning of the training datasets. We chose to do this over pursuing a model-based strategy of aligning utterance subspans to attribute-values because we wanted to better understand how systematically S2S models can represent arbitrary order permutations independent of alignment model error.
We should also note that data cleaning can yield more substantial decreases in semantic errors (Dušek et al., 2019; Wang, 2019) and is an important consideration in any practical neural NLG system.


Related Work

Prior work such as Nayak et al. (2017) also finds that properly aligned linearization can lead to a controllable generator. These papers do not, however, explore how other linearization strategies compare in terms of faithfulness, and they do not evaluate the degree to which an S2S model can follow realization orders not drawn from the training distribution.
Castro Ferreira et al. (2017) compare a S2S NLG model using various linearizations of abstract meaning representation (AMR) graphs, including a model-based alignment very similar to the AT linearization presented in this work. However, they evaluate only on automatic quality measures and do not explicitly measure the semantic correctness of the generated text or the degree to which the model realizes the text in the order implied by the linearized input.
Works like Moryossef et al. (2019a,b) and Castro Ferreira et al. (2019) show that treating various planning tasks as separate components in a pipeline, where the components themselves are implemented with neural models, improves the overall quality and semantic correctness of generated utterances relative to a completely end-to-end neural NLG model. However, they do not test the systematicity of the neural generation components, i.e., the ability to perform correctly when given an arbitrary or random input from the preceding component, as we do here with the random permutation stress test.
Other papers mention linearization order anecdotally but do not quantify its impact. For example, Juraska et al. (2018) experiment with random linearization orderings during development, but do not use them in the final model or report results using them, and Gehrmann et al. (2018) report that using a consistent linearization strategy worked best for their models but do not specify the exact order. Juraska et al. (2018) also used sentence-level data augmentation, i.e., splitting a multi-sentence example into multiple single-sentence examples, similar in spirit to our proposed phrase-based method, but they do not evaluate its effect independently.

Conclusion
We present an empirical study on the effects of linearization order and phrase based data augmentation on controllable MR-to-text generation. Our findings support the importance of aligned linearization and phrase training for improving model control. Additionally, we identify limitations to this ability, specifically in the small data, random permutation setting, and will focus on this going forward.

A.1 General Details
Utterance text was sentence and word tokenized, and all tokens were lower-cased. A special sentence-boundary token was inserted between sentences. All words occurring fewer than 3 times on the training set were replaced with a special unknown token. We used a batch size of 128 for all biGRU and Transformer models. All models were trained on a single Nvidia Tesla v100 for at most 700 epochs.
Delexicalization The ViGGO corpus is relatively small, and the attributes name, developer, release year, expected release date, and specifier can have values that are only seen several times during training. Neural models often struggle to learn good representations for infrequent inputs, which can, in turn, lead to poor test-set generalization. To alleviate this, we delexicalize these values in the utterance, i.e., we replace them with an attribute-specific placeholder token. Additionally, for specifier, whose values come from the open class of adjectives, we represent the specified adjective with a placeholder which marks two features: whether it is consonant (C) or vowel (V) initial (e.g., "dull" vs. "old") and whether it is in regular (R) or superlative (S) form (e.g., "dull" vs. "dullest"), since these features can affect the surrounding context in which the adjective is realized. See the following lexicalized/delexicalized examples:

• specifier = "oldest" (vowel initial, superlative)
  - What is the oldest game you've played?
  - What is the SPECIFIER V S game you've played?

• specifier = "old" (vowel initial, regular)
  - What is an old game you've played?
  - What is an SPECIFIER V R game you've played?

• specifier = "new" (consonant initial, regular)
  - What is a new game you've played?
  - What is a SPECIFIER C R game you've played?
All generated delexicalized utterances are postprocessed with the corresponding attribute-values before computing evaluation metrics (i.e., they are re-lexicalized with the appropriate value strings from the input MR).
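The specifier round trip can be sketched as follows. This is a simplified heuristic: the underscore placeholder spelling and the vowel/superlative tests are our own illustrative choices, not the paper's exact implementation.

```python
def specifier_placeholder(adj):
    """Mark vowel/consonant-initial (V/C) and superlative/regular (S/R)."""
    initial = "V" if adj[0].lower() in "aeiou" else "C"
    form = "S" if adj.lower().endswith("est") else "R"
    return f"SPECIFIER_{initial}_{form}"

def delexicalize(utterance, specifier):
    """Replace the specifier value with its feature-marked placeholder."""
    return utterance.replace(specifier, specifier_placeholder(specifier))

def relexicalize(utterance, specifier):
    """Restore the specifier value before computing evaluation metrics."""
    return utterance.replace(specifier_placeholder(specifier), specifier)
```

The same replace-and-restore pattern applies to the closed placeholders (name, developer, etc.), which need no feature marking.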

A.2 biGRU Model Definition
Let V be the encoder input vocabulary, and E ∈ R |V|×Dw an associated word embedding matrix where E x ∈ R Dw denotes the D w -dimensional embedding for each x ∈ V. Given a linearized MR π(µ) = x = a, x 1 , x 2 , . . . , x |µ| ∈ V m where the length of the sequence is m = |µ|+1, The hidden states of the first GRU encoder layer are computed aŝ (1) are the forward and backward encoder GRU parameters.
When using a two-layer GRU, we similarly compute
$$\overrightarrow{h}^{(2)}_i = \overrightarrow{\mathrm{GRU}}\left(h^{(1)}_i, \overrightarrow{h}^{(2)}_{i-1}; \overrightarrow{\eta}^{(2)}\right), \qquad \overleftarrow{h}^{(2)}_i = \overleftarrow{\mathrm{GRU}}\left(h^{(1)}_i, \overleftarrow{h}^{(2)}_{i+1}; \overleftarrow{\eta}^{(2)}\right), \qquad h^{(2)}_i = \left[\overrightarrow{h}^{(2)}_i ; \overleftarrow{h}^{(2)}_i\right] \in \mathbb{R}^{2D_h},$$
where $\overrightarrow{\eta}^{(2)}$ and $\overleftarrow{\eta}^{(2)}$ are the forward and backward encoder GRU parameters for the second layer.
Going forward, let $h_i$ denote the final encoder output, i.e. $h_i = h^{(1)}_i$ in the one-layer biGRU case and $h_i = h^{(2)}_i$ in the two-layer case. Let W be the vocabulary of utterance tokens, and $D \in \mathbb{R}^{|W| \times D_w}$ an associated embedding matrix, where $D_y \in \mathbb{R}^{D_w}$ denotes a $D_w$-dimensional embedding for each y ∈ W.
Given the decoder input sequence $\mathbf{y} = y_1, y_2, \ldots, y_{|\mathbf{y}|}$, let $w_i = D_{y_i}$ for i ∈ {1, …, n}, where $n = |\mathbf{y}| - 1$. We compute the hidden states of the j-th layer of the decoder as
$$g^{(j)}_i = \mathrm{GRU}\left(g^{(j-1)}_i, g^{(j)}_{i-1}; \zeta^{(j)}\right), \qquad g^{(0)}_i = w_i,$$
where the $\zeta^{(j)}$ are the decoder GRU parameters for layer j.
Going forward, let $g_i$ denote the final decoder output, i.e. $g_i = g^{(1)}_i$ in the one-layer biGRU case and $g_i = g^{(2)}_i$ in the two-layer case. The decoder states then attend to the encoder states,
$$\bar{h}_i = \sum_{j=1}^{m} \alpha_{i,j} h_j,$$
where $\alpha_{i,j} \in (0,1)$ is the attention weight of decoder state i on encoder state j and $\sum_{j=1}^{m} \alpha_{i,j} = 1$. The weights are a softmax over attention scores $s_{i,j}$, which we compute in one of two ways (the attention method is a hyperparameter option):
1. Feed-forward "Bahdanau" style attention (Bahdanau et al., 2015), also known as "concat" (Luong et al., 2015): $s_{i,j} = v^\top \tanh\left(W^{(a)}\left[g_i ; h_j\right]\right)$;
2. "general" (Luong et al., 2015): $s_{i,j} = g_i^\top W^{(a)} h_j$.
Finally, for i ∈ {1, …, n} we compute
$$o_i = \tanh\left(W^{(o_1)}\left[g_i ; \bar{h}_i\right] + b^{(o_1)}\right) \quad \text{and} \quad p\left(y_{i+1} \mid y_{\le i}, \pi(\mu)\right) = \mathrm{softmax}\left(W^{(o)} o_i + b^{(o)}\right).$$
As a hyperparameter setting, we consider tying the decoder input and output embedding matrices, i.e. $D = W^{(o)}$. Dropout of 0.1 is applied to all embeddings, GRU outputs, and linear-layer outputs. We set $D_w = D_h = 512$.
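The two attention-scoring variants can be sketched as follows (variable names and shapes are our own illustration, not the paper's implementation):

```python
import numpy as np

def attend(g, H, W, v=None, style="general"):
    # g: decoder state, shape (d,); H: encoder states, shape (m, d).
    # "general": s_j = g^T W h_j;  "concat": s_j = v^T tanh(W [g; h_j]).
    if style == "general":
        scores = H @ (W @ g)                                           # (m,)
    else:
        gh = np.concatenate([np.tile(g, (H.shape[0], 1)), H], axis=1)  # (m, 2d)
        scores = np.tanh(gh @ W.T) @ v                                 # W: (d_a, 2d), v: (d_a,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                                               # softmax over positions
    return alpha @ H, alpha                                            # context vector, weights
```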

A.3 biGRU Hyperparameter Search
We grid-search over the following hyperparameter values: • During hyperparameter search, we train for at most 500 epochs, evaluating BLEU every 25 epochs to select the best model. We decay the learning rate if validation log-likelihood stops increasing for five epochs, using lr_{i+1} = 0.99 × lr_i.
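Mechanically, the grid search enumerates the Cartesian product of the candidate values; a sketch (the grid contents below are placeholders, not the paper's actual search space):

```python
from itertools import product

def grid_configs(grid):
    # Enumerate every combination of hyperparameter values.
    keys = sorted(grid)
    return [dict(zip(keys, vals)) for vals in product(*(grid[k] for k in keys))]

def decay_lr(lr, patience_exceeded):
    # Multiply the learning rate by 0.99 once validation log-likelihood
    # has not improved for five epochs.
    return 0.99 * lr if patience_exceeded else lr
```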
Winning hyperparameter settings are presented in Table 6.

A.4 Transformer Model Definition
Each Transformer layer is divided into blocks, each of which has three parts: (i) layer norm, (ii) feed-forward/attention, and (iii) a skip-connection. We first define the components used in the Transformer blocks before describing the overall S2S Transformer. Starting with layer norm (Ba et al., 2016), let $H \in \mathbb{R}^{m \times n}$; then we have
$$\mathrm{LN}(H) = A \odot \frac{H - \mu}{\Lambda} + B,$$
where $a, b \in \mathbb{R}^n$ are learned parameters, ⊙ is the elementwise product, $A = [a, \ldots, a]^\top \in \mathbb{R}^{m \times n}$ and $B = [b, \ldots, b]^\top \in \mathbb{R}^{m \times n}$ are tilings of the parameter vectors m times, and $\mu, \Lambda \in \mathbb{R}^{m \times n}$ are defined elementwise as
$$\mu_{i,j} = \frac{1}{n} \sum_{k=1}^{n} H_{i,k}, \qquad \Lambda_{i,j} = \sqrt{\frac{1}{n} \sum_{k=1}^{n} \left(H_{i,k} - \mu_{i,k}\right)^2 + \epsilon},$$
respectively. The term ε is a small constant for numerical stability, set to $10^{-5}$. The in-place feed-forward layer, FF, is a simple single-hidden-layer perceptron with ReLU activation ($\mathrm{ReLU}(H) = \max(0, H)$) (Nair and Hinton, 2010), applied to each row of an m × n input matrix, i.e. a sequence of m objects with n features:
$$\mathrm{FF}(H) = \mathrm{ReLU}\left(H W^{(f_1)} + b^{(f_1)}\right) W^{(f_2)} + b^{(f_2)},$$
where $W^{(f_1)} \in \mathbb{R}^{n \times D_h}$, $b^{(f_1)} \in \mathbb{R}^{D_h}$, $W^{(f_2)} \in \mathbb{R}^{D_h \times n}$, and $b^{(f_2)} \in \mathbb{R}^{n}$ are learned parameters, and matrix-vector additions (i.e. X + b) are broadcast across the matrix rows.
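These two components can be sketched numerically in NumPy (a minimal illustration under our own naming, not the paper's code):

```python
import numpy as np

def layer_norm(H, a, b, eps=1e-5):
    # Normalize each row to zero mean / unit std over its features,
    # then scale by `a` and shift by `b`.
    mu = H.mean(axis=-1, keepdims=True)
    std = np.sqrt(H.var(axis=-1, keepdims=True) + eps)
    return a * (H - mu) / std + b

def feed_forward(H, W1, b1, W2, b2):
    # Position-wise perceptron with a ReLU hidden layer, applied row-wise.
    return np.maximum(0.0, H @ W1 + b1) @ W2 + b2
```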
The final component to be defined is multi-head attention, MultiAttn, which is defined as
$$\mathrm{MultiAttn}(Q, K, V) = \left[\mathrm{head}_1, \ldots, \mathrm{head}_H\right] W^{(a_2)}, \qquad \mathrm{head}_h = \mathrm{Attn}\left(Q W^{(a_1)}_{1,h},\; K W^{(a_1)}_{2,h},\; V W^{(a_1)}_{3,h}\right),$$
where [·] indicates column-wise concatenation, $W^{(a_1)}_{\cdot,h} \in \mathbb{R}^{D_w \times D_w / H}$ and $W^{(a_2)} \in \mathbb{R}^{D_w \times D_w}$ are learned parameters, H is the number of attention heads, and Attn is defined as
$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{D_w / H}}\right) V.$$
Additionally, there is a masked variant of attention, MultiAttn_M, where the attention is computed as
$$\mathrm{Attn}_M(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{D_w / H}} + M\right) V,$$
where $M \in \mathbb{R}^{n \times m}$ is a lower-triangular mask, i.e. values on or below the diagonal are 0 and all other values are −∞, so that each position attends only to itself and earlier positions. Given these definitions, we now define the S2S Transformer. Let V be the encoder input vocabulary, and $E \in \mathbb{R}^{|V| \times D_w}$ an associated word embedding matrix, where $E_x \in \mathbb{R}^{D_w}$ denotes the $D_w$-dimensional embedding for each x ∈ V. Given a linearized MR $\pi(\mu) = \mathbf{x} = a, x_1, x_2, \ldots, x_{|\mu|} \in V^m$, where the length of the sequence is m = |µ| + 1, additionally let $P \in \mathbb{R}^{m_{\max} \times D_w}$ be a sinusoidal position embedding matrix defined elementwise as
$$P_{i,2j} = \sin\left(\frac{i}{10{,}000^{2j / D_w}}\right), \qquad P_{i,2j+1} = \cos\left(\frac{i}{10{,}000^{2j / D_w}}\right).$$
The encoder input sequence $H^{(0)} \in \mathbb{R}^{m \times D_w}$ is then defined row-wise by $H^{(0)}_i = E_{x_i} + P_i$. A sequence of l Transformer encoder layers is then applied to the encoder input, i.e. $H^{(i+1)} = \mathrm{TF}^{(i)}_{\mathrm{enc}}\left(H^{(i)}\right)$. Each encoder Transformer layer computes the following:
$$\bar{H} = H + \mathrm{MultiAttn}\left(\mathrm{LN}(H), \mathrm{LN}(H), \mathrm{LN}(H)\right) \quad \text{(Self-Attention Block)}$$
$$\mathrm{TF}_{\mathrm{enc}}(H) = \bar{H} + \mathrm{FF}\left(\mathrm{LN}(\bar{H})\right) \quad \text{(Feed-Forward Block)}$$
We denote the final encoder output for l layers as $H = H^{(l)}$.
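A sketch of the masked (causal) attention variant in NumPy, where the −∞ entries above the diagonal zero out attention to future positions after the softmax (single head, our own illustration):

```python
import numpy as np

def masked_attn(Q, K, V):
    # Scaled dot-product attention with a causal mask: position i may
    # only attend to positions j <= i.
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    scores += np.triu(np.full((n, K.shape[0]), -np.inf), k=1)  # -inf above diagonal
    scores -= scores.max(axis=-1, keepdims=True)               # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```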
Let W be the vocabulary of utterance tokens, and D ∈ R |W|×Dw an associated embedding matrix, where D y ∈ R Dw denotes a D w -dimensional embedding for each y ∈ W.
A sequence of l Transformer decoder layers is then applied to the decoder input, i.e. $G^{(i+1)} = \mathrm{TF}^{(i)}_{\mathrm{dec}}\left(G^{(i)}, H\right)$. Each decoder Transformer layer computes the following:
$$\bar{G} = G + \mathrm{MultiAttn}_M\left(\mathrm{LN}(G), \mathrm{LN}(G), \mathrm{LN}(G)\right) \quad \text{(Masked Self-Attention Block)}$$
$$\tilde{G} = \bar{G} + \mathrm{MultiAttn}\left(\mathrm{LN}(\bar{G}), H, H\right) \quad \text{(Encoder-Attention Block)}$$
$$\mathrm{TF}_{\mathrm{dec}}(G, H) = \tilde{G} + \mathrm{FF}\left(\mathrm{LN}(\tilde{G})\right) \quad \text{(Feed-Forward Block)}$$
Let $G = G^{(l)}$ denote the final decoder output, and let $g_i$ be the i-th row of G, corresponding to the decoder representation of the i-th decoder state. The probability of the next word is
$$p\left(y_{i+1} \mid y_{\le i}, \pi(\mu)\right) = \mathrm{softmax}\left(W^{(o)} g_i + b^{(o)}\right).$$
The input embedding dimension is $D_w = 512$ and the inner hidden layer size is $D_h = 2048$. The encoder and decoder have separate parameters. We used H = 8 heads in all multi-head attention layers. We used Adam with the learning rate schedule provided in Rush (2018) (factor = 1, warmup = 8000). Dropout of 0.1 was applied to the input embeddings and within each skip connection (i.e. to the sub-layer output before the residual addition). As a hyperparameter, we optionally tie the decoder input and output embeddings, i.e. $D = W^{(o)}$.
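The referenced schedule (from "The Annotated Transformer") warms the learning rate up linearly for `warmup` steps and then decays it with the inverse square root of the step count; a minimal sketch:

```python
def noam_lr(step, d_model=512, factor=1.0, warmup=8000):
    # Linear warmup for `warmup` steps, then inverse-sqrt decay.
    step = max(step, 1)
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```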

A.5 Transformer Hyperparameter Search
We grid-searched over the following Transformer hyperparameters: • Tied Decoder Embeddings: tied, untied

A.6 BART Model Hyperparameters
We use the same settings as for fine-tuning on the CNN-DailyMail summarization task, although we modify the maximum number of updates to be roughly equivalent to 10 epochs on the training set when using a 500-token batch size, since the number of updates affects the learning rate scheduler. We selected the model iterate with the lowest validation-set cross-entropy.
While BART is unlikely to have seen any linearized MRs in its pretraining data, its use of subword encoding allows it to encode arbitrary strings. Rather than extending its encoder input vocabulary to add the MR tokens, we simply format the input MR as a string (in the corresponding linearization order), e.g., "inform rating=good name=NAME platforms=PC platforms=Xbox".
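For illustration, this string formatting amounts to joining the dialogue act with `attribute=value` pairs in the chosen linearization order (the helper name is ours):

```python
def mr_to_string(dialogue_act, slots):
    # `slots` is an ordered list of (attribute, value) pairs in the chosen
    # linearization order; multi-valued attributes simply repeat.
    return " ".join([dialogue_act] + [f"{a}={v}" for a, v in slots])
```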

A.7 Validation Results
Validation set results are shown in Table 8 and Table 9 for the E2E and ViGGO datasets, respectively. Unlike the test results reported in the main paper and appendix, validation SER and OA are computed automatically and are not manually validated. All results are the average of five random initializations. Also, we use the corrected MRs produced by our attribute-value matching rules as input, rather than the original validation-set MRs.

C Expanded Test Set Results
We show the full automatic evaluation metrics from the official E2E evaluation script. E2E and ViGGO results are shown in Table 11 and Table 12, respectively. We also show full manual semantic evaluation results in Table 13 and Table 14 for E2E and ViGGO, respectively. We break out the counts of missing, wrong, and added attributes used for the SER calculation. Wrong attributes occur when an attribute is realized with the wrong value. Added attributes indicate that the model realized an attribute-value that was not given in the input MR. Repeated attributes, even when specified in the input MR, are included in the added counts. We also include the percentage of utterances with correct semantics regardless of order (Perf.