Template Guided Text Generation for Task Oriented Dialogue

Virtual assistants such as Google Assistant, Amazon Alexa, and Apple Siri enable users to interact with a large number of services and APIs on the web using natural language. In this work, we investigate two methods for Natural Language Generation (NLG) using a single domain-independent model across a large number of APIs. First, we propose a schema-guided approach which conditions the generation on a schema describing the API in natural language. Our second method investigates the use of a small number of templates, growing linearly in the number of slots, to convey the semantics of the API. To generate utterances for an arbitrary slot combination, a few simple templates are first concatenated to give a semantically correct, but possibly incoherent and ungrammatical utterance. A pre-trained language model is subsequently employed to rewrite it into coherent, natural sounding text. Through automatic metrics and human evaluation, we show that our method improves over strong baselines, is robust to out-of-domain inputs and shows improved sample efficiency.


Introduction
Virtual assistants have become popular in recent years, and task completion is one of their most important aspects. These assistants help users accomplish tasks such as finding restaurants, buying sports tickets and checking the weather by providing a natural language interface to many services and APIs available on the web. Most systems include a natural language understanding and dialogue state tracking module for semantic parsing of the dialogue history. This is followed by a policy module which interacts with the APIs, whenever required, and generates the actions to be taken by the system to continue the dialogue. Finally, the Natural Language Generation (NLG) module converts these actions into an utterance, which is surfaced to the user. Being the user-facing interface of the dialogue system, NLG is one of the most important components impacting user experience.
Traditional NLG systems heavily utilize a set of templates to produce system utterances. Although the use of templates gives good control over the outputs generated by the system, defining templates becomes increasingly tedious as more APIs are added. Supporting multi-domain conversations spanning multiple APIs quickly grows out of hand, requiring expert linguists and rigorous testing to ensure the grammatical correctness and appropriateness of generated utterances. Consequently, data-driven generative approaches have gained prominence. Such systems require much less effort and can generate utterances containing novel patterns. Meanwhile, with the rapid proliferation of personal assistants, supporting a large number of APIs across multiple domains has become increasingly important, resulting in research on supporting new APIs with few labelled examples (few-shot learning). To this end, generative models pre-trained on large amounts of unannotated text have been increasingly successful.
In this work, we address the challenges of joint modeling across a large number of domains and data-efficient generalization to new domains and APIs for NLG. Our contributions are the following:

1. We propose two methods for zero-shot and few-shot NLG. Our first method, Schema-Guided NLG, represents slots using their natural language descriptions. Our second method, Template Guided Text Generation (T2G2), employs a simple template-based representation of system actions and formulates NLG as an utterance rewriting task (Figure 1).

Figure 1: Overall architecture of our proposed template guided approach. 1. The policy module outputs a set of actions in response to the user utterance. 2. Simple templates convert each action into a natural language utterance. 3. Template-generated utterances are concatenated and fed to a T5 encoder-decoder model (Raffel et al., 2020), which rewrites them into a conversational response surfaced to the user.
2. We present the first NLG results on the Schema-Guided dialogue dataset (Rastogi et al., 2019), which exceeds all other datasets in scale, providing a total of 45 APIs over 20 domains. While the current state-of-the-art pre-training based methods struggle to generalize to unseen (zero-shot) APIs, our proposed methods are robust to out-of-domain inputs and display improved sample efficiency.
3. We conduct an extensive set of experiments to investigate the role of dialogue history context, cross-domain transfer learning and few-shot learning. We share our findings to guide the design choices in future research.

Related Work
Natural language generation from structured input (NLG) has been an active area of research, facilitated by the creation of datasets like WikiBio (Lebret et al., 2016), the E2E challenge (Novikova et al., 2017), WebNLG (Gardent et al., 2017) and MultiWOZ (Budzianowski et al., 2018). Neural sequence models have been extensively used in a variety of configurations for NLG in dialogue systems. Wen et al. (2017) proposed a two-step approach: first generating a delexicalized utterance with placeholders for slots, and then post-processing it to replace the placeholders with values from API results, whereas Nayak et al. (2017) highlighted the importance of conditioning responses on slot values. Sequence-to-sequence architectures directly converting a sequential representation of system actions to a system response are also very common (Wen et al., 2015; Dušek and Jurčíček, 2016b; Zhu et al., 2019; Chen et al., 2019). Domain adaptation and transfer learning in low-resource settings have also been extensively studied (Tran and Le Nguyen, 2018; Chen et al., 2020; Peng et al., 2020; Mi et al., 2019), with recently released datasets like SGD (Rastogi et al., 2019) and FewShotWOZ (Peng et al., 2020) providing good benchmarks. Meanwhile, language models pre-trained on large amounts of unannotated text have achieved state-of-the-art performance across several natural language processing tasks (Devlin et al., 2019; Yang et al., 2019; Liu et al., 2019; Radford et al., 2019; Keskar et al., 2019), including natural language generation (Peng et al., 2020; Kale and Roy, 2020).
Our template based approach bears similarities to sentence fusion (Barzilay and McKeown, 2005), and prototype based text editing (Hossain et al., 2020;Cao et al., 2018;Guu et al., 2018;Wu et al., 2019). However, none of these works tackle text generation from structured data.

Approach

Representation of System Actions

For a given system dialogue turn, let A = {a_1, ..., a_A} be the set of actions produced by the system, where A is the total number of actions for this turn. Each action consists of a single dialogue act d_i representing the semantics of the action, along with optional slot and value parameters, s_i and v_i respectively. For example, inform, req_more and request are some of the dialogue acts defined in the SGD dataset (Rastogi et al., 2019): they are used for informing the value of a slot to the user, asking if the user needs some other help, and requesting the value of a slot from the user, respectively. Some acts like inform require both the slot and value parameters, whereas acts like request require only the slot parameter, and acts like req_more require none. Some datasets allow multiple slot-value arguments for a single act, but such actions can generally be converted to the above representation by decomposing them into multiple actions with the same act, each containing exactly one slot-value pair.

The goal of NLG is to translate A into a natural language response with the same semantic content. To this end, we first convert the set A into a sequence. Then, we fine-tune a Text-to-Text Transfer Transformer (T5) (Raffel et al., 2020) model, a pre-trained sequence-to-sequence transformer, to generate the natural language response from this sequence. We now present three different methods for converting A into a sequence, the last two being our contributions. They are summarized in Figure 2.

Figure 2: An example showing the representation of system actions under the three schemes. Naive: inform ( restaurant = Opa! ) inform ( cuisine = greek ). Schema Guided: inform ( name of restaurant = Opa! ) inform ( type of food served = greek ). Template Guided: "How about the restaurant Opa!. The restaurant serves greek food." Ground Truth: "Opa! is a nice greek restaurant. How does it sound?" The template representation is generated by concatenating sentences obtained from two templates: "inform(restaurant = $x) → How about the restaurant $x." and "inform(cuisine = $x) → The restaurant serves $x food."

Naive Representation
This approach uses the most basic representation of actions, similar to that used in many prior works (Novikova et al., 2017; Zhu et al., 2019; Peng et al., 2020). Each action a_i is written canonically as a_i, a_i(s_i) or a_i(s_i = v_i), depending on the parameters present in the action, and these representations are concatenated together to obtain a sequence representation of A. Although this representation is simple to obtain and gives state-of-the-art results for several data-to-text benchmarks (Kale and Rastogi, 2020), it suffers from two drawbacks. (i) Semantics: this representation does not convey much information about the semantics of a slot. Consequently, the model may need a larger number of training examples to identify the semantics of a slot from its usage in the system utterances in the training data.
(ii) Representation bias: this representation is very different from what the encoder saw during the pre-training phase, which is natural language text. As a result, the representations learnt during pre-training may not transfer well. Peng et al. (2020) mitigate this by conducting additional pre-training using large-scale annotated dialogue datasets. While this method is effective, a large in-domain corpus may not always be available.

Schema Guided Representation
Recent work on low-resource natural language understanding tasks has used natural language descriptions of slots. These descriptions are easy to obtain, directly encode the semantics of the slot, and have been shown to help when in-domain training data is sparse. While description-based representations have become popular for tasks like spoken language understanding (Bapna et al., 2017) and dialogue state tracking (Rastogi et al., 2019), they have not yet been applied to the language generation task. We propose an extension of the Naive representation that replaces the slot names with their natural language descriptions. The action representations, as illustrated in Figure 2, are a_i, a_i(desc(s_i)) and a_i(desc(s_i) = v_i), where desc(s) denotes a natural language description of slot s. This addresses the first drawback of the Naive representation mentioned above.
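The Naive and Schema-Guided serializations above can be sketched in a few lines. In this sketch, the action tuples and the slot descriptions are illustrative stand-ins for the API schema, not the exact data structures or formatting used in the paper.

```python
# Sketch of the Naive and Schema-Guided serializations of a system turn.
# The Action tuples and slot descriptions below are illustrative only.

# An action: (dialogue act, optional slot, optional value).
ACTIONS = [
    ("inform", "restaurant", "Opa!"),
    ("inform", "cuisine", "greek"),
]

# Hypothetical natural language descriptions, as found in an API schema.
SLOT_DESCRIPTIONS = {
    "restaurant": "name of restaurant",
    "cuisine": "type of food served",
}

def serialize(actions, use_descriptions=False):
    """Concatenate canonical representations a, a(s) or a(s=v) into one sequence."""
    parts = []
    for act, slot, value in actions:
        name = SLOT_DESCRIPTIONS.get(slot, slot) if use_descriptions else slot
        if slot is None:
            parts.append(act)                          # e.g. req_more
        elif value is None:
            parts.append(f"{act} ( {name} )")          # e.g. request ( dest )
        else:
            parts.append(f"{act} ( {name} = {value} )")
    return " ".join(parts)

print(serialize(ACTIONS))                        # Naive representation
print(serialize(ACTIONS, use_descriptions=True)) # Schema-Guided representation
```

Either sequence would then be fed as the input side of the fine-tuned T5 model.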

Template Guided Representation
We solve the representation bias problem by converting the set of actions output by the system into a natural language utterance. We employ a technique similar to that used in Rastogi et al. (2019), where simple utterances are generated using a minimal set of manually defined templates. Specifically, as shown in Figure 3, we define one template for each system action.

Figure 3: Example templates for a ride-sharing API.
Action → Template
notify_success → Your ride is booked and the cab is on its way.
goodbye → Have a safe ride!
request(dest) → Where are you riding to?
request(shared) → Are you comfortable sharing the ride?
confirm(dest=$x) → You are going to $x.
inform(fare=$x) → Your ride costs $x dollars.
inform(seats=$x) → The cab is for $x riders.

Note that our focus here is not to generate conversational and grammatically correct utterances, but to have a simple representation of the actions which can be rewritten by the model into a natural and fluent response. Hence, we do not need to cover all the edge cases typically required in template-based methods (handling of plurals, subject-verb agreement, morphological inflection, etc.) and only need to define a small number of templates. For most APIs, this amounts to around 15-30 templates, which can easily be written by the API developer; the actual number varies depending on the number of slots and intents supported by the API. Some special slots like date, time and price are formatted using special rules, which can be reused across APIs. For instance, we convert the date "2019-03-06" to "6th March", the time "18:40" to "6:40 pm", and the price "60" to "$60". We call this step value paraphrasing. Since this method relies on a combination of templates and transfer learning from language models, we name it Template Guided Text Generation (T2G2).
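A minimal sketch of this template step, combining per-action templates with reusable value paraphrasing rules, might look as follows. The template table and the exact formatting rules are hypothetical illustrations of the idea, not the authors' released templates.

```python
# Sketch of the template-guided representation: per-action templates are
# filled and concatenated; a few special slot types are first paraphrased
# into a reader-friendly form. Templates and rules here are illustrative.
from datetime import datetime

TEMPLATES = {
    ("inform", "fare"): "Your ride costs $x dollars.",
    ("inform", "seats"): "The cab is for $x riders.",
    ("request", "dest"): "Where are you riding to?",
    ("notify_success", None): "Your ride is booked and the cab is on its way.",
}

def paraphrase_value(slot, value):
    """Reusable formatting rules for special slots (value paraphrasing)."""
    if slot == "date":  # "2019-03-06" -> "6th March"
        d = datetime.strptime(value, "%Y-%m-%d")
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(
            d.day % 10 if d.day not in (11, 12, 13) else 0, "th")
        return f"{d.day}{suffix} {d.strftime('%B')}"
    if slot == "time":  # "18:40" -> "6:40 pm"
        t = datetime.strptime(value, "%H:%M")
        return t.strftime("%I:%M %p").lstrip("0").lower()
    if slot == "price":  # "60" -> "$60"
        return f"${value}"
    return value

def template_utterance(actions):
    """Fill one template per action and concatenate the sentences."""
    parts = []
    for act, slot, value in actions:
        template = TEMPLATES[(act, slot)]
        if value is not None:
            template = template.replace("$x", paraphrase_value(slot, value))
        parts.append(template)
    return " ".join(parts)

print(template_utterance([("inform", "fare", "9"), ("inform", "seats", "2")]))
# -> "Your ride costs 9 dollars. The cab is for 2 riders."
```

The resulting crude but semantically complete utterance is what the T5 model rewrites into a fluent response.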

Experimental Setup
We conduct a series of experiments to compare the three system action representations presented above. We also evaluate NLG in few-shot settings and investigate a few other aspects of the SGD dataset. In each of the experiments reported in this paper, we start with a pre-trained T5-small model. It has 6 layers each in the encoder and decoder, with a total of around 60 million parameters. The model is then fine-tuned on the corresponding dataset using a constant learning rate of 0.001 and a batch size of 256 for 5,000 steps. The checkpoint yielding the highest BLEU score on the development set is picked for reporting test set results. During inference, we use beam search with a width of 4 and length penalty α = 0.6.

Action Representations
We compare the three action representations on the MultiWOZ, E2E and SGD datasets. The scale of SGD (Figure 4) makes it representative of practical scale-related challenges faced by today's virtual assistants. Furthermore, as opposed to the other two datasets, its evaluation sets contain many domains, and consequently slots, that are not present in the training set. Even for domains shared between the training and evaluation sets, the evaluation sets contain additional slots in some cases. This focus on zero-shot generalization to new domains and APIs makes SGD more challenging than existing NLG benchmarks. Table 1 compares these datasets.

Automatic Evaluation
Following prior work (Wen et al., 2015), we use BLEU (Papineni et al., 2002) and Slot Error Rate (SER) (Dušek and Jurčíček, 2019) as automatic metrics. SER represents the fraction of generated texts where at least one slot was not correctly copied from the structured data. Since this metric relies on string matching, we cannot use it to evaluate binary slots like has live music. Its exact-match nature also prevents it from identifying paraphrases of slot values, e.g. expensive and costly. For E2E, we use the additional metrics used in prior work for this benchmark: NIST (Doddington, 2002), ROUGE-L (Lin, 2004), METEOR (Lavie and Agarwal, 2007), and CIDEr (Vedantam et al., 2015).

Table 2 lists results on MultiWOZ and Table 3 on E2E. Our performance on MultiWOZ is slightly worse in comparison with SC-GPT. SC-GPT generates 5 predictions for each input and then ranks them based on the SER score itself; we, on the other hand, generate a single output, on which SER is evaluated. Overall, the results indicate that with enough annotated data, the Naive approach is enough to attain good performance. Both datasets are large and feature limited variety (MultiWOZ has 57K utterances spread over just 5 domains, while E2E has 33K utterances spread over just 8 slots). Zero-shot and few-shot settings offer a greater and more realistic challenge, and we explore these settings next. The SGD dataset, which spans 20 domains, enables us to study them.

Adaptation to New Domains The ideal NLG model should be able to handle domains it was not exposed to during training. The SGD dataset, which features unseen domains in the evaluation sets, lets us assess the zero-shot capability of NLG systems. We report results in Table 4 on two test sets: the seen set consists of domains that were seen during training, while the unseen set consists of brand new domains, i.e. the zero-shot setting.
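The string-matching nature of SER can be made concrete with a short sketch. This is a deliberately simplified version: real implementations handle delexicalization and special slot formats, and, as noted above, binary slots cannot be checked this way.

```python
# Simplified sketch of Slot Error Rate (SER): the fraction of generated
# responses in which at least one slot value from the structured input
# does not appear verbatim. Real implementations are more careful; this
# only performs case-insensitive substring matching.

def slot_error_rate(examples):
    """examples: list of (slot_values, generated_text) pairs."""
    errors = 0
    for values, text in examples:
        if any(v.lower() not in text.lower() for v in values):
            errors += 1
    return errors / len(examples)

examples = [
    (["Opa!", "greek"], "Opa! is a nice greek restaurant."),  # all slots present
    (["2:40 pm", "$78"], "The flight leaves at 2:40 pm."),    # misses "$78"
]
print(slot_error_rate(examples))  # -> 0.5
```

Note how a paraphrased value (e.g. "costly" for "expensive") would be counted as an error here, which is exactly the limitation discussed above.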
Firstly, all models exhibit low SER scores in both seen and unseen domains, with the template approach being the lowest. This suggests that pre-trained language models are adept at copying and this skill also generalizes to out-of-domain examples.

BLEU On seen domains, the Schema-Guided representation performs on par with the Naive representation. At the same time, the slot descriptions do improve performance on the unseen domains (+0.9 BLEU), albeit to a limited degree. More effective ways of incorporating descriptions are a promising area for future work. For the seen domains, T2G2 outperforms Naive by 1.7 BLEU. The results on the unseen domains are more striking, with an improvement of 7.3 points. This confirms the hypothesis that our simple template-based input scheme offers superior generalization capabilities with a low overhead. The template model learns to "fuse" sentences and successfully extends this skill to unseen domains.

Figure 5: Example inputs and model outputs for three domains.

Flights
Actions
offer(airlines=American Airlines, outbound departure time=2:40 pm, is nonstop=True, price=$78)
Template
Would you like to fly with American Airlines? The onward flight takes off at 2:40 pm. It is a direct flight. The ticket costs $78.
Reference
Do you wish to fly on American Airlines? The onward direct flight leaves at 2:40 pm, and the ticket price is $78.
Naive
There is an American Airlines flight that leaves at 2:40 pm and is not stop. The ticket costs $78.
Schema
There is an American Airlines flight that leaves at 2:40 pm and is not stop. The total cost is $78.
T2G2
How about an American Airlines flight that leaves at 2:40 pm? It's a direct flight and costs $78.

RentalCars
Actions
confirm(has insurance=True, car type=hatchback)
Template
Please confirm the following details: It is a Hatchback car. It comes with insurance.
Reference
Please confirm: You want me to book a Hatchback car with insurance.
Naive
Please confirm: You want me to reserve a Hatchback car with a reassurance.
Schema
Please confirm: You want to rent a Hatchback.
T2G2
Okay, please confirm that you'd like to have a Hatchback car with an insurance.

Media
Actions
inform(starring=Antonio Bustorff)
Template
Antonio Bustorff acted in it.
Reference
The actor is Antonio Bustorff.
Naive
The movie is directed by Antonio Bustorff.
Schema
The movie is directed by Antonio Bustorff.
T2G2
Antonio Bustorff acted in the movie.

Qualitative Analysis
In Figure 5 we list a few examples of model predictions. The first example illustrates a case where the model has to deal with a seen domain Flights but an unseen slot is nonstop. Such a case would be common when new functionality needs to be added to an existing domain. Both Naive and Schema are unable to verbalize the slot correctly. While the template input contains all the information, it sounds very robotic. T2G2, on the other hand, takes the 4 template sentences as input and rewrites them into a fully accurate but much more natural sounding response. The next example is from RentalCars, and features an unseen slot has insurance. Schema fails to mention this slot. Naive attempts to verbalize it, but uses the wrong word (reassurance). T2G2, however, is able to paraphrase the template input into grammatical text without dropping any information.
The final example features an unseen slot starring from the Movies domain. Naive and Schema treat Antonio Bustorff as a director, since the slot directed by appears during training. However, T2G2 simply relies on the template input and copies the phrase acted in. We refer the reader to Appendix F for more qualitative examples.

Human Evaluation
We conduct a human evaluation study via crowdsourcing. Each human rater is shown the responses generated by the different models and the ground truth response in a random order. Following Peng et al. (2020), they are asked to rate each response on a scale of 1 (bad) to 3 (good) along two axes: informativeness and naturalness. Informativeness quantifies whether the response contains all the information present in the dialogue acts, whereas naturalness evaluates whether the response sounds coherent, grammatical and natural. Each example is rated by 3 different workers, and the final metric is an average of all the ratings.
A total of 500 randomly chosen examples are rated, 250 each from seen and unseen domains, across the 3 models discussed above and the ground truth response (human). With 3 ratings per example, this leads to a total of 6,000 ratings. Results are shown in Table 5.

Naturalness On the overall test set, all models outperform the human-authored ground truth. This showcases the strength of pre-trained language models in generating natural sounding utterances, echoing findings from prior work (Radford et al., 2019; Peng et al., 2020).
Informativeness Simply generating a fluent response is not enough. It is paramount for responses to be factually grounded in the structured data, so that wrong information is not conveyed to the user. For informativeness, we notice that all models perform well on the seen domains. However, on unseen domains, the Naive approach fares poorly. Schema outperforms Naive by a large margin on unseen domains, and T2G2 further improves upon Schema. These results suggest that Schema and T2G2 offer promising avenues for improving the zero-shot generalization capability of NLG systems. Moreover, while both Naive and Schema see large drops on unseen domains, T2G2 performs equally well on both seen and unseen domains.
Recall that Naive representation demonstrated strong scores on the SER metric for unseen domains. However, the low human scores on informativeness suggest that getting perfect scores on metrics like SER may not be a reliable way to judge factual accuracy. As models become stronger, better evaluation metrics need to be developed to accurately measure the improvements.

Few-Shot NLG
Virtual assistants need to support a constantly increasing number of domains and APIs. In order to keep labelled data costs under control, improving few-shot learning methods is important. In this section, we study the trade-off between the number of annotated training examples and performance of NLG.  Prior work (Mi et al., 2019;Tran and Le Nguyen, 2018;Wen et al., 2016) has studied few-shot learning and domain adaptation in a simulated setting by creating small subsets. However, lack of knowledge of the exact data splits makes it difficult to make comparisons to other methods. To remedy this, we create a new canonical split of the SGD dataset as described below.
• We create K-shot subsets for K ∈ {5, 10, 20, 40, 80}. In this setting, each of the 14 domains in the training set has K dialogues.
• For all the few-shot splits, we make sure that they contain examples for every dialogue act and slot type present in the full training set. That is, for every domain, each dialogue act (inform, request, etc.) and slot (name, time, price, etc.) is represented at least once; however, all combinations of dialogue acts and slots may not exist.
• The dev and test sets are left untouched.
This benchmark is referred to as FewShotSGD and we make the exact splits publicly available. The exact number of examples in each split is given in Table 6.
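The coverage constraint above can be met with a simple greedy selection, sketched below. This is an illustrative reconstruction of the stated constraints, not the procedure used to build the released FewShotSGD splits; the dialog structure is hypothetical.

```python
# Illustrative sketch of building a K-shot split with act/slot coverage:
# greedily pick dialogs that cover not-yet-seen (act, slot) pairs, then
# fill up to K dialogs per domain at random. The released FewShotSGD
# splits are canonical; this only mirrors the constraints stated above.
import random

def k_shot_split(dialogs_by_domain, k, seed=0):
    rng = random.Random(seed)
    chosen = {}
    for domain, dialogs in dialogs_by_domain.items():
        needed = {pair for d in dialogs for pair in d["pairs"]}
        pool = list(dialogs)
        rng.shuffle(pool)
        picked = []
        for d in pool:  # first pass: cover every (act, slot) pair once
            if needed & set(d["pairs"]):
                picked.append(d)
                needed -= set(d["pairs"])
            if not needed or len(picked) == k:
                break
        for d in pool:  # second pass: top up to K dialogs
            if len(picked) == k:
                break
            if d not in picked:
                picked.append(d)
        chosen[domain] = picked
    return chosen

dialogs = {"Restaurants": [
    {"id": 1, "pairs": (("inform", "name"),)},
    {"id": 2, "pairs": (("request", "time"),)},
    {"id": 3, "pairs": (("inform", "name"), ("request", "time"))},
]}
split = k_shot_split(dialogs, k=2)
print([d["id"] for d in split["Restaurants"]])
```

As in the real splits, coverage of individual acts and slots is guaranteed, but not coverage of every act-slot combination.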

Results
In the few-shot experiments, we examine the performance of different models as a function of the amount of labelled data. The training setup remains the same as described in Section 4. Results are reported in Figure 6, where we can clearly see performance improving as more training data becomes available. In all the K-shot settings, T2G2 gives consistent improvements of 4-5 BLEU while reducing the SER by a large margin. Even in the extreme 5-shot setting, the SER is just 3.6%. Remarkably, T2G2 in the 80-shot setting outperforms the Naive model trained on the entire dataset, which is 20x larger, and in the 5-shot setting T2G2 performs on par with 80-shot Naive. We take this as evidence that our template guided input representation can lead to a significant reduction in labelled data requirements.

Other Experiments
In this section, we conduct experiments to explore a few other aspects of our setup on the SGD dataset.
For these experiments, we use the Naive representation, since it is more widely adopted in prior work. We hope these experiments will guide design choices in future NLG models.

Joint Modeling
Joint modeling, instead of domain specific models, could be beneficial in low resource settings if there is some similarity between the underlying structure. Furthermore, having a single model for all domains also reduces the maintenance workload and is resource efficient. For NLG systems, it could also help in maintaining consistent styles across domains and APIs.
Because of these merits, we investigate the effect of joint modeling on the SGD dataset. We focus on the 12 domains that are present in all 3 splits (train, dev and test). We train a single model on all these domains and compare it with individual models trained for each domain separately. As shown in Table 7, joint modeling leads to a win-win situation, improving BLEU by 3.4 points and reducing SER from 4.7% to just 1%, while requiring fewer parameters and resources. For further analysis of transfer learning across domains, we refer the reader to Appendix C.

Role of Context
Dialogue acts represent the semantic content of the system response, but they do not contain any information about its lexical and syntactic content. The previous utterances in the dialogue history, or context, are important for generating good responses because they help the model capture conversational phenomena such as co-reference, elision and entrainment (lexical and syntactic alignment of responses), and avoid repetition (Dušek and Jurčíček, 2016a). Context also helps add variation to the responses generated across different conversations for the same system actions. Table 8 shows the performance of NLG as more utterances from the dialogue context are given as input. In these experiments, we concatenate the last k utterances to the system action representation obtained from the Naive method. The model benefits from the additional context, showing an improvement of up to 6 BLEU. Just a single context utterance, the previous user utterance, results in an improvement of nearly 3 BLEU.
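Constructing the model input with context is straightforward, and can be sketched as below. The plain-space concatenation and utterance ordering are illustrative choices; the paper does not specify separator tokens.

```python
# Sketch of adding dialogue context: the last k utterances of the history
# are concatenated in front of the Naive action representation to form
# the model input. Separators and ordering here are illustrative choices.

def build_input(history, action_repr, k):
    """history: list of utterances, oldest first; k: number of context turns."""
    context = history[-k:] if k > 0 else []
    return " ".join(context + [action_repr])

history = [
    "I want a greek restaurant.",
    "How about the restaurant Opa!?",
    "Sounds good, book a table.",
]
print(build_input(history, "inform ( time = 7 pm )", k=1))
# -> "Sounds good, book a table. inform ( time = 7 pm )"
```

With k = 1, only the previous user utterance is included, which already accounts for most of the observed BLEU gain.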
The evaluation for k >= 2 is not completely realistic, because we used the ground truth system utterances in the context during evaluation, as opposed to the utterances generated by the NLG model itself. Regardless, the improvements clearly point to the effectiveness of the added context, at the cost of more resources. We hope these results inspire more work in this exciting direction.

Conclusion and Future Work
In this work, we proposed schema guided and template guided input representation schemes for task oriented response generation. Coupled with pre-trained language models, the template guided approach enables zero-shot generalization to new domains with little effort. Moreover, we show that it can lead to a drastic reduction in annotation costs. We also present the first set of results on the multi-domain SGD dataset, which we hope will pave the way for further research in few-shot, zero-shot and multi-domain language generation.
While in this paper we use standard pre-trained models, designing pre-training tasks tailored to sentence fusion is an interesting line of future work. We also hope to apply T2G2 to languages other than English. Obtaining annotated data in non-English languages is an even bigger challenge, making the sample efficiency of our template rewriting approach especially suited to this setting. Another interesting line of future work is to investigate the use of T2G2 for generating user utterances, which could be useful for dialogue data augmentation and user simulation. This requires adding the ability to generate utterances with stylistic variations to capture different user personalities while maintaining consistency in style and vocabulary over a single dialogue.

A Additional Experiment Details
All models are trained on a 4x4 TPU slice, each taking 1-3 hours to finish training for 5000 steps. We provide development set BLEU scores in Tables 9 and 10. These scores are computed on the entire development set which includes both seen and unseen domains. In Table 11, we list the exact performance numbers for the few-shot NLG experiments.

B Automatic Metrics
Prior work has used different metrics for different benchmarks. Moreover, for the same metric (e.g. BLEU), different implementations are used. For fair comparison, for each dataset we report results using the implementation used in prior work. For E2E, we use the implementation from the e2e-metrics suite. Taking inspiration from MT, where sacrebleu has become the standard implementation in the machine translation literature, the BLEU scores on experiments involving the SGD dataset are computed using sacrebleu. We urge the NLG community to also converge upon a single implementation of BLEU.

C Transfer Learning Across Domains
To measure the amount of transfer learning from one domain to another, we evaluate each domain-specific model trained in Section 7.1 on all the domains and observe domain-specific metrics. Results can be found in Tables 12 and 13.

D Templates
The templates we defined are listed in Tables 14 and 15.

F Example Outputs
Sample utterances generated using the different models for various domains are shown in the examples below. The system actions, their template based representation used by the T2G2 model as input, and the reference response are also provided. The predictions are from models trained on the full SGD dataset and without any dialogue history context. The unseen domains have been marked with an asterisk.

Figure 7: Example of a human rater task to evaluate naturalness. Each row represents the output from one of Naive, Schema, T2G2 and Ground Truth. The order of rows is shuffled across different tasks.

Figure 8: Example of a human rater task to evaluate informativeness. Each row represents the output from one of Naive, Schema, T2G2 and Ground Truth. The order of rows is shuffled across different tasks.

Hotels
Template
Successfully booked the hotel. No, smoking is not allowed on the property.

Reference
Your reservation is booked. No, smoking is not allowed on the property at this hotel.

Naive
Your reservation has been made. Unfortunately, the hotel does not allow you to make the reservation.

SlotDesc
Your reservation has been made. Unfortunately, they do not allow smoking.

T2G2
I have booked the hotel. No, smoking is not allowed on the property.

Homes
Actions
confirm(property name=Almaden lake Apartments) confirm(visit date=March 13th)
Template
Please confirm the following details: You are scheduling a visit to Almaden Lake Apartments and you want to visit the property on March 13th.

Reference
Please confirm details to schedule a visit to Almaden lake Apartments on March 13th.
Naive
Please confirm the following details: You are scheduling a visit to Almaden lake Apartments on March 13th.

SlotDesc
Please confirm: You want to visit Almaden lake Apartments on March 13th.

T2G2
Please confirm the following details: You are scheduling a visit to Almaden Lake Apartments on March 13th.

Media
Actions
inform(starring=Antonio Bustorff)
Template
Antonio Bustorff acted in it.

Reference
The actor is Antonio Bustorff.

Naive
The movie is directed by Antonio Bustorff.

SlotDesc
The movie is directed by Antonio Bustorff.

T2G2
Antonio Bustorff acted in the movie.

Music
Actions
confirm(track=strangers) confirm(device=Patio)
Template
Please confirm the following details: playing the song Strangers in Patio
Reference
Please confirm: play strangers in patio?
Naive
Please confirm the following details: Play strangers on the Patio.

SlotDesc
Please confirm: play strangers on Patio.

T2G2
Please confirm the following details: playing Strangers in Patio.

Buses
Actions
request(from city) request(departure date)
Template
Where are you leaving from? When do you plan to leave?
Reference
What is your departure date and city?
Naive
Where are you leaving from and when do you plan to leave?
SlotDesc
Where are you leaving from and when do you plan to leave?
T2G2
Where are you leaving from and when do you plan to leave?

Figure 9: A few examples of utterances generated from different models, along with the system actions, their template based representation (as shown in Figure 1) and the reference output. The unseen domains have been marked with an asterisk.

Events
Actions
confirm(event name=Mxmtoon) confirm(number of tickets=2) confirm(date=March 10th) confirm(city=New York)
Template
Please confirm the following details: Booking tickets for Mxmtoon. You have 2 tickets in your reservation. The date of it is March 10th. The event is taking place in New York.

Reference
Can you please confirm that you want to buy 2 tickets for the Mxmtoon event which happens on March 10th in New York?
Naive
Please confirm: 2 tickets for Mxmtoon on March 10th in New York.

SlotDesc
Please confirm: 2 tickets to Mxmtoon on March 10th in New York.

T2G2
Please confirm: 2 tickets for Mxmtoon on March 10th in New York.
Alarm *
Actions
offer(alarm time=4 pm) offer(alarm name=Event)
Template
There is an alarm for 4 pm with name Event.

Reference
A 4 pm alarm with name Event.

Naive
There is an event at 4 pm.

SlotDesc
There is an event at 4 pm.

T2G2
There is an alarm for 4 pm with event name.
Alarm *
Actions
offer(alarm time=10:30 am) offer(alarm name=commute) inform count(count=1)
Template
You have 1 alarms currently set. There is an alarm for 10:30 am with name Commute.

Reference
You have 1 alarm at 10:30 am with name commute.
Naive
I found 1 suitable for you. It's at 10:30 am and it's a commute.
SlotDesc
I found 1 alarm at 10:30 am. It's a commute.

T2G2
You have 1 alarm for 10:30 am with the name Commute.

Weather
Actions
inform(humidity=30)
Template
The humidity is around 30 percent.

Reference
The humidity is about 30 percent.
Naive
It's 30 minutes.

SlotDesc
The humidity is 30.

T2G2
The humidity is around 30 percent.

Weather
Actions
inform(humidity=75) inform(wind=4)
Template
The humidity is around 75 percent. The average wind speed should be 4 miles per hour.

Reference
The humidity for today is around 75 percent. The average wind is 4 miles an hour.
Naive
It's a 7 star, and it's a 4 star.

SlotDesc
The humidity is 75 and the wind speed is 4 miles per hour.

T2G2
The average wind speed should be 4 miles per hour and the humidity is around 75 percent.

Messaging *
Actions
notify success
Template
You've successfully shared the location.

Reference
The location was successfully shared.

Naive
Your reservation has been made.

SlotDesc
Your contact has been confirmed.

T2G2
Your location has been shared.
Messaging *
Actions
confirm(location=2190 Bancroft Way) confirm(contact name=Peter)
Template
Please confirm the following details: You're sharing the location 2190 Bancroft Way with Peter.

Reference
So share the location 2190 Bancroft Way with Peter?
Naive
Please confirm the following details: You are scheduling a visit to Peter at 2190 Bancroft Way.

SlotDesc
Please confirm the following details: You would like to contact Peter at 2190 Bancroft Way.

T2G2
Please confirm the following details: You're sharing the location 2190 Bancroft Way with Peter.

Reference
There is a train leaving at 7:10 am and costs $91.
Naive
There is a bus that departs at 7:10 am and costs $91.

SlotDesc
There is a 7:10 am train that costs $91.

T2G2
How about the 7:10 am train? It costs $91 in total.

Travel
Actions
offer(attraction name=BODY WORLDS London) offer(category=Museum)
Template
You should check out BODY WORLDS London. This is a Museum.

Reference
I suggest a museum called BODY WORLDS London.
Naive
BODY WORLDS London is a Museum.

SlotDesc
BODY WORLDS London is a museum.

T2G2
BODY WORLDS London is a museum.