Few-shot Natural Language Generation for Task-Oriented Dialog

As a crucial component in task-oriented dialog systems, the Natural Language Generation (NLG) module converts a dialog act represented in a semantic form into a response in natural language. The success of traditional template-based or statistical models typically relies on heavily annotated data, which is infeasible for new domains. Therefore, it is pivotal for an NLG system to generalize well with limited labelled data in real applications. To this end, we present FewshotWOZ, the first NLG benchmark to simulate the few-shot learning setting in task-oriented dialog systems. Further, we develop the SC-GPT model. It is pre-trained on a large annotated NLG corpus to acquire the controllable generation ability, and fine-tuned with only a few domain-specific labels to adapt to new domains. Experiments on FewshotWOZ and the large Multi-Domain-WOZ datasets show that the proposed SC-GPT significantly outperforms existing methods, as measured by various automatic metrics and human evaluations.


Introduction
Task-oriented dialog systems are becoming increasingly popular, as they can assist users in various daily activities such as ticket booking and restaurant reservations. In a typical task-oriented dialog system, the Natural Language Generation (NLG) module plays a crucial role: it converts a system action (often specified in a semantic form selected by a dialog policy) into a final response in natural language. Hence, the response should be adequate to represent semantic dialog actions, and fluent to engage users' attention. As the ultimate interface that interacts with users, NLG has a significant impact on the users' experience.
Existing methods for NLG can be broadly summarized into two major categories. (i) Template-based methods require domain experts to handcraft templates for each domain, and the system fills in slot-values afterward (Cheyer and Guzzoni, 2014; Langkilde and Knight, 1998). Thus, the produced responses are often adequate to contain the required semantic information, but not always fluent and natural, hurting users' experiences. (ii) Statistical language models such as neural networks learn to generate fluent responses by training on labelled corpora. One canonical model is the semantically conditioned LSTM (SC-LSTM) (Wen et al., 2015b), which encodes dialog acts with one-hot representations and uses them as an extra feature to inform the sentence generation process. Despite its good performance on simple domains, it requires large amounts of domain-specific annotated data, which is not available for many domains in real-world applications. Worse, this leads to severe scalability issues, since the number of possible combinations of dialog acts grows exponentially with the number of slots in more complex domains.
We revisit the current research benchmarks for NLG, and notice that each dialog domain is extensively labelled to favor model training. However, this is in contrast to the real-world application scenarios, where only very limited amounts of labelled data are available for new domains. To simulate such a few-shot learning setting, we have developed a new benchmark dataset, called FEWSHOTWOZ, based on the MultiWOZ (Budzianowski et al., 2018) and Cambridge NLG datasets (Wen et al., 2016a). FEWSHOTWOZ consists of dialog utterances from 7 domains. For each domain, we provide fewer than 50 labeled utterances for fine-tuning. We believe that FEWSHOTWOZ can better inspire research to address the challenge of learning data-hungry statistical models with very limited amounts of labelled data in real-world scenarios.
To deal with the challenge of few-shot learning, we develop the SC-GPT model. SC-GPT is a multi-layer Transformer neural language model, trained in three steps: (i) pre-trained on plain text, similar to GPT-2 (Radford et al.); (ii) continuously pre-trained on large amounts of dialog-act labeled utterance corpora to acquire the ability of controllable generation; (iii) fine-tuned for a target domain using very limited amounts of domain labels. Unlike GPT-2, SC-GPT generates semantically controlled responses that are conditioned on the given semantic form, similar to SC-LSTM but requiring far fewer domain labels to generalize to new domains.
In summary, our key contributions are three-fold:
• A new benchmark FEWSHOTWOZ is introduced to simulate the few-shot adaptation setting where only a handful of training data from each domain is available.
• We propose a new model SC-GPT. To the best of our knowledge, this work is the first study of exploiting state-of-the-art pre-trained language models for NLG in task-oriented dialog systems.
• On the MultiWOZ dataset, SC-GPT sets a new state of the art, outperforming previous models by 4 points in BLEU. On FEWSHOTWOZ, SC-GPT outperforms several strong baselines such as SC-LSTM and HDSA, showing that SC-GPT adapts to new domains much more effectively, requiring much smaller amounts of in-domain labels.

Background
A typical task-oriented spoken dialog system uses a pipeline architecture, as shown in Figure 1 (a), where each dialog turn is processed using a four-step procedure. (i) Transcriptions of the user's input are first passed to the natural language understanding (NLU) module, where the user's intention and other key information are extracted. (ii) This information is then formatted as the input to dialog state tracking (DST), which maintains the current state of the dialog. (iii) Outputs of DST are passed to the dialog policy module, which produces a dialog act based on the facts or entities retrieved from external resources (such as a database or a knowledge base). (iv) The dialog act emitted by the dialog policy module serves as the input to the NLG, through which a system response in natural language is generated. In this paper, we focus on the NLG component of task-oriented dialog systems, i.e., how to produce natural language responses conditioned on dialog acts. Specifically, a dialog act A is defined as the combination of an intent I and P slot-value pairs $(s_i, v_i)$:
$$A = [\, I, (s_1, v_1), \cdots, (s_P, v_P) \,] \qquad (1)$$
where P is the number of pairs, which varies across dialog acts.
• Intents are usually used to distinguish different types of system actions. Typical examples include inform, request, confirm, select etc.
• Slot-value pairs indicate the category and content of the information to express in the utterance, respectively.
The goal of NLG is to translate A into a natural language response $x = [x_1, \cdots, x_T]$, where T is the sequence length. In Figure 1 (b), we show an example of the dialog act: confirm (name=Hilton, area=center), and the corresponding natural language response is "Let me confirm that you are searching for Hilton in the center area".
Figure 2: Illustration of SC-GPT. In this example, SC-GPT generates a new word token (e.g., "confirm" or "center") by attending to the entire dialog act and the word tokens on its left within the response.

Semantically Conditioned GPT
We tackle this generation problem using conditional neural language models. Given training data of N samples $D = \{(A_n, x_n)\}_{n=1}^{N}$, our goal is to build a statistical model parameterized by $\theta$ to characterize $p_\theta(x|A)$. To leverage the sequential structure of the response, one may further decompose the joint probability of x using the chain rule, casting an auto-regressive generation process as follows:
$$p_\theta(x|A) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}, A) \qquad (2)$$
where $x_{<t}$ indicates all tokens before t.
Learning θ is performed via maximizing the log-likelihood (MLE) of the conditional probabilities in (2) over the entire training dataset:
$$\mathcal{L}_\theta(D) = \sum_{n=1}^{N} \sum_{t=1}^{T_n} \log p_\theta(x_{t,n} \mid x_{<t,n}, A_n) \qquad (3)$$
where $T_n$ is the length of the n-th response $x_n$.
In this paper, we employ the Transformer (Vaswani et al., 2017) to parameterize the conditionals in (2). To enable strong generalization and controllable ability for the learned model, we propose the following three-stage procedure as the training recipe.
Massive Plain Language Pre-training. Large models trained on massive training corpora usually generalize better to new domains. Inspired by this, we inherit the GPT-2 architecture (Radford et al.) as the backbone language model. GPT-2 is an auto-regressive language model that leverages 12-24 layers of masked, multi-head self-attention Transformers. GPT-2 is pre-trained on the massive OpenWebText corpus (Radford et al.). It has demonstrated superior performance in characterizing the distribution of human language data and in knowledge transfer. Given text prompts, GPT-2 can often generate realistic sentences.
Dialog-Act Controlled Pre-training. To enable the guidance of the dialog act in response generation, we propose to continuously pre-train the GPT-2 model on large amounts of annotated (dialog act, response) pairs. The pre-training dataset includes annotated training pairs from the Schema-Guided Dialog corpus, MultiWOZ corpus, Frame corpus, and Facebook Multilingual Dialog Corpus. The total size of the pre-training corpus is around 400k examples.
We first pre-process the dialog act A into a sequence of control codes using the following format:
$$A' = [\, I \, (\, s_1 = v_1, \cdots, s_P = v_P \,) \,] \qquad (4)$$
with each slot $s_i$ paired with its value $v_i$. Meanwhile, the output sequence x' is obtained by augmenting x with a special start token [BOS] and an end token [EOS]. Finally, the sequentialized dialog act A' is concatenated with its augmented response x', and then fed into GPT-2. During training, the prediction loss is only computed for x', and A' provides the attended conditions. Since the dialog act represents the semantics of the generated sentences, we follow the naming convention of SC-LSTM, and term our model Semantically Conditioned Generative Pre-training (SC-GPT). The overall architecture of SC-GPT is illustrated in Figure 2.
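The pre-processing described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `DialogAct` class, the ` ; ` separator, the whitespace tokenizer, and the use of `None` to mark positions excluded from the loss are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class DialogAct:
    """Hypothetical container: an intent plus its slot-value pairs."""
    intent: str
    slots: List[Tuple[str, str]]


def linearize(act: DialogAct) -> str:
    """Serialize a dialog act into a control-code string: I ( s_1 = v_1 ; ... )."""
    pairs = " ; ".join(f"{slot} = {value}" for slot, value in act.slots)
    return f"{act.intent} ( {pairs} )"


def build_example(
    act: DialogAct,
    response: str,
    tokenize: Callable[[str], List[str]] = str.split,  # stand-in for a BPE tokenizer
) -> Tuple[List[str], List[Optional[str]]]:
    """Concatenate the serialized act with [BOS] response [EOS].

    The returned labels are None on dialog-act positions, so that only
    response tokens would contribute to the prediction loss.
    """
    act_tokens = tokenize(linearize(act))
    response_tokens = ["[BOS]"] + tokenize(response) + ["[EOS]"]
    tokens = act_tokens + response_tokens
    labels: List[Optional[str]] = [None] * len(act_tokens) + response_tokens
    return tokens, labels


act = DialogAct("confirm", [("name", "Hilton"), ("area", "center")])
tokens, labels = build_example(act, "Let me confirm that you are searching for Hilton")
```

In a real training loop the `None` entries would typically be mapped to whatever "ignore" value the loss function expects; that mapping is framework-specific and is left out here.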
Fine-tuning. For a new domain, a dialog act usually contains novel intents or slot-value pairs, and annotated training samples are often limited. We fine-tune SC-GPT on limited amounts of domain-specific labels for adaptation. The fine-tuning follows the same procedure as the dialog-act controlled pre-training described above, but uses only a few dozen domain labels.
It is worth noting that the above recipe has several favorable properties:
• Flexibility. SC-GPT operates on a sequence of tokens without delexicalization, which means that SC-GPT does not assume a fixed one-hot or tree-structured dialog act representation. Hence, it has great flexibility in extending to novel dialog acts.
• Controllability. In contrast to GPT-2 that generates natural sentences without high-level semantic guidance, SC-GPT can generate sentences with adequate intent and slot-value information and maintain its fluency.
• Generalizability. SC-GPT is able to generalize significantly better than SC-LSTM, due to the pre-training on massive plain text corpora and annotated dialog datasets.

Dataset: FEWSHOTWOZ
Revisiting NLG Benchmarks. The three commonly used NLG datasets in developing and evaluating task-oriented dialog systems are E2E NLG (Novikova et al., 2017), BAGEL (Mairesse et al., 2010) and RNNLG (Wen et al., 2016a), as summarized in Table 1. We observe two issues from their shared statistics: (i) All the datasets contain a large number of labelled training samples for each domain, ranging from hundreds to tens of thousands. However, the cost of labeling is high in practice, e.g., labeling 50 utterances takes 5 hours per domain. Creating such an extensively annotated dataset for each new domain is prohibitively expensive. (ii) The percentage of delexicalised dialog acts that differ between training and testing data is quite small. For example, the delexicalised dialog acts in testing are 100% covered by the training set for the E2E NLG dataset. This makes it difficult to evaluate a model's generalization ability for new domains.
FEWSHOTWOZ. To build a setting for more pragmatic NLG scenarios, we introduce a new dataset FEWSHOTWOZ to better reflect real application complexity, and encourage the community to develop algorithms that are capable of generalizing with only a few domain-specific labels for each (new) domain. The dataset statistics are shown in the last column of Table 1. We see that FEWSHOTWOZ is different from the other datasets in three aspects: (i) More domains. FEWSHOTWOZ contains seven domains in total, which is more than any existing NLG dataset. (ii) Fewer training instances. Importantly, FEWSHOTWOZ has a much smaller number of training instances per domain, aiming to evaluate the few-shot learning ability. (iii) Lower training/testing overlap. FEWSHOTWOZ has only 8.82% overlap, significantly smaller than the other datasets, which amount to more than 90% overlap. The average number of intents per instance in the Attraction, Taxi, and Train domains is 2, 1.33, and 2.05, respectively. In contrast, there is only one intent for each example in the other datasets. The NLG task defined on FEWSHOTWOZ requires the models to learn to generalize over new compositions of intents. The details of FEWSHOTWOZ are shown in Table 2.
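As a rough illustration of the training/testing overlap statistic discussed above, the following sketch computes the share of distinct delexicalised test dialog acts that also occur in the training set. The exact definition used for Table 1 is not spelled out in the text, so this formula, the function name, and the string form of the acts are assumptions.

```python
from typing import Iterable


def overlap_percentage(train_acts: Iterable[str], test_acts: Iterable[str]) -> float:
    """Percentage of distinct test dialog acts that also appear in training.

    Acts are assumed to be already delexicalised and serialized as strings.
    """
    train_set, test_set = set(train_acts), set(test_acts)
    if not test_set:
        return 0.0
    return 100.0 * len(test_set & train_set) / len(test_set)


train = ["inform(name)", "request(area)"]
test = ["inform(name)", "confirm(name,area)", "inform(name)", "select(food)"]
print(f"{overlap_percentage(train, test):.2f}% overlap")
```

Under this definition a dataset whose test acts are all covered by training scores 100%, which matches the E2E NLG case described above.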
Collection Protocols. We construct FEWSHOTWOZ via re-organizing data samples from the RNNLG and MultiWOZ datasets (Budzianowski et al., 2018). For each domain in RNNLG, we first group utterances according to their delexicalised dialog acts, and keep only one utterance as the target sentence. To ensure diversity, we consider three domains from MultiWOZ: Attraction, Taxi, and Train. Since MultiWOZ is a cross-domain dataset, the dialog act of an utterance may exist in multiple domains. We choose to keep utterances whose dialog act appears only in one domain. Similar delexicalising processing is applied to ensure that each dialog act has only one target utterance. Finally, to simulate the few-shot learning in practice, we randomly sample 50 training examples for each domain, except the Taxi domain, which has 40 examples.
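The grouping and sampling steps of this protocol can be sketched as follows. This is a simplified illustration under stated assumptions: the delexicalised key is taken to be the intent plus the sorted slot names, the first utterance in each group is the one kept, and the examples left over after sampling are treated as the test set. None of these details are confirmed by the text.

```python
import random
from typing import Dict, List, Tuple

Slots = List[Tuple[str, str]]
Example = Tuple[str, Slots, str]  # (intent, slot-value pairs, utterance)


def delex_key(intent: str, slots: Slots) -> Tuple[str, Tuple[str, ...]]:
    """Delexicalised dialog act: the intent plus slot names, values dropped."""
    return intent, tuple(sorted(name for name, _ in slots))


def build_few_shot_split(
    examples: List[Example], n_train: int = 50, seed: int = 0
) -> Tuple[List[Example], List[Example]]:
    """Keep one utterance per delexicalised act, then sample a small training set."""
    unique: Dict[Tuple[str, Tuple[str, ...]], Example] = {}
    for intent, slots, utterance in examples:
        # setdefault keeps the first utterance seen for each delexicalised act
        unique.setdefault(delex_key(intent, slots), (intent, slots, utterance))
    pool = list(unique.values())
    random.Random(seed).shuffle(pool)
    return pool[:n_train], pool[n_train:]


examples = [
    ("inform", [("name", "Hilton"), ("area", "center")], "u1"),
    ("inform", [("name", "Ritz"), ("area", "north")], "u2"),
    ("request", [("area", "?")], "u3"),
]
train, test = build_few_shot_split(examples, n_train=1)
```

Because every delexicalised act survives exactly once, an act that lands in the training split cannot reappear in the remainder, which is one simple way to keep training/testing overlap low.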

Related Work
Pre-trained Models. Pre-trained language models (PLMs) have substantially advanced the state-of-the-art across a variety of natural language processing tasks (Zellers et al., 2019). GPT-2 first investigated massive Transformer-based autoregressive language models with large-scale text data for pre-training. After fine-tuning, GPT-2 achieves drastic improvements on several generation tasks. One drawback of GPT-2 is the lack of high-level semantic controlling ability in language generation. To alleviate this issue, CTRL (Keskar et al., 2019) was introduced to train the model based on pre-defined codes such as text style, content description, and task-specific behavior, while Grover (Zellers et al., 2019) was proposed to generate news articles conditioned on authors, dates etc. Although conceptually similar to our SC-GPT, CTRL and Grover cannot be readily applied to NLG in task-oriented dialog systems, as the conditioning codes are quite different. Another controllable generation work for GPT-2 is PPLM (Dathathri et al., 2019), which provides a decoding scheme to guide the generation process using key-words or classifiers, without re-training the model. In this paper, we focus on pre-training an NLG model conditioned on finer-grained semantic dialog acts, which are more desirable for dialog systems.
Dialog. Various dialog systems have been developed (Gao et al., 2020; Zhu et al., 2019; Zhu, 2020; Mi et al., 2019). However, they all require large amounts of annotated data to reach satisfactory performance. A more realistic scenario is to require much less labeling and improve the sample efficiency of models. This is especially important when deploying the models to new domains, where dialog acts need to be labelled from scratch. Our paper aims to formally set up such a research scenario by proposing a new dataset, FEWSHOTWOZ, and a new model, SC-GPT.



Experiments
In this section, we evaluate the proposed SC-GPT on the FEWSHOTWOZ and MultiWOZ datasets to answer two research questions: (i) Is SC-GPT an effective model for strong generalization and controllability in dialog response generation? (ii) Does FEWSHOTWOZ meet the goal of effectively evaluating the generalization of NLG models in the few-shot learning setting?

Experimental Setup
Implementation details. The model was built upon Huggingface PyTorch Transformers (Wolf et al., 2019). We use GPT2-Medium with 345M parameters 7 as the initial checkpoint, and byte pair encodings (Sennrich et al., 2015) for the tokenization. A linear learning-rate scheduler with an initial rate of 5e-5 was used for both pre-training and fine-tuning. Adam (Kingma and Ba, 2014) with weight decay was used to optimize the parameters. For pre-training, the model was trained with a mini-batch size of 8 on a machine with 8 Nvidia V100 GPUs until no significant progress was observed on the validation loss or for up to 20 epochs, whichever came first. For fine-tuning on FEWSHOTWOZ, models were trained on each domain separately for five epochs.
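For readers unfamiliar with linear learning-rate schedules, the sketch below shows one common form. The text only states that the schedule is linear and starts at 5e-5; decaying to zero over the whole run and using no warm-up are assumptions made here, and the function name is hypothetical.

```python
def linear_lr(step: int, total_steps: int, start_lr: float = 5e-5) -> float:
    """Learning rate at `step`, decaying linearly from start_lr to 0 (assumed form)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    remaining = max(0.0, 1.0 - step / total_steps)
    return start_lr * remaining


# Learning rate at the start, midpoint, and end of a 100-step run.
schedule = [linear_lr(s, 100) for s in (0, 50, 100)]
```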
Automatic metrics. Following Wen et al. (2015b), BLEU scores and the slot error rate (ERR) are reported. BLEU evaluates how natural the generated utterance is by comparing it with human-written references. ERR measures the exact matching of the slot tokens in the candidate utterances. ERR = (p + q)/M, where M is the total number of slots in the dialog act, and p and q are the numbers of missing and redundant slots in the given realisation, respectively. For each dialog act, we generate five utterances and select the one with the lowest ERR as the final output.
7 We also experimented with GPT-2 with 117M parameters but observed significantly poorer performance.
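The slot error rate and the pick-the-best-of-five step can be illustrated with a short sketch. It is not the evaluation script used in the paper: slot matching is approximated by case-insensitive substring search on slot values, and redundant slots are detected against a hypothetical list of known values (`known_values`), both of which are assumptions.

```python
from typing import Sequence


def slot_error_rate(
    required: Sequence[str], utterance: str, known_values: Sequence[str]
) -> float:
    """ERR = (p + q) / M, with p missing and q redundant slot values.

    Assumes the dialog act has at least one slot (M > 0).
    """
    text = utterance.lower()
    wanted = {v.lower() for v in required}
    others = {v.lower() for v in known_values} - wanted
    missing = sum(1 for v in wanted if v not in text)      # p
    redundant = sum(1 for v in others if v in text)        # q
    return (missing + redundant) / len(wanted)


def pick_best(
    required: Sequence[str], candidates: Sequence[str], known_values: Sequence[str]
) -> str:
    """Return the candidate with the lowest ERR (earliest one on ties)."""
    return min(candidates, key=lambda u: slot_error_rate(required, u, known_values))


required = ["Hilton", "center"]
known = ["Hilton", "Marriott", "center", "north"]
bad = "Let me confirm that you want Hilton in the north area"
good = "Let me confirm that you are searching for Hilton in the center area"
best = pick_best(required, [bad, good], known)
```

In this toy case the first candidate misses "center" and adds "north", so its ERR is (1 + 1)/2, while the second candidate has no errors and is selected.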
Human evaluation. We conducted the human evaluation using Amazon Mechanical Turk to assess subjective quality. We recruit master level workers (who have good prior approval rates) to perform a human comparison between generated responses from two systems (which are randomly sampled from the comparison systems). The workers are required to judge each utterance from 1 (bad) to 3 (good) in terms of informativeness and naturalness. Informativeness indicates the extent to which the generated utterance contains all the information specified in the dialog act. Naturalness denotes whether the utterance is as natural as one a human would produce. To reduce judgement bias, we distribute each question to three different workers. Finally, we collected a total of 5800 judgements.
Baselines. We compare with three baseline methods. (i) SC-LSTM (Wen et al., 2015b) is a canonical model and a strong baseline that uses an additional dialog act vector and a reading gate to guide the utterance generation. (ii) GPT-2 (Radford et al.) is directly fine-tuned on the domain-specific labels, without pre-training on the large-scale corpus of (dialog act, response) pairs. (iii) HDSA is a strong competitor that leverages graph/tree-structure information to encode dialog acts.



FEWSHOTWOZ
Table 3 reports the automatic evaluation performance of different methods on FEWSHOTWOZ. SC-LSTM fails to learn the generation effectively in this few-shot learning setting. The generated utterances are poor in quality and suffer from inaccurate slot rendering. In addition, GPT-2 performs consistently better than SC-LSTM in all the domains. This reveals the feasibility of using a pre-trained language model for NLG, even though only limited annotations are available for fine-tuning. Importantly, SC-GPT performs significantly better than GPT-2 and SC-LSTM in terms of both BLEU and ERR. In all the domains, SC-GPT reduces the ERR to a significantly lower level, revealing its strong controllability. This verifies the importance of pre-training on large annotated dialog data, as SC-GPT learns how to generate utterances specified by the dialog acts accurately.
Table 4 shows the human assessment on FEWSHOTWOZ. The results exhibit the same trend as the automatic evaluation. SC-GPT outperforms GPT-2 and SC-LSTM significantly in both metrics, i.e., SC-GPT can better control the generation to convey the information in the dialog act while maintaining good fluency. Note that the gap between SC-GPT and human annotation is still large, indicating that the proposed FEWSHOTWOZ exposes an under-explored research area, and leaves considerable room for future research.

MultiWOZ
The results on MultiWOZ are shown in Table 5. Following prior work, Entity F1 (Wen et al., 2016b) is used to evaluate the entity coverage accuracy (including all slot values, days, numbers, and reference, etc.). Again, SC-GPT achieves the best performance on BLEU score. Note that GPT-2 performs similarly to SC-GPT on the full MultiWOZ dataset. This is because MultiWOZ contains 57k utterances, which is large enough for GPT-2 to achieve good performance. The results also confirm that with enough annotated data, the conditional language model formulation performs significantly better than HDSA, a strong competitor that leverages graph/tree-structure information to encode dialog acts.
To study how SC-GPT performs with different training data sizes, we further conduct experiments with varying percentages of training data on MultiWOZ, ranging from 0.1% (50 examples) to 50%. As shown in Table 6, the observations are consistent with FEWSHOTWOZ. SC-GPT performs consistently better than GPT-2, HDSA, and SC-LSTM for a wide range of dataset sizes, and the improvement is more substantial when fewer in-domain labels are used for fine-tuning.
Table 7 shows the human assessment results on MultiWOZ. The results are consistent with the automatic evaluation. It is interesting to see that (i) the gap between the new state-of-the-art method (i.e., SC-GPT) and human performance on FEWSHOTWOZ (as shown in Table 4) is much larger than that on MultiWOZ; (ii) the human rating on the naturalness of SC-GPT is even higher than that of humans on MultiWOZ, while there is a visible gap on FEWSHOTWOZ. These results demonstrate that FEWSHOTWOZ presents a challenging few-shot learning setting, that SC-GPT serves as a simple and strong baseline in this setting, and that together they provide a platform for researchers to develop NLG models that are able to generalize to new domains and generate semantically conditioned and fluent responses.
Table 7: Human evaluation on MultiWOZ. Statistical significance was computed with a two-tailed t-test between SC-GPT and HDSA.

Analysis
We perform a detailed analysis to investigate SC-GPT's flexibility, controllability and generalizability. The test set is split into two subsets: seen and unseen. If the dialog act of an example appears in the training set, the example is marked as seen; otherwise, it is marked as unseen. Table 9 compares different models on the seen and unseen subsets in the restaurant domain. SC-GPT yields higher BLEU and lower ERR, and the improvement is more significant on the unseen set. For example, SC-GPT reduces ERR to 4.96, an order of magnitude lower than SC-LSTM and only 1/3 of that of GPT-2. This demonstrates that SC-GPT generalizes well to novel dialog acts, and is able to precisely ground in them to compose fluent responses. This is further confirmed by the qualitative comparison in Table 8, where we compare the generated utterance examples of different models. While the baseline methods are prone to over-generate or miss important slots, SC-GPT can successfully generate fluent natural language utterances that share precise semantic conditions with the ground-truth references.
We further simulate the process of deploying SC-GPT for a new domain, using the examples provided in the RASA dialog toolkit. We first fine-tune SC-GPT using a few training examples (only 16 instances in this new domain), and then generate utterances based on novel dialog acts that are unseen in the training data. Table 10 shows some examples of generated utterances with novel dialog acts. In practice, it is desirable for an NLG system to deal with an extending domain whose dialog acts change dynamically. We simulate the setting by editing the original input dialog acts, such as inserting or deleting a slot, or substituting a slot value.
Since SC-LSTM is infeasible in the setting of an extending domain, we compare SC-GPT with GPT-2. Results show that SC-GPT produces better utterances than GPT-2. SC-GPT can generate reasonably good natural language responses with different combinations of editing operations, showing its high flexibility to generalize to new dialog acts with very limited training data.
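The three editing operations used to simulate an extending domain (inserting a slot, deleting a slot, substituting a slot value) can be sketched as plain list transformations. The helper names and the list-of-pairs representation are hypothetical; the slot names and values are taken from the restaurant reservation example in Table 10.

```python
from typing import List, Tuple

Slots = List[Tuple[str, str]]


def insert_slot(slots: Slots, name: str, value: str) -> Slots:
    """Return a copy of the dialog act's slots with a new slot appended."""
    return slots + [(name, value)]


def delete_slot(slots: Slots, name: str) -> Slots:
    """Return a copy with every slot called `name` removed."""
    return [(n, v) for n, v in slots if n != name]


def substitute_value(slots: Slots, name: str, value: str) -> Slots:
    """Return a copy where slot `name` carries a new value."""
    return [(n, value if n == name else v) for n, v in slots]


def render(intent: str, slots: Slots) -> str:
    """Serialize in the style of the dialog acts listed in Table 10."""
    return f"{intent} ( " + " ; ".join(f"{n} = {v}" for n, v in slots) + " )"


original = [
    ("price", "moderate"),
    ("restaurantcusine", "chinese"),
    ("location", "center"),
    ("peoplecount", "2"),
]
edited = substitute_value(original, "restaurantcusine", "japanese")
edited = substitute_value(edited, "location", "lincoln square")
edited_act = render("ack makereservation", edited)
```

Each helper returns a new list rather than mutating its input, so the original dialog act stays available for comparison with the edited one.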

Table 10: Examples of generated utterances with novel dialog acts. SC-GPT produces better utterances than GPT-2 for edited dialog acts. Since both models produce similar responses to references for the original dialog act, the results are not shown here. (Better viewed in color: insert a slot, substitute a slot value, delete a slot.)

Original DA: ack makereservation ( price = moderate ; restaurantcusine = chinese ; location = center ; peoplecount = 2 )
Reference: ok making a reservation for restaurant moderate chinese in center for two people?
1 Input DA: ack makereservation ( price = moderate ; restaurantcusine = japanese ; location = lincoln square ; peoplecount = 2 )
GPT-2: ok making reservation for moderate japanese restaurant in lincoln square for 2 people?
SC-GPT: ok making a reservation for a moderate japanese cuisine restaurant in lincoln square for 2 people?
SC-GPT: ok making a reservation for sakura, moderate japanese cuisine in lincoln square for 2 people?

Original DA: ack makereservation ( childrenfriendly = true )
Reference: make reservation for a children friendly restaurant ?
SC-GPT: make reservation for restaurant sakura moderate price and children friendly restaurant?

Conclusion
In this paper, we have made two major contributions towards developing a more pragmatic NLG module for task-oriented dialog systems: (i) A new benchmark FEWSHOTWOZ is introduced to simulate the few-shot learning scenarios with scarce labelled data in real-world applications. (ii) A new model SC-GPT is proposed to endow the NLG module with strong semantic control and generalization abilities. Empirical results on both FEWSHOTWOZ and MultiWOZ show that SC-GPT achieves the best overall performance in both automatic and human evaluations. There are two interesting directions for future work. The first is to design mechanisms to generate more interpersonal responses, which have been shown to help improve user experiences (Li et al., 2016). The other is to generalize the generative pre-training idea to all four modules in the dialog system pipeline for end-to-end training. Since these four modules process information in order, one may organize their input/output as segments, and pre-train a segment-level autoregressive model.