A Comparative Study on Schema-Guided Dialogue State Tracking

Frame-based state representation is widely used in modern task-oriented dialog systems to model user intentions and slot values. However, a fixed design of domain ontology makes it difficult to extend to new services and APIs. Recent work proposed using natural language descriptions, rather than tag names, to define each intent and slot in the domain ontology, thus offering a dynamic set of schemata. In this paper, we conduct in-depth comparative studies to understand the use of natural language schema descriptions in dialog state tracking. Our discussion covers three aspects: encoder architectures, the impact of supplementary training, and effective schema description styles. We introduce a set of newly designed benchmarking descriptions and reveal model robustness under both homogeneous and heterogeneous description styles in training and evaluation.


Introduction
From the early frame-driven dialog system GUS (Bobrow et al., 1977) to virtual assistants (Alexa, Siri, Google Assistant, etc.), frame-based dialog state tracking has long been studied to meet various challenges. In particular, how to support an ever-increasing number of services and APIs spanning multiple domains has been a focal point in recent years, evidenced by multi-domain dialog modeling (Budzianowski et al., 2018; Byrne et al., 2019; Shah et al., 2018a) and dialog state tracking transferable to unseen intents/slots (Mrkšić et al., 2017; Wu et al., 2019; Hosseini-Asl et al., 2020).
Recently, Rastogi et al. (2019) proposed a new paradigm called schema-guided dialog for transferable dialog state tracking, which uses natural language descriptions to define a dynamic set of service schemata. As shown in Figure 1 (an example dialog from the Restaurant_1 service, along with its service/intent/slot descriptions and dialog state representation), the primary motivation is that these descriptions can offer effective knowledge sharing across different services, e.g., connecting semantically similar concepts across heterogeneous APIs, thus allowing a unified model to handle unseen services and APIs. With the publicly available schema-guided dialog dataset (SG-DST henceforward) as a testbed, they organized a state tracking shared task composed of four subtasks: intent classification (Intent), requested slot identification (Req), categorical slot labeling (Cat), and non-categorical slot labeling (NonCat) (Rastogi et al., 2020). Many participants achieved promising performance by exploiting the schema descriptions for dialog modeling, especially on unseen services. (*Work done while Jie Cao was an intern at Amazon.)
Despite the novel approach and promising results, the current schema-guided dialog state tracking task is evaluated only on a single dataset with limited variation in schema definition. It is unknown how this paradigm generalizes to other datasets and other styles of descriptions. In this paper, we focus our investigation on three aspects of schema-guided dialog state tracking: (1) schema encoding model architectures, (2) supplementary training on intermediate tasks, and (3) various styles of schema description. To make the discussion more general, we perform extensive empirical studies on both the SG-DST and MULTIWOZ 2.2 datasets. In summary, our contributions include:
• A comparative study on schema encoding architectures, suggesting a partial-attention encoder for a good balance between inference speed and accuracy.
• An experimental study of supplementary training on schema-guided dialog state tracking, via intermediate tasks including natural language inference and question answering.
• An in-depth analysis of different schema description styles on a new suite of benchmarking datasets with variations in schema description for both SG-DST and MULTIWOZ 2.2.

Schema-Guided Dialog State Tracking
A classic dialog state tracker predicts a dialog state frame at each user turn given the dialog history and a predefined domain ontology. As shown in Figure 1, the key difference between schema-guided dialog state tracking and the classic paradigm is the newly added natural language descriptions. In this section, we first introduce the four subtasks and the schema components in schema-guided dialog state tracking, then we outline the research questions of our paper.
Subtasks. As shown in Figure 1, the dialog state for each service consists of 3 parts: active intent, requested slots, and user goals (slot values). Without loss of generality, for both the SG-DST and MULTIWOZ 2.2 datasets, we divide the slots into categorical and non-categorical slots, following the previous study on dual strategies (Zhang et al., 2019). Thus, to fill the dialog state frame for each user turn, we solve four subtasks: intent classification (Intent), requested slot identification (Req), categorical slot labeling (Cat), and non-categorical slot labeling (NonCat). All subtasks require matching the current dialog history against candidate schema descriptions multiple times.
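Concretely, the per-turn dialog state frame described above can be sketched as a small Python structure. Field names here are illustrative and may differ from the datasets' actual JSON keys:

```python
def new_frame(service):
    """An empty per-turn dialog state frame for one service,
    with the three parts named in the text."""
    return {
        "service": service,
        "active_intent": "NONE",   # one of the service's intents, or NONE
        "requested_slots": [],     # slots the user asked the system about
        "slot_values": {},         # user goal: slot -> value(s)
    }

# Mirroring the Figure 1 example from the Restaurant_1 service:
frame = new_frame("Restaurant_1")
frame["active_intent"] = "ReserveRestaurant"
frame["requested_slots"].append("has_live_music")
frame["slot_values"]["cuisine"] = ["Italian"]
```

Each of the four subtasks fills one part of this frame for the current turn.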
Schema Components. Figure 1 shows three main schema components: service, intent, and slot. For each intent, the schema also lists its optional and required slots. For each slot, a flag indicates whether it is categorical. Categorical means there is a set of predefined candidate values (Boolean, numeric, or text); for instance, has_live_music in Figure 1 is a categorical slot with Boolean values. Non-categorical, on the other hand, means the slot values are filled from string spans in the dialog history.
New Questions. The added schema descriptions pose three new questions, which we discuss in the following sections. Similar to prior comparisons of sentence-pair encoders, we conduct our comparative study on the two typical architectures, Cross-Encoder (Bordes et al., 2014; Lowe et al., 2015) and Dual-Encoder (Wu et al., 2017; Yang et al., 2018). However, that line of work focuses only on sentence-level matching tasks. All subtasks in our case require sentence-level matching between the dialog context and each schema, while the non-categorical slot filling task also needs a sequence of token-level representations for span detection. Hence, we study multi-sentence encoding for both sentence-level and token-level tasks. Moreover, to share the schema encoding across subtasks and turns, we also introduce a simple Fusion-Encoder that caches schema token embeddings (§5.1), which improves efficiency without sacrificing much accuracy. Prior work has also modeled DST as language generation (Peng et al., 2020; Hosseini-Asl et al., 2020) or as a question answering task (Zhang et al., 2019; Lee et al., 2019; Gao et al., 2019, 2020). Our work is most similar to the last class; however, we further investigate whether DST can benefit from NLP tasks other than question answering.
Furthermore, without rich descriptions for the service/intent/slot in the schema, previous work mainly uses simple question-answering formats, such as domain-slot compounded names (e.g., "restaurant-food") or the simple question template "What is the value for slot_i?". We incorporate these different description styles into a comparative discussion in §7.1.

Datasets
To the best of our knowledge, at the time of our study, SG-DST and MULTIWOZ 2.2 are the only two publicly available corpora for schema-guided dialog study, and we use both.
In this section, we first introduce these two representative datasets, then we discuss their generalizability in terms of domain diversity, function overlap, and data collection methods.
Schema-Guided Dialog Dataset. The SG-DST dataset is specifically designed as a test-bed for schema-guided dialog and contains well-designed heterogeneous APIs with overlapping functionalities between services (Rastogi et al., 2019). In DSTC8 (Rastogi et al., 2020), SG-DST was introduced as the standard benchmark dataset for schema-guided dialog research. SG-DST covers 20 domains, 88 intents, and 365 slots. However, previous research is mainly conducted on this single dataset with the single provided description style. In this paper, we further extend this dataset with other benchmarking description styles as shown in §7, and then perform both homogeneous and heterogeneous evaluation on it.
Remixed MultiWOZ 2.2 Dataset. To eliminate potential bias from the single SG-DST dataset, we further add MULTIWOZ 2.2 (Zang et al., 2020) to our study. Among the various extended versions of the MultiWOZ dataset (2.0-2.3; Budzianowski et al., 2018; Eric et al., 2020; Zang et al., 2020; Han et al., 2020), besides rectifying annotation errors, MULTIWOZ 2.2 also introduced schema-guided annotations, covering 8 domains, 19 intents, and 36 slots. To evaluate performance on seen/unseen services with MultiWOZ, we remix the MULTIWOZ 2.2 dataset: during training, dialogs related to restaurant, attraction, and train serve as seen services, and slots from other domains/services are eliminated from the training split. For dev, we add two new domains, hotel and taxi, as unseen services. For test, we add all remaining domains as unseen, including those with minimal overlap with the seen services, such as hospital, police, and bus. The statistics of the data splits are shown in Appendix A.2. Note that this data split differs from previous work on zero-shot MultiWOZ DST, which takes a leave-one-out approach (Wu et al., 2019). By remixing the data in the way described above, we can evaluate zero-shot performance on MultiWOZ in a way largely compatible with SG-DST.
Discussion.
First, the two datasets cover diverse domains. MULTIWOZ 2.2 covers various dialog scenarios, ranging from requesting basic information about attractions to booking a hotel room or traveling between cities, while SG-DST covers more domains, such as 'Payments', 'Calendar', and 'DoctorServices'. Second, they are collected by two different approaches, both commonly used for dialog collection. SG-DST was first collected by machine-to-machine self-play (M2M; Shah et al., 2018b) with dialog flows as seeds and then paraphrased by crowdworkers, while the MULTIWOZ 2.2 dialogs are human-to-human (H2H; Kelley, 1984), collected with the Wizard-of-Oz approach.
We summarize the above discussion in Table 1. We believe that results derived from these two representative datasets can guide future research in schema-guided dialog.
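The MULTIWOZ 2.2 remix described above can be sketched as a simple domain filter. The dialog format and the exact assignment policy here are simplifications for illustration:

```python
SEEN = {"restaurant", "attraction", "train"}   # seen services (training)
DEV_UNSEEN = {"hotel", "taxi"}                 # unseen services added for dev
# test additionally gets all remaining domains, e.g. hospital, police, bus

def split_dialogs(dialogs):
    """Assign each dialog to train/dev/test by the set of domains it touches.
    A dialog goes to train only if every domain it uses is a seen service."""
    train, dev, test = [], [], []
    for d in dialogs:
        domains = set(d["domains"])
        if domains <= SEEN:
            train.append(d)
        elif domains <= SEEN | DEV_UNSEEN:
            dev.append(d)
        else:
            test.append(d)
    return train, dev, test
```

Under this sketch, a restaurant-only dialog is seen (train), a train+hotel dialog lands in dev, and any dialog touching a remaining domain such as police lands in test.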

Dialog & Schema Representation and Inference (Q1)
In this section, we focus on the model architecture for matching the dialog history with schema descriptions using pretrained BERT (Devlin et al., 2019). (We use BERT-base-cased for all main experiments; other pretrained language models can be easily adapted to our study.) To support the four subtasks, we first extend Dual-Encoder and Cross-Encoder to support both sentence-level matching and token-level prediction. Then we propose an additional Fusion-Encoder strategy for faster inference without sacrificing much accuracy. We summarize the different architectures in Figure 2 (Dual-Encoder, Cross-Encoder, and Fusion-Encoder; the shaded blocks are cached during training). Finally, we present the classification head and results for each subtask.

Encoder Architectures
Dual-Encoder. This architecture consists of two separate BERTs that encode the dialog history and the schema description respectively, as in Figure 2(a). We follow the setting of the official baseline provided by DSTC8 Track 4 (Rastogi et al., 2020). We first use a fixed BERT to encode each schema description once and cache the encoded schema CLS_S. For the sentence-level representation, we concatenate the dialog history representation CLS_D and the candidate schema representation CLS_S into a single pair representation, denoted CLS_DE. For the token-level representation, we concatenate the candidate schema CLS_S with each token embedding in the dialog history, denoted TOK_DE. Because the candidate schema embeddings are encoded independently of the dialog context, they can be precomputed once and cached for fast inference.
Cross-Encoder. Another popular architecture, shown in Figure 2(b), is the Cross-Encoder, which concatenates the dialog and schema into a single input and encodes them jointly with a single self-attentive encoder spanning the two segments. When BERT encodes the concatenated sentence pair, it performs full (cross) self-attention in every transformer layer, thus offering rich interaction between the dialog and schema. BERT naturally produces a summarized representation via the [CLS] embedding CLS_CE, along with schema-attended dialog token embeddings TOK_CE. Since the dialog and schema encodings always depend on each other, they must be recomputed for every dialog-schema pair, making inference much slower.
Fusion-Encoder. In Figure 2(c), similar to the Dual-Encoder, the Fusion-Encoder also encodes the schema independently with a fixed BERT while finetuning another BERT for dialog encoding. However, instead of caching a single [CLS] vector as the schema representation, it caches all token representations for the schema, including the [CLS] token.
Moreover, to fuse the sequence of dialog token representations with the schema token representations, an extra stack of transformer layers is added on top to allow token-level fusion via self-attention, similar to Cross-Encoder. The top transformer layers produce an embedding TOK_FE for each token, including a schema-attended CLS_FE for the [CLS] token from the dialog history. With cached schema token-level representations, the model can efficiently produce schema-aware sentence- and token-level representations for each dialog-schema pair.
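The representations the three encoders work with can be illustrated in plain numpy. Toy dimensions and random vectors stand in for BERT outputs, and the fusion transformer stack itself is omitted:

```python
import numpy as np

# Toy sizes: hidden width H, dialog length T_d, schema length T_s.
H, T_d, T_s = 8, 5, 3

# Stand-ins for BERT outputs: one vector per token, position 0 = [CLS].
rng = np.random.default_rng(0)
dial_toks = rng.random((T_d, H));   cls_d = dial_toks[0]
schema_toks = rng.random((T_s, H)); cls_s = schema_toks[0]  # cached once

# Dual-Encoder: CLS_DE = [CLS_D; CLS_S]; TOK_DE pairs CLS_S with each dialog token.
cls_de = np.concatenate([cls_d, cls_s])                                 # (2H,)
tok_de = np.concatenate([np.tile(cls_s, (T_d, 1)), dial_toks], axis=1)  # (T_d, 2H)

# Fusion-Encoder: cache *all* schema token vectors, then fuse the joint
# (T_d + T_s)-token sequence with a small transformer stack (not shown).
fusion_input = np.concatenate([dial_toks, schema_toks], axis=0)         # (T_d+T_s, H)
```

The key contrast is what gets cached: a single CLS_S vector for Dual-Encoder versus the full schema token matrix for Fusion-Encoder, while Cross-Encoder caches nothing and re-encodes every pair.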

Model Overview
All three encoders above produce both sentence- and token-level representations for a given sentence pair. In this section, we abstract them as two representations, CLS and TOK, and present the universal classification heads that make decisions for each subtask.
Active Intent. To decide the intent for the current dialog turn, we match the current dialog history D with each intent description I_0 ... I_k. For each dialog-intent pair (D, I_k), we project the final sentence-level CLS representation to a single probability P_active(I_k) with a linear layer followed by a sigmoid. We predict NONE if P_active(I_k) is below a threshold of 0.5 for all intents, meaning no intent is active; otherwise, we predict the intent with the largest P_active(I_k). We predict the intent for each turn independently, without considering predictions from previous turns.
Requested Slot. As in Figure 1, multiple requested slots can exist in a single turn. We use the same strategy as in active intent prediction to compute a probability P_active(req) per slot; however, to support multiple requested slots, we predict all slots with P_active(req) > 0.5.
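The decision rules for these two heads can be sketched in a few lines. The inputs here are the post-sigmoid probabilities described above; names are illustrative:

```python
def predict_intent(intent_probs, threshold=0.5):
    """Active-intent rule: NONE if no intent clears the threshold,
    otherwise the highest-scoring intent."""
    best = max(intent_probs, key=intent_probs.get)
    return best if intent_probs[best] >= threshold else "NONE"

def predict_requested(slot_probs, threshold=0.5):
    """Requested-slot prediction is multi-label: keep every slot above threshold."""
    return [slot for slot, p in slot_probs.items() if p > threshold]
```

Note the asymmetry: the intent head is effectively single-label with a NONE fallback, while the requested-slot head keeps every slot above threshold.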
Categorical Slot. Categorical slots have a set of candidate values, so we cannot predict unseen values via n-way classification. Instead, we perform binary classification on each candidate value. Moreover, rather than directly matching values, we first need to check whether the corresponding slot has been activated. We use a typical two-stage procedure to incrementally build the state. Step 1: use CLS to predict the slot status as none, dontcare, or active; dontcare means the user has no preference for this slot, and none means no value update for the slot in the current turn. Step 2: if the status from Step 1 is active, match the dialog history against each candidate value and select the best value by ranking. We train with a cross-entropy loss. This two-stage strategy is efficient for Dual-Encoder and Fusion-Encoder, where the cached schema can be reused and ranked globally in a single batch. However, it does not scale for Cross-Encoder, especially with the large number of candidate values in the MultiWOZ dataset. Hence, for Cross-Encoder, during training we use only a binary cross-entropy loss for each single value and postpone the ranking to inference time.
Noncategorical Slot. Slot status prediction for noncategorical slots uses the same two-stage strategy. In addition, we use the token representations TOK of the dialog history to compute two softmax scores, f_i^start and f_i^end, for each token i, representing the score of predicting that token as the start and end position respectively. Finally, we select the valid span with the maximum sum of start and end scores.
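Step 2 of the noncategorical case, picking the valid span with the maximum sum of start and end scores, can be sketched as an exhaustive search over valid (start, end) pairs. The max_len cap is an assumption, not something the paper specifies:

```python
import math

def best_span(start_scores, end_scores, max_len=10):
    """Return (i, j) maximizing start_scores[i] + end_scores[j] over
    valid spans with i <= j, optionally capped at max_len tokens."""
    best, best_ij = -math.inf, (0, 0)
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best:
                best, best_ij = s + end_scores[j], (i, j)
    return best_ij
```

Restricting the search to i <= j is what makes the span "valid": taking independent argmaxes over the two score vectors could otherwise return an end position before the start.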

Experiments on Encoder Comparison
To fairly compare all three models, we follow the same schema input setting as in Table 2:

Intent: service description, intent description
Req: service description, slot description
Cat: slot description, categorical value
NonCat: service description, slot description

(Table 2: Schema description input used for the different tasks to compare Dual-Encoder, Cross-Encoder, and Fusion-Encoder.) In Appendix A.3, we also study other compositions of description input. We found that the service description does not help for the Intent, Req, and Cat tasks, while its impact on the NonCat task also varies between the SG-DST and MULTIWOZ 2.2 datasets. (Table 3 note: the baseline is from Rastogi et al. (2019); other models are trained with the architectures described in our paper.) We trained separate models for SG-DST and the remixed MultiWOZ datasets for all experiments in our paper (Appendix A.1 shows the detailed experiment setup). Because there are very few intents and requested slots in the MULTIWOZ 2.2 dataset, we ignore the Intent and Req tasks for MULTIWOZ 2.2.
Results. As shown in Table 3, Cross-Encoder performs best over all subtasks. Our Fusion-Encoder with partial attention outperforms the Dual-Encoder by a large margin, especially on categorical and noncategorical slot predictions. Additionally, on seen services, we found that Dual-Encoder and Fusion-Encoder can perform as well as Cross-Encoder on the Intent and Req tasks; however, they do not generalize to unseen services as well as Cross-Encoder does.
Inference Speed. To test inference speed, we conduct all experiments with the maximum affordable batch size to fully exploit 2 V100 GPUs (16GB GPU RAM each). During training, we log the inference time of each evaluation on the dev set. Both Dual-Encoder and Fusion-Encoder can do joint inference across the 4 subtasks to obtain an integral dialog state for a dialog-turn example. Dual-Encoder achieves the highest inference speed, 603.35 examples per GPU-second, because the encoding of dialog and schema is fully separated: a dialog needs to be encoded only once per dialog state example, while the schemata are precomputed once. In contrast, to predict the dialog state for a single turn, Cross-Encoder needs to encode more than 300 sentence pairs in a batch, and thus processes only 4.75 examples per GPU-second. Fusion-Encoder encodes the dialog history once, but must jointly encode the same number of dialog-schema pairs as Cross-Encoder, albeit with only a two-layer transformer encoder. Overall it achieves 10.54 examples per GPU-second, 2.2x faster than Cross-Encoder. Regarding the accuracy in Table 3, Fusion-Encoder performs much better than Dual-Encoder, especially on unseen services.

Supplementary Training (Q2)
Besides the pretrain-finetune framework used in §5, Phang et al. (2018) propose adding a supplementary training phase on an intermediate task after pretraining but before finetuning on the target task, which yields significant improvements on target tasks. Moreover, a large number of pretrained and finetuned transformer-based models are publicly accessible and well organized in model hubs for sharing, training, and testing. Given the new task of schema-guided dialog state tracking, in this section we study our four subtasks with different intermediate tasks for supplementary training.

Intermediate Tasks
As described in §5.2, all 4 of our subtasks take a pair of dialog and schema description as input and predict with the summarized sentence-pair CLS representation, while NonCat also requires span-based detection as in question answering. Hence, they share a similar problem structure with the following sentence-pair encoding tasks.
Natural Language Inference. Given a hypothesis/premise sentence pair, natural language inference is the task of determining whether the hypothesis is entailed, contradicted, or neutral given the premise.
Question Answering. Given a passage/question pair, the task is to extract a span-based answer from the passage.
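What the intermediate tasks and our subtasks share is the sentence-pair input layout. A minimal untokenized sketch, assuming BERT's usual [CLS]/[SEP] convention (in practice the tokenizer inserts these special tokens itself):

```python
def make_pair_input(sequence_a, sequence_b):
    """Lay out a sentence pair the way BERT consumes it:
    [CLS] sequence A [SEP] sequence B [SEP].
    For us, A is the dialog history and B a schema description;
    for NLI, A/B are premise/hypothesis; for QA, passage/question."""
    return f"[CLS] {sequence_a} [SEP] {sequence_b} [SEP]"
```

This shared structure is why a model finetuned on NLI or QA pairs can plug directly into our subtasks.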
Hence, when finetuning BERT on our subtasks, instead of directly using the originally pretrained BERT, we use BERT finetuned on the above intermediate tasks. Table 4 shows the performance gains when finetuning the 4 subtasks on models with SNLI and SQuAD2.0 supplementary training.

Results on Supplementary Training
We mainly find that SNLI helps on the Intent task and SQuAD2 mainly helps on the NonCat task, while neither helps much on the Cat task. Recently, Namazifar et al. (2020) also found that when modeling dialog understanding as question answering, it can benefit from supplementary training on the SQuAD2 dataset, especially in few-shot scenarios, a finding similar to ours for the NonCat task. The difference on the Req task is minor: it is a relatively easy task, so adding supplementary training does not help much. Moreover, for the Cat task, the second sequence of the input pair is the slot description together with a categorical slot value, so the meaning overlap between the full dialog history and the slot/value is much smaller than in SNLI. On the other hand, the CLS token in SQuAD-finetuned BERT is tuned for null predictions via start and end token classifiers, which differs from the single CLS classifier used in the Cat task.

Impact of Description Styles (Q3)
Previous work on schema-guided dialog (Rastogi et al., 2020) is based only on the descriptions provided in the SG-DST dataset. Recent work modeling dialog state tracking as reading comprehension (Gao et al., 2019) only formulates the descriptions in a simple question format built from existing intent/slot names; it is unknown how this performs compared to other description styles. Moreover, these works only conduct homogeneous evaluation, where training and test data share the same description style. In this section, we also investigate how a model trained on one description style performs on other styles, especially in a scenario where chat-bot developers may design their own descriptions. We first introduce the different description styles in our study, then train models on each style and evaluate them on test sets with both homogeneous and heterogeneous description styles. Given the best performance of Cross-Encoder shown in the previous section, and its popularity in the DSTC8 challenge, we adopt it as the model architecture in this section.

Benchmarking Styles
For each intent/slot, we describe its functionality with the following description styles:
Identifier. This is the least informative case of name-based description: we use only meaningless intent/slot identifiers, e.g., Intent_1, Slot_2, i.e., no description from any schema component. We want to investigate how a simple identifier-based description performs in schema-guided dialog modeling, as a performance lower bound for transferring to unseen services.
NameOnly. Using the original intent/slot names in the SG-DST and MULTIWOZ 2.2 datasets as descriptions, to show whether names alone are enough for schema-guided dialog modeling.
Q-Name. This corresponds to previous work by Gao et al. (2019). For each slot, it generates a question inquiring about the slot value, following the template "What is the value for slot_i?". In addition, our work extends this to intents via the template "Is the user intending to intent_j?".
Orig. The original descriptions in the SG-DST and MULTIWOZ 2.2 datasets.
Q-Orig. Different from Q-Name, this style is, first, based on the original descriptions; second, rather than always using the "what is" template to ask about the slot value, we choose "what", "which", "how many", or "when" depending on the entity type required by the slot. As in Q-Name, we prepend "Is the user intending to..." to the original intent description. In sum, this style simply casts the original description into question format; its motivation is to see whether the question format helps schema-guided dialog modeling.
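A hypothetical re-creation of the two question-style generators follows; the exact wording and the entity-type mapping in our released descriptions may differ:

```python
def q_name(kind, name):
    """Q-Name: simple templates over the raw intent/slot name."""
    if kind == "slot":
        return f"What is the value for {name}?"
    return f"Is the user intending to {name}?"

def q_orig(kind, description, entity_type="other"):
    """Q-Orig: cast the original description into question form; the
    question word depends on the slot's entity type (mapping illustrative).
    Naive concatenation here; the real descriptions read more naturally."""
    if kind == "intent":
        return f"Is the user intending to {description}?"
    wh = {"count": "How many", "time": "When", "entity": "Which"}.get(entity_type, "What")
    return f"{wh} {description}?"
```

The contrast between the two functions mirrors the styles: Q-Name only ever sees the name, while Q-Orig layers a question word onto the full original description.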
To test model robustness, we also create two paraphrased versions, Name-Para and Orig-Para, for NameOnly and Orig respectively. We first use nematus (Sennrich et al., 2017) to automatically paraphrase the descriptions via back-translation (English to Chinese and back to English), then manually check the paraphrases to ensure they retain the original meaning. Appendix A.5.1 shows examples of the different schema description styles.

Results on Description Styles
Unlike the composition used in Table 2, we do not use the service description here, to avoid its impact. For each style, we train separate models on the 4 subtasks and then evaluate them on different target styles. First, Table 5 summarizes the performance of homogeneous evaluation, and Table 6 shows how the question-style descriptions benefit from SQuAD2 finetuning. Then we conduct heterogeneous evaluation on the other styles, as shown in Table 7.

Homogeneous Evaluation
Is a name-based description enough? As shown in Table 5, Identifier is the worst case of name-based description; its extremely poor performance indicates that name-based descriptions can be very unstable. However, we found that simple meaningful name-based descriptions can actually perform best on the Intent and Req tasks, while performing worse on the Cat and NonCat tasks compared to the two richer descriptions. (We do not consider the meaningless Identifier style further due to its poor performance.) After careful analysis of the intents in the SG-DST dataset, we found that most services contain only two kinds of intents: an information-retrieval intent with a name prefix such as "Find-", "Get-", or "Search-", and a transactional intent such as "Add-", "Reserve-", or "Buy-". Interestingly, all intent names in the original schema-guided dataset strictly follow an action-object template composed of unabbreviated words, such as "FindEvents" and "BuyEventTickets". This simple name template is good enough to describe the core functionality of an intent in the SG-DST dataset. Additionally, Req is a relatively simple task: requested information relates to specific attributes, such as "has_live_music" and "has_wifi", where keywords co-occur in the slot name and the user utterance, so a richer explanation cannot help further. On the other hand, rich descriptions are more necessary for the Cat and NonCat tasks, because in many cases slot names are too simple to represent the functionality behind them; for example, the slot name "passengers" cannot fully convey the meaning "number of passengers in the ticket booking".
Does question format help? As shown in Table 5, comparing row Q-Orig to Orig, the extra question format improves performance on the Cat and NonCat tasks on both the SG-DST and MULTIWOZ 2.2 datasets, but not on the Intent and Req tasks. We believe the question format helps the model focus on specific entities in the dialog history.
However, when adding a simple question pattern to NameOnly (comparing rows Q-Name and NameOnly), there is no consistent improvement across the two datasets. Furthermore, we are curious whether BERT finetuned on SQuAD2 (SQuAD2-BERT) can further help the question format. Because NonCat is similar to span-based question answering, we focus on NonCat here. Table 6 shows that, after applying supplementary training on SQuAD2 (§6), almost all models improve on unseen splits while dropping slightly on seen services. Moreover, since Q-Orig is more similar than Q-Name to the natural questions in SQuAD2, we observe that Q-Orig gains more than Q-Name from the SQuAD2-pretrained model.

Heterogeneous Evaluation
In this subsection, we first simulate a scenario where there is no recommended description style for future unseen services; hence, unseen services may follow any description style. We average the evaluation performance over the three other description styles and summarize the results in Table 7. The ∆ column shows the performance change relative to homogeneous performance. It is not surprising that almost all models perform worse on heterogeneous styles than on homogeneous styles, due to the distribution shift between training and evaluation. The bold numbers show the best average performance in heterogeneous evaluation for each subtask. The trends are similar to the homogeneous evaluation analysis in §7.2.1: the name-based descriptions perform better than the richer descriptions on the intent classification task, while on the other tasks the Orig description is more robust, especially on the NonCat task.
Furthermore, we consider another scenario where a fixed description convention, such as NameOnly or Orig, is suggested to developers: they must obey the basic style convention but can still freely use their own words, e.g., abbreviations, synonyms, or extra modifiers. We train a model on each of NameOnly and Orig, then evaluate on the corresponding paraphrased version. In the last two rows of Table 7, the column 'para' shows performance on the paraphrased schema, while ∆ shows the change relative to homogeneous evaluation. Orig remains more robust than NameOnly when schema descriptions are paraphrased for unseen services.

Conclusion
In this paper, we studied three questions in schema-guided dialog state tracking: encoder architectures, the impact of supplementary training, and effective schema description styles. Our main findings are as follows. By caching token embeddings instead of a single CLS embedding, a simple partial-attention Fusion-Encoder achieves much better performance than Dual-Encoder while still inferring roughly twice as fast as Cross-Encoder. We quantified the gains from supplementary training on two intermediate tasks. By carefully choosing representative description styles from recent work, we are the first to conduct both homogeneous and heterogeneous evaluations of different description styles in schema-guided dialog. The results show that simple name-based descriptions perform well on the Intent and Req tasks, while the NonCat task benefits from richer descriptions. All tasks suffer from inconsistencies in description style between training and test, though to varying degrees.
Our study is conducted mainly on two datasets, SG-DST and MULTIWOZ 2.2, but the speed-accuracy trade-offs of the encoder architectures and the findings on supplementary training are expected to be dataset-agnostic, because they depend more on the nature of the subtasks than on the datasets. Based on our proposed suite of benchmarking descriptions, the homogeneous and heterogeneous evaluations shed light on the robustness of cross-style schema-guided dialog modeling; we believe our study will provide useful insights for future research.

A.1 Experiment Setup
All models are based on the BERT-base-cased model and trained on 2 V100 GPUs (16GB GPU RAM each). We train each model for a maximum of 10 epochs, using AdamW with a learning-rate warm-up portion of 0.1. During training, we evaluate checkpoints every 3000 steps on the dev split and select the model with the best performance on all seen and unseen services in the dev split.
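The warm-up portion of 0.1 mentioned above can be sketched as linear warm-up followed by linear decay, a common pairing with AdamW; the exact decay curve we used may differ:

```python
def lr_with_warmup(step, total_steps, base_lr, warmup_portion=0.1):
    """Linear warm-up over the first warmup_portion of steps,
    then linear decay to zero over the remaining steps."""
    warmup_steps = int(total_steps * warmup_portion)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
```

With a warm-up portion of 0.1, the learning rate ramps up over the first 10% of training steps before decaying, which helps stabilize early AdamW updates on freshly initialized heads.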

A.3.1 Composition Settings
For each subtask, the key description element must be included, e.g., the intent description for the Intent task and the value for categorical slot tasks. To show how each component helps schema-guided dialog state tracking, we incrementally add richer schema components one by one.
ID. This is the least informative case: we use only meaningless intent/slot identifiers, e.g., Intent_4 or Slot_2, without descriptions from any schema component. We want to investigate how a simple identifier-based description performs in schema-guided dialog modeling, and what the performance lower bound is when transferring to unseen services.
I/S Desc. We use only the original intent/slot descriptions from the SG-DST and MULTIWOZ 2.2 datasets for the corresponding tasks.
Service + I/S Desc. We add a service description to the above original description. The service description summarizes the functionality of the whole service, and hence may offer extra background information for intents and slots.
For categorical slot value detection, we simply append the values after each of the above compositions.

A.3.2 Results on Description Compositions
Table 9 shows the results of using different description compositions. First, there are consistent findings across datasets and subtasks: (1) using a meaningless identifier as the intent/slot description yields the worst performance on all tasks of both datasets and does not generalize well to unseen services; (2) using intent/slot descriptions largely boosts performance, especially on unseen services.
Table 9: Models using different compositions of schema; results on the test set of SG-DST and our remixed MULTIWOZ 2.2.
However, the impact of the service description varies by task. For example, it largely hurts performance on the intent classification task, but does not impact the requested slot and categorical slot tasks. From a manual analysis of the SG-DST and MULTIWOZ 2.2 datasets, we found that the service description describes the main functions of the service, especially the meanings of the supported intents. Hence, using the service description for intent classification causes confusion between the intent description and the other supported intents. Moreover, in the categorical slot value prediction task, the most important information is the slot description and the values. Adding extra information from the service description improves performance only marginally on seen services while not generalizing well to unseen services, which indicates that the model learns artifacts that are not generally useful for unseen services.
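The incremental composition settings of A.3.1 can be sketched as a small string-building helper. This is illustrative only: the separator tokens and argument names are our assumptions, not the paper's exact input format.

```python
def build_description(slot_id, slot_desc=None, service_desc=None, values=None):
    """Compose a schema description string, mirroring the incremental
    settings: ID -> I/S Desc -> Service + I/S Desc, with candidate
    values appended for categorical slots."""
    parts = [slot_desc if slot_desc else slot_id]  # ID or I/S Desc
    if service_desc:
        parts.insert(0, service_desc)              # Service + I/S Desc
    text = " . ".join(parts)
    if values:                                     # categorical slots only
        text += " . " + " , ".join(values)
    return text
```

For example, the least informative setting reduces to the bare identifier, while the richest setting prepends the service description and appends the value list.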
Finally, on the non-categorical slot task, the impact of the service description also varies by dataset. On SG-DST, with 16 domains and more than 30 services, the rich background context from the service description contains both domain- and service-specific information, which seems to help on both seen and unseen services. However, on MULTIWOZ 2.2, it hurts performance on the seen service restaurant the most, while improving performance on the unseen service hotel by 4 points. In this case it acts more like a regularizer than a definitive clue: MULTIWOZ 2.2 has only 8 domains with one service per domain, so service descriptions contain only domain-related information without much extra signal, and do not help the model detect the span for the slot.
Intent Classification. Performance drops when evaluating on heterogeneous description styles. For both heterogeneous and homogeneous evaluation, adding rich descriptions to the intent classification task does not seem to bring much benefit over simply using the name-based description. As discussed in §7.2.1, we believe the name template is good enough to describe the core functionality of an intent in the SG-DST dataset.

A.4 More Results of Supplementary Training
Requested Slot. Table 14 shows the results on the SG-DST dataset for the requested slot subtask. We ignore requested slots in the MULTIWOZ 2.2 dataset due to their sparsity. Overall, the requested slot subtask is relatively easy; performance on heterogeneous styles still drops, but not by much. For both heterogeneous and homogeneous evaluation, performance is not sensitive to rich descriptions.
Categorical Slot. The results on the SG-DST and MULTIWOZ 2.2 datasets are shown in Table 15. When creating MULTIWOZ 2.2 (Zang et al., 2020), slots with fewer than 50 distinct values were classified as categorical slots. We noticed that this leads to results inconsistent with the SG-DST dataset, and it is hard to draw a consistent conclusion across the two. By this definition, we believe SG-DST is more suitable for the categorical slot subtask; we can further verify this hypothesis when more datasets are created for schema-guided dialog research in the future.
Non-categorical Slot. We conduct the non-categorical slot identification subtask on both the SG-DST and MULTIWOZ 2.2 datasets. The results are shown in Table 16. Overall, rich descriptions perform better in both homogeneous and heterogeneous evaluations.
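The MULTIWOZ 2.2 slot-typing rule mentioned above (fewer than 50 distinct values makes a slot categorical) can be stated as a one-line predicate. The function name and interface are ours, for illustration.

```python
def classify_slot(possible_values, threshold=50):
    """MULTIWOZ 2.2-style rule: a slot with fewer than `threshold`
    distinct values is treated as categorical; otherwise it is a
    non-categorical (free-form span) slot."""
    kind = "categorical" if len(set(possible_values)) < threshold else "non-categorical"
    return kind
```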

A.5.4 Qualitative Analysis On Heterogeneous Evaluation
We conduct a qualitative analysis of heterogeneous evaluation on the name-based description. Table 17 shows how paraphrasing the name-based description impacts the categorical and non-categorical slot prediction tasks.
Table 16: Joint accuracy of the non-categorical slot subtask with different description styles on unseen services. We train the model on the SG-DST and MULTIWOZ 2.2 datasets respectively for the description style in each row, then evaluate on all 4 different description styles. The mean is the average performance over the remaining 3 description styles, and ∆ is the gap between the mean and the homogeneous performance.
The first 3 rows at the top show cases of adding modifiers to the name. When the added modifiers are keywords in other slots, e.g., "attraction" is also a keyword in "attraction_name", confusion arises: the first row shows that "attraction_location" may be wrongly predicted as "attraction_name". The model does not seem to understand compound nouns well, and appears to just attend to the individual keywords "attraction" and "movie". The 3 rows in the middle show cases of using synonyms. Changing "to" to "target", or "movie" to "film", causes extra confusion, which shows the model may fail on synonyms.
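The name perturbations analyzed here and in the abbreviation cases below can be sketched as simple rewrite rules over underscore-separated slot names. The specific modifier, synonym, and abbreviation pairs are taken from the examples in this section; the helper functions themselves are ours and only illustrate how such paraphrased styles could be generated, not the paper's exact benchmark suite.

```python
# Example rewrite tables drawn from the cases discussed in this section.
SYNONYMS      = {"to": "target", "movie": "film"}
ABBREVIATIONS = {"number": "num", "subtitle": "sub"}

def perturb(name, table):
    """Rewrite each underscore-separated word of a slot name via `table`,
    leaving words without an entry unchanged."""
    return "_".join(table.get(w, w) for w in name.split("_"))

def add_modifier(name, modifier):
    """Prefix a (possibly confusable) modifier, e.g. 'attraction',
    producing a compound name like 'attraction_location'."""
    return f"{modifier}_{name}"
```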
The last 4 rows at the bottom show cases of using abbreviations. Changing "number" to "num" does not affect the model, while changing "subtitle" to "sub" may cause the model to miss the key meaning of subtitle. The performance drop in the latter case may be due to the misuse of the "sub" prefix, in En-