A Probabilistic End-To-End Task-Oriented Dialog Model with Latent Belief States towards Semi-Supervised Learning

Structured belief states are crucial for user goal tracking and database query in task-oriented dialog systems. However, training belief trackers often requires expensive turn-level annotations of every user utterance. In this paper we aim at alleviating the reliance on belief state labels in building end-to-end dialog systems, by leveraging unlabeled dialog data towards semi-supervised learning. We propose a probabilistic dialog model, called the LAtent BElief State (LABES) model, where belief states are represented as discrete latent variables and jointly modeled with system responses given user inputs. Such latent variable modeling enables us to develop semi-supervised learning under the principled variational learning framework. Furthermore, we introduce LABES-S2S, which is a copy-augmented Seq2Seq model instantiation of LABES. In supervised experiments, LABES-S2S obtains strong results on three benchmark datasets of different scales. In utilizing unlabeled dialog data, semi-supervised LABES-S2S significantly outperforms both supervised-only and semi-supervised baselines. Remarkably, we can reduce the annotation demands to 50% without performance loss on MultiWOZ.


Introduction
Belief tracking (also known as dialog state tracking) is an important component in task-oriented dialog systems. The system tracks user goals through multiple dialog turns, i.e. infers structured belief states expressed in terms of slots and values (e.g. in Figure 1), to query an external database (Henderson et al., 2014). Different belief tracking models have been proposed in recent years, either trained independently (Mrkšić et al., 2017; Ren et al., 2018; Wu et al., 2019) or within end-to-end (E2E) trainable dialog systems (Wen et al., 2017a,b; Liu and Lane, 2017; Lei et al., 2018; Shu et al., 2019; Liang et al., 2020; Zhang et al., 2020).

Figure 1: The cues for inferring belief states from user inputs and system responses. User: "I need to find a Thai restaurant that's in the south section of the city." System: "There are three restaurants in the south part of town that serve Thai food. Do you have a cuisine preference?" Belief state: restaurant-food: Thai; restaurant-area: south. DB # match: 3. The system response reveals the belief state either directly in the form of word repetition (red), or indirectly in the form of the database query result (green) determined by the belief state.
Existing belief trackers mainly depend on supervised learning with human annotations of belief states for every user utterance. However, collecting these turn-level annotations is labor-intensive and time-consuming, and often requires domain knowledge to identify slots correctly. Building E2E trainable dialog systems, called E2E dialog systems for short, further magnifies the demand for labeled data (Gao et al., 2020; Zhang et al., 2020).
Notably, there are often easily-available unlabeled dialog data, such as conversations between customers and trained human agents accumulated in real-world customer service. In this paper, we are interested in reducing the reliance on belief state annotations in building E2E task-oriented dialog systems, by leveraging unlabeled dialog data towards semi-supervised learning. Intuitively, the dialog data, even unlabeled, can be used to enhance the performance of belief tracking and thus benefit the whole dialog system, because there are cues from user inputs and system responses which reveal the belief states, as shown in Figure 1.
Technically, we propose a latent variable model for task-oriented dialogs, called the LAtent BElief State (LABES) dialog model. A dialog generally consists of multiple (say T) turns of user inputs u_1:T and system responses r_1:T, which are observations, and belief states b_1:T, which are latent variables. Basically, LABES is a conditional generative model of belief states and system responses given user inputs, i.e. p_θ(b_1:T, r_1:T | u_1:T). Once built, the model can be used to infer belief states and generate responses. More importantly, such latent variable modeling enables us to develop semi-supervised learning on a mix of labeled and unlabeled data under the principled variational learning framework (Kingma and Welling, 2014; Sohn et al., 2015). In this manner, we hope that the LABES model can exploit the cues for belief tracking from user inputs and system responses. Furthermore, we develop LABES-S2S, which is a specific model instantiation of LABES, employing copy-augmented Seq2Seq (Gu et al., 2016) based conditional distributions in implementing p_θ(b_1:T, r_1:T | u_1:T).
We show the advantage of our model compared to other E2E task-oriented dialog models, and demonstrate the effectiveness of our semi-supervised learning scheme on three benchmark task-oriented datasets: CamRest676 (Wen et al., 2017b), In-Car (Eric et al., 2017) and MultiWOZ (Budzianowski et al., 2018), across various scales and domains. In supervised experiments, LABES-S2S obtains state-of-the-art results on CamRest676 and In-Car, and outperforms all existing models that do not leverage large pretrained language models on MultiWOZ. In utilizing unlabeled dialog data, semi-supervised LABES-S2S significantly outperforms both supervised-only and prior semi-supervised baselines. Remarkably, we can reduce the annotation requirements to 50% without performance loss on MultiWOZ, which is equivalent to saving around 30,000 annotations.

Related Work
On the use of unlabeled data for belief tracking. Classic methods such as self-training (Rosenberg et al., 2005), also known as pseudo-labeling (Lee, 2013), have been applied to belief tracking (Tseng et al., 2019). Recently, the pretraining-and-finetuning approach has received increasing interest (Heck et al., 2020; Peng et al., 2020; Hosseini-Asl et al., 2020). The generative model based semi-supervised learning approach, which blends unsupervised and supervised learning, has also been studied (Wen et al., 2017a; Jin et al., 2018). Notably, the two approaches are orthogonal and could be used jointly. Our work belongs to the second approach, aiming to leverage unlabeled dialog data beyond using general text corpora. The work closest to ours is SEDST (Jin et al., 2018), which also performs semi-supervised learning for belief tracking. Remarkably, our model is optimized under the principled variational learning framework, while SEDST is trained with an ad-hoc combination of posterior regularization and auto-encoding. Experiments in §6.2 show the superiority of our model over SEDST. See Appendix A for differences in model structures between SEDST and LABES-S2S.
End-to-end task-oriented dialog systems. Our model belongs to the family of E2E task-oriented dialog models (Wen et al., 2017a,b; Li et al., 2017; Lei et al., 2018; Mehri et al., 2019; Wu et al., 2019; Peng et al., 2020; Hosseini-Asl et al., 2020). We borrow some elements from the Sequicity (Lei et al., 2018) model, such as representing the belief state as a natural language sequence (a text span) and using copy-augmented Seq2Seq learning (Gu et al., 2016). But compared to Sequicity and all its follow-up works (Jin et al., 2018; Shu et al., 2019; Zhang et al., 2020; Liang et al., 2020), a distinguishing feature of our LABES-S2S model is that the transition between belief states across turns and the dependency between system responses and belief states are statistically well modeled. This new design results in a completely different graphical model structure, which enables rigorous probabilistic variational learning. See Appendix A for details.
Latent variable models for dialog. Latent variables have been used in dialog models before. For non-task-oriented dialogs, latent variables are introduced to improve diversity (Serban et al., 2017; Zhao et al., 2017; Gao et al., 2019), control language styles (Gao et al., 2019) or incorporate knowledge (Kim et al., 2020) in dialog generation. For task-oriented dialogs, there are prior studies which use latent internal states via hidden Markov models (Zhai and Williams, 2014) or variational autoencoders (Shi et al., 2019) to discover the underlying dialog structures. In Wen et al. (2017a) and Zhao et al. (2019), dialog acts are treated as latent variables, together with variational learning and reinforcement learning, aiming to improve response generation. To the best of our knowledge, we are the first to model belief states as discrete latent variables and to learn these structured representations via the variational principle.

Latent Belief State Dialog Models
We first introduce LABES as a general dialog modeling framework in this section. For dialog turn t, let u_t be the user utterance, b_t be the current belief state after observing u_t, and r_t be the corresponding system response. In addition, denote by c_t the dialog context, or model input, at turn t; in this work c_t = {r_{t−1}, u_t}. Note that c_t can include longer dialog history depending on the specific implementation. Let d_t be the database query result, which can be obtained through a database-lookup operation given the belief state b_t.
Our goal is to model the joint distribution of belief states and system responses given the user inputs, p_θ(b_1:T, r_1:T | u_1:T), where T is the total number of turns and θ denotes the model parameters.
In LABES, we assume the joint distribution follows the directed probabilistic graphical model illustrated in Figure 2, which can be formulated as:

p_θ(b_1:T, r_1:T | u_1:T) = ∏_{t=1}^{T} p_θ(b_t | b_{t−1}, c_t) p_θ(r_t | c_t, b_t, d_t)

where b_0 is an empty state. Intuitively, we refer to the conditional distribution p_θ(b_t | b_{t−1}, c_t) as the belief state decoder, and to p_θ(r_t | c_t, b_t, d_t) as the response decoder in the above decomposition. Note that the probability p(d_t | b_t) is omitted, as the database result d_t is deterministically obtained given b_t. Thus the system response can be generated as a three-step process: first predict the belief state b_t, then use b_t to query the database and obtain d_t, and finally generate the system response r_t based on all the conditions.
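The three-step generation process can be sketched as follows. This is a minimal illustration, not the paper's implementation: `decode_belief`, `query_db` and `decode_response` are hypothetical stand-ins for the learned conditional distributions, with toy keyword matching in place of the actual neural decoders.

```python
def decode_belief(prev_belief, context):
    # Stand-in for the belief state decoder p_theta(b_t | b_{t-1}, c_t):
    # keep the previous state and add slots mentioned in the context.
    belief = dict(prev_belief)
    if "thai" in context.lower():
        belief["restaurant-food"] = "thai"
    if "south" in context.lower():
        belief["restaurant-area"] = "south"
    return belief

def query_db(belief, database):
    # Deterministic database lookup given the belief state; this is why
    # p(d_t | b_t) can be omitted from the factorization.
    return [e for e in database
            if all(e.get(k) == v for k, v in belief.items())]

def decode_response(context, belief, db_result):
    # Stand-in for the response decoder p_theta(r_t | c_t, b_t, d_t).
    return "There are %d matching restaurants." % len(db_result)

def dialog_turn(prev_belief, context, database):
    b_t = decode_belief(prev_belief, context)  # step 1: predict belief state
    d_t = query_db(b_t, database)              # step 2: query the database
    r_t = decode_response(context, b_t, d_t)   # step 3: generate the response
    return b_t, r_t
```

In the real model, each stand-in is a learned copy-augmented Seq2Seq distribution, as described in the LABES-S2S instantiation below.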

Unsupervised Learning
We introduce an inference model q_φ(b_t | b_{t−1}, c_t, r_t) (described by dash arrows in Figure 2) to approximate the true posterior p_θ(b_t | b_{t−1}, c_t, r_t). Then we can derive the variational evidence lower bound (ELBO) for unsupervised learning as follows:

J_un = ∑_{t=1}^{T} E_{q_φ} [ log p_θ(r_t | c_t, b_t, d_t) − α · KL( q_φ(b_t | b_{t−1}, c_t, r_t) ∥ p_θ(b_t | b_{t−1}, c_t) ) ]

where the expectation is over belief state samples drawn from q_φ, and α is a hyperparameter controlling the weight of the KL term, as introduced by Higgins et al. (2017).
Optimizing J_un requires drawing posterior belief state samples b_1:T ∼ q_φ(b_1:T | u_1:T, r_1:T) to estimate the expectations. Here we use a sequential sampling strategy similar to Kim et al. (2020), where each b_t sampled from q_φ(b_t | b_{t−1}, c_t, r_t) at turn t is used as the condition to generate the next turn's belief state b_{t+1}. Calculating gradients with discrete latent variables is non-trivial; methods such as the score function estimator (Williams, 1992) and the categorical reparameterization trick (Jang et al., 2017) have been proposed. In this paper, we employ the simple Straight-Through estimator (Bengio et al., 2013), where the sampled discrete token indexes are used for the forward computation, and the continuous softmax probability of each token is used for the backward gradient calculation. Although the Straight-Through estimator is biased, we find it works well in our experiments; we therefore leave the exploration of other optimization methods as future work.
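The Straight-Through combination can be illustrated with a small pure-Python sketch. In practice this is done in an autograd framework (e.g. returning `hard - probs.detach() + probs` in PyTorch), so that the forward value is the hard one-hot while gradients flow through the continuous probabilities; here we only show the forward arithmetic.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def straight_through(logits):
    # Forward pass: the hard one-hot of the chosen token index.
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    hard = [1.0 if i == idx else 0.0 for i in range(len(probs))]
    # In an autograd framework one would return
    #   hard - stop_gradient(probs) + probs
    # so the forward value equals `hard`, while the backward pass
    # differentiates through `probs` (the Straight-Through trick).
    st = [h - p + p for h, p in zip(hard, probs)]
    return st, probs
```

Note that treating the gradient of the hard sample as the gradient of the soft probabilities is exactly the source of the bias mentioned above.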

Semi-Supervised Learning
When b_t labels are available, we can easily train the generative model p_θ and the inference model q_φ via supervised maximum likelihood:

J_sup = ∑_{t=1}^{T} [ log p_θ(b_t | b_{t−1}, c_t) + log p_θ(r_t | c_t, b_t, d_t) + log q_φ(b_t | b_{t−1}, c_t, r_t) ]

When a mix of labeled and unlabeled data is available, we perform semi-supervised learning using a combination of the supervised objective J_sup and the unsupervised objective J_un. Specifically, we first pretrain p_θ and q_φ on the small-sized labeled data until convergence. Then we draw supervised and unsupervised minibatches from the labeled and unlabeled data and perform stochastic gradient ascent over J_sup and J_un, respectively. We use supervised pretraining first because training q_φ(b_t | b_{t−1}, c_t, r_t) to correctly generate slot values and special outputs such as "dontcare" and end-of-sentence tokens as early as possible is important for sample efficiency in the subsequent semi-supervised learning.
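The training schedule above can be sketched as follows; `sup_step` and `unsup_step` are hypothetical callbacks standing in for a gradient-ascent step on J_sup and J_un respectively, and the batching is deliberately simplified.

```python
def semi_supervised_schedule(labeled, unlabeled, sup_step, unsup_step, epochs=1):
    # Phase 1: supervised pretraining on the labeled data until
    # (here: one pass standing in for) convergence.
    for batch in labeled:
        sup_step(batch)
    # Phase 2: alternate supervised and unsupervised minibatches.
    for _ in range(epochs):
        for sup_batch, unsup_batch in zip(labeled, unlabeled):
            sup_step(sup_batch)      # stochastic gradient ascent on J_sup
            unsup_step(unsup_batch)  # stochastic gradient ascent on J_un
```

The pretraining phase is what makes q_φ produce well-formed slot values before it is asked to supply posterior samples for J_un.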

LABES-S2S: A Copy-Augmented Seq2Seq Instantiation
In the above probabilistic dialog model LABES, the belief state decoder p_θ(b_t | b_{t−1}, c_t) and the response decoder p_θ(r_t | c_t, b_t, d_t) can be flexibly implemented. In this section we introduce LABES-S2S, an instantiation of the general LABES model based on copy-augmented Seq2Seq conditional distributions (Gu et al., 2016), which is shown in Figure 3(a) and described in the following. The responses are generated through two Seq2Seq processes: 1) decode the belief state given the dialog context and the last turn's belief state, and 2) decode the system response given the dialog context, the decoded belief state and the database query result.

Belief State Decoder
The belief state decoder is implemented via a Seq2Seq process, as shown in Figure 3(b). Inspired by Shu et al. (2019), we use a single GRU decoder to decode the value for each informable slot separately, feeding the embedding of each slot name as the initial input. In the multi-domain setting, the domain name embedding is concatenated with the slot name embedding to distinguish slots with identical names in different domains (Wu et al., 2019). We use two bi-directional GRUs (Cho et al., 2014) to encode the dialog context c_t and the previous belief state b_{t−1} into sequences of hidden vectors h^enc_{c_t} and h^enc_{b_{t−1}} respectively, which are the inputs to the belief state decoder. As there are multiple slots, and their values can also consist of multiple tokens, we denote the i-th token of slot s by b^{s,i}_t. To decode each token b^{s,i}_t, we first compute an attention vector a_{s,i} over the encoder vectors. Then the attention vector and the embedding of the last decoded token e(b^{s,i−1}_t) are concatenated and fed into the decoder GRU to get the decoder hidden state h^dec_{b^{s,i}_t}, denoted as h^dec_{s,i} for simplicity:

h^dec_{s,i} = GRU( a_{s,i} • e(b^{s,i−1}_t), h^dec_{s,i−1} )
ĥ^dec_{s,i} = Dropout( h^dec_{s,i} • e(b^{s,i−1}_t) )

where • denotes vector concatenation. We use the last hidden state of the dialog context encoder as h^dec_{s,0}, and the slot name embedding as e(b^{s,0}_t). We reuse e(b^{s,i−1}_t) to form ĥ^dec_{s,i} to give more emphasis to the slot name embedding, and add a dropout layer to reduce overfitting. ĥ^dec_{s,i} is then used to compute a generative score ψ_gen for each token w in the vocabulary V, and a copy score ψ_cp for words appearing in c_t and b_{t−1}. Finally, these two scores are combined and normalized to form the final decoding probability:

ψ_gen(w) = v_w^T W_gen ĥ^dec_{s,i}
ψ_cp(x_j) = tanh( h^enc_{x_j} W_cp ) · ĥ^dec_{s,i}
P(b^{s,i}_t = w) = (1/Z) ( exp(ψ_gen(w)) + ∑_{j: x_j = w} exp(ψ_cp(x_j)) )

where W_gen and W_cp are trainable parameters, v_w is the one-hot representation of w, x_j is the j-th token in c_t ∪ b_{t−1}, and Z is the normalization term.
With the copy mechanism, it is easier for the model to extract words mentioned by the user and to keep unchanged values from the previous belief state. Meanwhile, the decoder can also generate tokens that do not appear in the input sequences, e.g. the special token "dontcare" or end-of-sentence symbols. Since the decoding for each slot is independent of the others, all the slots can be decoded in parallel to speed up inference.
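The combine-and-normalize step for the generative and copy scores can be sketched as follows. This is a simplified, score-level illustration of copy-augmented decoding (Gu et al., 2016), not the paper's exact implementation; the scores here are plain numbers standing in for the learned ψ values.

```python
import math

def copy_augmented_distribution(gen_scores, copy_scores):
    # gen_scores: {token: psi_gen(token)} over the vocabulary.
    # copy_scores: list of (source_token, psi_cp) over the copy source
    # (the tokens of c_t and b_{t-1}).
    # Exponentiated scores are summed per token and jointly normalized,
    # so a token can receive mass from both generation and copying.
    unnorm = {w: math.exp(s) for w, s in gen_scores.items()}
    for tok, s in copy_scores:
        unnorm[tok] = unnorm.get(tok, 0.0) + math.exp(s)
    z = sum(unnorm.values())
    return {w: v / z for w, v in unnorm.items()}
```

A token that appears in the copy source (e.g. a slot value the user just mentioned) gets its probability boosted, while out-of-source specials like "dontcare" remain reachable through the generative scores alone.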
The posterior network q_φ(b_t | b_{t−1}, c_t, r_t) is constructed through a similar process; the only difference is that the system response r_t is also encoded and used as an additional input to the decoder. Note that the posterior network is separately parameterized with φ.

Response Decoder
The response decoder is implemented via another Seq2Seq process. After obtaining the belief state b_t, we use it to query a database to find entities that meet the user's need, e.g. Thai restaurants in the south area. The query result d_t is represented as a 5-dimensional one-hot vector indicating 0, 1, 2, 3 and >3 matched results respectively. We only need the number of matched entities, instead of their specific information, as the input to the response decoder, because we generate delexicalized responses with placeholders for specific slot values (as shown in Table 4) to improve data efficiency (Wen et al., 2015). The values can be filled in through simple rule-based post-processing afterwards.
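The encoding of the query result into d_t is a one-liner; the sketch below shows the bucketing described above (function name is ours, not the paper's).

```python
def db_onehot(num_matches):
    # 5-dimensional one-hot indicating 0, 1, 2, 3 or >3 matched entities;
    # all counts above 3 share the last bucket.
    idx = min(num_matches, 4)
    return [1 if i == idx else 0 for i in range(5)]
```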
Instead of directly decoding the response from the belief state decoder's hidden states (Lei et al., 2018), we again use the bi-directional GRU (the one used to encode b_{t−1}) to encode the current belief state b_t into hidden vectors h^enc_{b_t}. Then for each token r^i_t in the response, the decoder state h^dec_{r_t,i} is computed analogously to the belief state decoder, with the database vector d_t additionally concatenated to the decoder input:

h^dec_{r_t,i} = GRU( a_{r,i} • e(r^{i−1}_t) • d_t, h^dec_{r_t,i−1} )

Note that dropout is not used for ĥ^dec_{r_t,i}, since response generation is less likely to overfit than belief tracking in practice. We omit the probability formulas because they are almost the same as in the belief state decoder, except that the copy source changes from c_t ∪ b_{t−1} to c_t ∪ b_t.

Datasets
We evaluate the proposed model on three benchmark task-oriented dialog datasets: the Cambridge Restaurant (CamRest676) (Wen et al., 2017b), Stanford In-Car Assistant (In-Car) (Eric et al., 2017) and MultiWOZ (Budzianowski et al., 2018), with 676/3031/10438 dialogs respectively. In particular, MultiWOZ is one of the most challenging datasets to date, given its multi-domain setting, complex ontology and diverse language styles. As there are some belief state annotation errors in MultiWOZ, we use the corrected version MultiWOZ 2.1 (Eric et al., 2019) in our experiments. See Appendix B for more detailed introductions and statistics.

Evaluation Metrics
We evaluate model performance under the end-to-end setting, i.e. the model needs to first predict belief states and then generate responses based on its own belief predictions. For evaluating belief tracking performance, we use the commonly used joint goal accuracy, which is the proportion of dialog turns where all slot values are correctly predicted. For evaluating response generation, we use BLEU (Papineni et al., 2002) to measure general language quality. The response quality towards task completion is measured by dataset-specific metrics to facilitate comparison with prior work. For CamRest676 and In-Car, we use Match and SuccF1 following Lei et al. (2018). For MultiWOZ, we use Inform and Success as in Budzianowski et al. (2018), and also a combined score, computed as (Inform + Success) × 0.5 + BLEU, as an overall measure of response quality suggested by Mehri et al. (2019).
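The combined score is a straightforward arithmetic combination; the following one-liner makes the formula concrete (function name is ours).

```python
def combined_score(inform, success, bleu):
    # Combined score = (Inform + Success) * 0.5 + BLEU (Mehri et al., 2019),
    # with Inform/Success as percentages and BLEU on a 0-100 scale.
    return (inform + success) * 0.5 + bleu
```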

Baselines
In our experiments, we compare our model to various Dialog State Tracking (DST) and End-to-End (E2E) baseline models. Recently, large-scale pretrained language models (LMs) such as BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019) have been used to improve the performance of dialog models, albeit at the cost of tens-fold larger model sizes and computation. We distinguish them from light-weight models trained from scratch in our comparison.
Semi-Supervised Methods: First, we compare with SEDST (Jin et al., 2018) for semi-supervised belief tracking performance. SEDST is also an E2E dialog model based on copy-augmented Seq2Seq learning (see Appendix A for more details). Over unlabeled dialog data, SEDST is trained through posterior regularization (PR), where a posterior network is used to model the posterior belief distribution given system responses, which then guides the learning of the prior belief tracker through minimizing the KL divergence between them. Second, based on the LABES-S2S model, we compare our variational learning (VL) method to a classic semi-supervised learning baseline, self-training (ST), which performs as its name suggests. Specifically, after supervised pretraining over small-sized labeled dialogs, we run the system to generate pseudo belief states b_t over unlabeled dialogs, and then train the response decoder p_θ(r_t | b_t, c_t, d_t) in a supervised manner. The gradients propagate through the discrete belief states via the Straight-Through gradient estimator (Bengio et al., 2013) over the computational graph, thus also adjusting the belief state decoder p_θ(b_t | b_{t−1}, c_t).
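The self-training baseline can be sketched generically as a pseudo-labeling round; `predict_belief` and `train_step` are hypothetical stand-ins for the model's belief predictor and a gradient update, and the end-to-end gradient flow through the Straight-Through estimator is abstracted away.

```python
def self_training_round(predict_belief, train_step, labeled, unlabeled):
    # Pseudo-label the unlabeled dialogs with the current model,
    # then retrain on the union of labeled and pseudo-labeled data.
    pseudo = [(x, predict_belief(x)) for x in unlabeled]
    for example in labeled + pseudo:
        train_step(example)
    return pseudo
```

The key contrast with Semi-VL is that self-training treats its own predictions as ground truth, whereas variational learning weighs posterior belief samples against the response likelihood and the KL regularizer.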

Results and Analysis
In our experiments, we report both the best result and statistical results obtained from multiple independent runs with different random seeds. Details are described in the caption of each table. The implementation details of our model are available in Appendix C. Results are organized to show the advantage of our proposed LABES-S2S model over existing models (§6.1) and the effectiveness of our semi-supervised learning method (§6.2).

Benchmark Performance
We first train our LABES-S2S model under full supervision and compare with other baseline models on the benchmarks.The results are given in Table 1 and Table 2.
As shown in Table 1, LABES-S2S obtains new state-of-the-art joint goal accuracy on CamRest676 and the highest Match scores on both the CamRest676 and In-Car datasets. Its BLEU scores are also beyond or close to those of the previous state-of-the-art models. The relatively low SuccF1 is because LABES-S2S does not apply additional dialog act modeling or reinforcement fine-tuning to encourage slot token generation, as other E2E models do.
Table 2 shows the MultiWOZ results. Among all the models that do not use large pretrained LMs, LABES-S2S performs the best in belief tracking joint goal accuracy and in 3 out of the 4 response generation metrics. Although its response generation performance is not as good as the recent GPT-2 based SimpleTOD and SOLOIST, our model is much smaller and thus computationally cheaper.

Semi-Supervised Experiments
In our semi-supervised experiments, we first split the data according to a fixed proportion, then train the models using only labeled data (SupOnly), or using both labeled and unlabeled data (Semi) with the proposed variational learning method (Semi-VL), self-training (Semi-ST) and posterior regularization (Semi-PR) introduced in §5.3, respectively. We conduct experiments with 50% and 25% labeled data on CamRest676 and In-Car, following Jin et al. (2018), and vary the labeled data proportion from 10% to 100% on MultiWOZ. The results are shown in Table 3 and Figure 4.
In Table 3, we can see that the semi-supervised learning methods consistently outperform the supervised learning baseline in all experiments on the two datasets. In particular, the improvement of Semi-VL over SupOnly on our model is significantly larger than that of Semi-PR over SupOnly on SEDST in most metrics, and Semi-VL obtains a joint goal accuracy 1.3%∼3.9% higher than Semi-ST. These results indicate the superiority of our LABES modeling framework in utilizing unlabeled data over the other semi-supervised baselines. Since LABES mainly improves the modeling of belief states, it is more relevant to examine the belief tracking metrics such as joint goal accuracy and Match rate (the latter partly determined by belief tracking accuracy). Note that Semi-VL and Semi-ST are fed with the same set of system responses, thus they obtain similar SuccF1 and BLEU scores in Table 3, which mainly measure response quality.
The results on MultiWOZ shown in Figure 4 also support the above conclusions. From the plot of metric scores w.r.t. labeling proportions, we can clearly see how many labels can be saved. Our LABES-S2S model trained with Semi-VL obtains a joint goal accuracy of 49.47% and a combined score of 89.21 with only 50% of labeled data, which is very close to the 50.05% and 88.01 obtained under 100% supervision. This indicates that we can remove 50% of the labels without losing performance, which amounts to saving around 30,000 belief state annotations given the size of MultiWOZ. Moreover, it can be seen from Figure 4 that Semi-VL improves both belief tracking and response generation performance even when only 10% of dialogs are labeled, and the smaller the amount of labels, the larger the gain obtained by Semi-VL.

Conclusion and Future Work
In this paper we are interested in reducing the belief state annotation cost of building E2E task-oriented dialog systems. We propose a conditional generative model of dialogs, LABES, where belief states are modeled as latent variables, and unlabeled dialog data can be effectively leveraged to improve belief tracking through semi-supervised variational learning. Furthermore, we develop LABES-S2S, a copy-augmented Seq2Seq model instantiation of LABES. We show the strong benchmark performance of LABES-S2S and the effectiveness of our semi-supervised learning method on three benchmark datasets. In our experiments on MultiWOZ, we can save around 50% of the labels, i.e. around 30,000 belief state annotations, without performance loss.
There are some interesting directions for future work. First, the LABES model is general and can be enhanced by, e.g., incorporating large-scale pre-trained language models, or allowing other options for the belief state decoder and the response decoder, such as Transformer-based ones. Second, we can analogously introduce dialog acts a_1:T as latent variables to define the joint distribution p_θ(b_1:T, a_1:T, r_1:T | u_1:T), which can be trained with semi-supervised learning and reinforcement learning as well.

A Comparison with Sequicity and SEDST

First, in model structure, our LABES-S2S model introduces an additional b_t encoder and uses the encoder hidden states h^enc_{b_t} to generate the system response and the next turn's belief state; thus the conditional probability p_θ(r_t | b_t, c_t) and the state transition probability p_θ(b_t | b_{t−1}, c_t) are well defined by two complete Seq2Seq processes.
Second, the difference between the models can also be clearly seen from the probabilistic graphical model structures shown in Figure 6. LABES-S2S is a conditional generative model where the belief states are latent variables. In contrast, Sequicity/SEDST do not treat the belief states as latent variables.
Third, the above differences in models lead to differences in learning methods for Sequicity/SEDST and LABES-S2S. Sequicity can only be trained on labeled data via multi-task supervised learning. SEDST resorts to an ad-hoc combination of posterior regularization and auto-encoding for semi-supervised learning. Remarkably, LABES-S2S is optimized under the principled variational learning framework. Note that the dependencies between responses and belief states are modeled only very weakly in Sequicity/SEDST, owing solely to the copy mechanism. For simplicity, we omit such relations in both Figure 5 and Figure 6.

B Datasets
In our experiments, we evaluate different models on three benchmark task-oriented datasets with different scales and ontology complexities (Table 6). The Cambridge Restaurant (CamRest676) dataset (Wen et al., 2017b) contains single-domain dialogs where the system assists users in finding a restaurant. The Stanford In-Car Assistant (In-Car) dataset (Eric et al., 2017) consists of dialogs between a user and an in-car assistant system covering three tasks: calendar scheduling, weather information retrieval and point-of-interest navigation. The MultiWOZ (Budzianowski et al., 2018) dataset is a large-scale human-human multi-domain dataset containing dialogs in seven domains: attraction, hotel, hospital, police, restaurant, train and taxi. It is more challenging due to its multi-domain setting, complex ontology and diverse language styles. As there are some belief state annotation errors in MultiWOZ, we use the corrected version MultiWOZ 2.1 (Eric et al., 2019) in our experiments. We follow the data preprocessing of Zhang et al. (2020), whose data cleaning is developed based on Wu et al. (2019).

C Implementation Details
In our implementation of LABES-S2S, we use 1-layer bi-directional GRUs as encoders and standard GRUs as decoders.
The hidden sizes are 100/100/200, the vocabulary sizes are 800/1400/3000, and the learning rates of the Adam optimizer are 3e-3/3e-3/5e-5 for CamRest676/In-Car/MultiWOZ respectively. In all experiments, the embedding size is 50 and we use GloVe (Pennington et al., 2014) to initialize the embedding matrix. The dropout rate is 0.35 and the KL weight α for variational inference is 0.5, selected via grid search from {0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5} and {0.1, 0.3, 0.5, 0.7, 1.0} respectively. The learning rate decays by half every 2 epochs if no improvement is observed on the development set. Training stops early when no improvement is observed on the development set for 4 epochs. We use beam search with width 10 for CamRest676 and greedy decoding for the other datasets. All the models are trained on an NVIDIA Tesla P100 GPU.
Figure 3: (a) The computational graph of LABES-S2S; (b) the structure of the belief state decoder. In (b), rectangles in different colors denote different word embeddings, and the embeddings of domain names and slot names are concatenated as the initial input. Note that the same (i.e. weight-tied) decoder is shared across all slots. Decoding stops when a slot-specific end-of-sentence symbol is generated, which may be the first output if the slot does not appear in the dialog.

Figure 4 :
Figure 4: Performance of different methods w.r.t. labeling proportion on MultiWOZ 2.1. The dashed line corresponds to the baseline trained with 100% labeled data.

Figure 6 :
Figure 6: Comparison of probabilistic graphical model structures.
Figure 2: The probabilistic graphical model of LABES. Solid arrows describe the generative model p_θ, and dash arrows describe the approximate posterior model q_φ. Note that we set c_t = {r_{t−1}, u_t} in our model, and omit u_t from the graph for simplicity.

Table 1 :
Results on CamRest676 and In-Car. The model with the highest joint goal accuracy on the development set of CamRest676 is shown as the best result, as similarly reported in prior work. Statistical results are reported as the mean and standard deviation of 5 runs. * denotes results obtained by our run of the open-source code.

Table 2 :
Results on MultiWOZ 2.1. The model with the highest validation joint goal accuracy is shown as the best result, as similarly reported in prior work. The standard deviations for the statistical results are in Table 5 in the appendix.
* denotes results obtained by our run of the open-source code.

Table 3:
Semi-supervised results on CamRest676 and In-Car, following the labeled-data proportions of Jin et al. (2018). SupOnly denotes training with only labeled data, and Semi denotes training with both labeled and unlabeled data in each dataset. ST, VL and PR denote self-training, variational learning and posterior regularization (Jin et al., 2018) respectively. Results of SEDST are obtained by our run of the open-source code. All the scores in this table are the mean of 5 runs.
u_1: I am looking for an expensive restaurant that serves Russian food.
b_1: {food: Russian, pricerange: expensive}
r_1: There is no expensive restaurant that serves Russian food. Can I help you with anything else?
u_2: Yes, do you have British type food?
b_2: {food: British, pricerange: expensive}
r_2: Yes, there are 6 options. Does the part of town matter?

Table 4 :
Comparison of two example turns generated by our model with supervised learning only (SupOnly) and semi-supervised variational learning (Semi-VL).

Table 6 :
Statistics of dialog datasets.Info and Req are shorthands for informable and requestable respectively.