UniConv: A Unified Conversational Neural Architecture for Multi-domain Task-oriented Dialogues

Building an end-to-end conversational agent for multi-domain task-oriented dialogue has been an open challenge for two main reasons. First, tracking dialogue states of multiple domains is non-trivial as the dialogue agent must obtain complete states from all relevant domains, which may include slots shared among domains as well as slots unique to a single domain. Second, the dialogue agent must also process various types of information across domains, including dialogue context, dialogue states, and database, to generate natural responses to users. Unlike existing approaches that are often designed to train each module separately, we propose "UniConv", a novel unified neural architecture for end-to-end conversational systems in multi-domain task-oriented dialogues, which is designed to jointly train (i) a Bi-level State Tracker which tracks dialogue states by learning signals at both slot and domain level independently, and (ii) a Joint Dialogue Act and Response Generator which incorporates information from various input components and models dialogue acts and target responses simultaneously. We conduct comprehensive experiments in dialogue state tracking, context-to-text, and end-to-end settings on the MultiWOZ2.1 benchmark, achieving superior performance over competitive baselines in all tasks. Our code and models will be released.


Introduction
A conventional approach to task-oriented dialogues is to solve four distinct tasks: (1) natural language understanding (NLU), which parses a user utterance into a semantic frame, (2) dialogue state tracking (DST), which updates the slots and values from semantic frames to the latest values for knowledge base retrieval, (3) dialogue policy, which determines an appropriate dialogue act for the next system response, and (4) response generation, which generates a natural language sequence conditioned on the dialogue act. This traditional modular pipeline has achieved remarkable successes in task-oriented dialogues (Wen et al., 2017; Liu and Lane, 2017; Williams et al., 2017; Zhao et al., 2017). However, such dialogue systems are not fully optimized as the modules are loosely integrated and often not trained jointly in an end-to-end manner, and thus may suffer from increasing error propagation between the modules as the complexity of the dialogues grows.
A typical case of a complex dialogue setting is when the dialogue extends over multiple domains. A dialogue state in a multi-domain dialogue should include slots of all applicable domains up to the current turn (see Table 1). Each domain can have shared slots that are common among domains or unique slots that are not shared with any other domain. Directly applying single-domain DST to multi-domain dialogues is not straightforward because the dialogue states extend to multiple domains. A possible approach is to process a dialogue of $N_D$ domains multiple times, each time obtaining a dialogue state of one domain. However, this approach does not allow learning co-reference in dialogues in which users can switch from one domain to another.
As the number of dialogue domains increases, traditional pipeline approaches propagate errors from dialogue states to dialogue policy and subsequently to the natural language generator. Recent efforts (Eric et al., 2017; Madotto et al., 2018; Wu et al., 2019b) address this problem with an integrated sequence-to-sequence structure. These approaches often consider knowledge bases as memory tuples rather than relational entity tables. While achieving impressive performance, these approaches do not scale to large knowledge bases, e.g. thousands of entities, as the memory cost to query entity attributes increases substantially. Another limitation of these approaches is the absence of dialogue act modelling. The dialogue act is particularly important in task-oriented dialogues as it determines the general decision towards task completion before a dialogue agent can materialize it into a natural language response (see Table 1).
To tackle the challenges in multi-domain task-oriented dialogues while reducing error propagation among dialogue system modules and keeping the models scalable, we propose UniConv, a unified neural network architecture for end-to-end dialogue systems. UniConv consists of a Bi-level State Tracking (BDST) module which embeds natural language understanding, as it can directly parse dialogue context into a structured dialogue state rather than relying on the semantic frame output from an NLU module in each dialogue turn. BDST implicitly models and integrates slot representations from dialogue contextual cues to directly generate slot values in each turn and thus removes the need for explicit slot tagging features from an NLU. This approach is more practical than the traditional pipeline models as we do not need slot tagging annotation. Furthermore, BDST tracks dialogue states in dialogue context at both slot and domain levels. The output representations from the two levels are combined in a late fusion approach to learn multi-domain dialogue states. Our dialogue state tracker disentangles slot and domain representation learning while enabling deep learning of shared representations of slots common among domains.
UniConv integrates BDST with a Joint Dialogue Act and Response Generator (DARG) that simultaneously models both dialogue acts and generates system responses by learning a latent variable representing dialogue acts and semantically conditioning output response sequences on this latent variable. The multi-task setting of DARG allows our models to model dialogue acts while utilizing the distributed representations of dialogue acts, rather than hard discrete output values from a dialogue policy module, on output response tokens. Our response generator incorporates information from dialogue input components and intermediate representations progressively over multiple attention steps. The output representations are refined after each step to obtain high-resolution signals needed to generate appropriate dialogue acts and responses. We combine both BDST and DARG for end-to-end neural dialogue systems, from input dialogues to output system responses.
We evaluate our models on the large-scale MultiWOZ benchmark (Budzianowski et al., 2018), and compare with the existing methods in DST, context-to-text generation, and end-to-end settings. The promising performance in all tasks validates the efficacy of our method.

Related Work

Dialogue State Tracking
Traditionally, DST models are designed to track states of single-domain dialogues such as the WOZ (Wen et al., 2017) and DSTC2 (Henderson et al., 2014a) benchmarks. There have been recent efforts that aim to tackle multi-domain DST (Ramadan et al., 2018; Lee et al., 2019; Wu et al., 2019a; Goel et al., 2019). These models fall into two main categories. Fixed-vocabulary models (Zhong et al., 2018; Ramadan et al., 2018; Lee et al., 2019) assume a known slot ontology with a fixed candidate set for each slot. Open-vocabulary models (Lei et al., 2018; Wu et al., 2019a; Gao et al., 2019; Ren et al., 2019; Le et al., 2020), on the other hand, derive the candidate set from the source sequence, i.e. the dialogue history itself. Our approach is more related to the open-vocabulary approach as we aim to generate unique dialogue states depending on the input dialogue. Different from previous generation-based approaches, our state tracker can incorporate contextual information into domain and slot representations and explicitly learns dependencies among slots and among domains independently.

Context-to-Text Generation
This task was traditionally solved by two separate dialogue modules: Dialogue Policy (Peng et al., 2017, 2018) and NLG (Wen et al., 2016; Su et al., 2018). Recent work attempts to combine these two modules to directly generate system responses with or without modeling dialogue acts. Zhao et al.

End-to-End Dialogue Systems
In this task, conventional approaches combine Natural Language Understanding (NLU), DST, Dialogue Policy, and NLG into a pipeline architecture (Wen et al., 2017; Bordes et al., 2016; Liu and Lane, 2017; Li et al., 2017; Liu and Perez, 2017; Williams et al., 2017; Zhao et al., 2017). Another line of work does not explicitly modularize these components but incorporates them through a sequence-to-sequence framework (Serban et al., 2016; Lei et al., 2018; Yavuz et al., 2019) and a memory-based entity dataset of triplets (Eric and Manning, 2017; Eric et al., 2017; Madotto et al., 2018; Qin et al., 2019; Gangi Reddy et al., 2019; Wu et al., 2019b). These approaches bypass dialogue state and/or act modeling and aim to generate output responses directly. They achieve impressive success in generating dialogue responses in open-domain dialogues with unstructured knowledge bases. However, in a task-oriented setting with an entity dataset, they might suffer from an explosion of memory size when the number of entities increases, especially when there are multiple datasets from multiple dialogue domains. Our work is more related to the traditional pipeline strategy as we allow our model to explicitly learn dialogue states and acts and enable efficient database search. We integrate our dialogue models by unifying two major components rather than using the traditional four-module architecture, to alleviate error propagation from upstream to downstream components. Similar to our work, Shu et al. (2019) propose a novel end-to-end architecture for task-oriented dialogue systems. Different from that work, our model facilitates multi-domain state tracking and allows learning dialogue acts during response generation.

Method
The input consists of the dialogue context of $t-1$ turns, each including a pair of user utterance $U$ and system response $R$, $(U_1, R_1), ..., (U_{t-1}, R_{t-1})$, and the user utterance at the current turn $U_t$. A task-oriented dialogue system aims to generate the next response $R_t$ that is not only appropriate to the dialogue context but also contains the correct information relevant to the current dialogue domains. The information is typically queried from a database based on the user's provided information, i.e. inform slots tracked by a DST. We assume access to a database of all domains with each column corresponding to a specific slot being tracked. The intermediate outputs are the dialogue state of the current turn, denoted $B_t$, and the dialogue act, denoted $A_t$. We denote the list of all domains $D = (d_1, d_2, ...)$, all slots $S = (s_1, s_2, ...)$, and all acts $A = (a_1, a_2, ...)$. We also denote the list of all (domain, slot) pairs as $DS = (ds_1, ds_2, ...)$. Note that $|DS| \leq |D| \times |S|$ as some slots might not be applicable in all domains. Without loss of generality, given the current dialogue turn $t$, we represent each text input as a sequence of tokens, each of which is a unique token index from a vocabulary set $V$: dialogue context $X_{ctx}$, current user utterance $X_{utt}$, and target system response $X_{res}$. Similarly, we also represent the list of domains as $X_D$ and the list of slots as $X_S$. In DST, following similar approaches in (Lei et al., 2018; Budzianowski and Vulić, 2019), we consider the raw text form of the dialogue state of the previous turn $B_{t-1}$ using the following template: value1 slot1 value2 slot2 ... domain1 value1 slot1 value2 slot2 ... domain2 ... In the context-to-text setting, which assumes access to the ground-truth dialogue states of the current turn $B_t$, we also consider the raw text form using the same template. The dialogue states of the previous and current turns can then be represented as sequences of tokens $X^{prev}_{st}$ and $X^{curr}_{st}$ respectively.
For a fair comparison with current approaches, during inference we use the model-predicted dialogue states $X^{prev}_{st}$ and do not use $X^{curr}_{st}$ in DST and end-to-end tasks. For $X_{res}$, following current approaches of response generation (Wen et al., 2015; Budzianowski et al., 2018), we consider the delexicalized target response as input by replacing tokens of slot values by their corresponding generic tokens to allow learning value-independent parameters. We denote the delexicalized response $X^{dl}_{res}$. Our model consists of three major components (see Figure 1). First, Encoders encode all text input into continuous representations. For consistency, we encode all input with the same embedding dimension. Secondly, our Bi-level State Tracker (BDST) is used to detect contextual dependencies to generate dialogue states. The tracker includes two modules for slot-level and domain-level representation learning. Each module comprises attention layers to project domain or slot representations and incorporate important information from the dialogue context, the dialogue state of the previous turn, and the current user utterance. The outputs of the two modules are combined to create domain-slot joint feature representations. They are used as context-aware vectors to decode the corresponding inform or request slots in each domain. Lastly, our Joint Dialogue Act and Response Generator (DARG) projects the target system response representations and enhances them with information from various dialogue components. Our response generator can also learn a latent representation to generate dialogue acts, which conditions all target tokens during each generation step.
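As a rough illustration of the raw-text state template, the sketch below flattens a nested state into a token sequence. It follows the token order of the template exactly as printed above (value, then slot, then the domain name closing each group); the function name `serialize_state`, the nested-dictionary layout, and the example slot values are assumptions made for illustration, not part of the released code.

```python
def serialize_state(state):
    """Flatten {domain: {slot: value}} into template tokens.

    Assumed order, following the printed template:
    value1 slot1 value2 slot2 ... domain1 value1 slot1 ... domain2 ...
    """
    tokens = []
    for domain, slots in state.items():
        for slot, value in slots.items():
            tokens.extend(str(value).split())  # a value may span several tokens
            tokens.append(slot)
        tokens.append(domain)  # the domain token closes its group of slots
    return tokens


# Hypothetical previous-turn state B_{t-1}
prev_state = {
    "hotel": {"area": "north", "stars": "4"},
    "train": {"departure": "cambridge"},
}
print(" ".join(serialize_state(prev_state)))
```

The resulting token list would then be mapped to vocabulary indices in the same way as the other text inputs.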

Encoders
An encoder encodes a text sequence $X$ to a sequence of continuous representations $Z \in \mathbb{R}^{L_X \times d}$, where $L_X$ is the length of sequence $X$ and $d$ is the embedding dimension. Each encoder includes a token-level embedding layer. The embedding layer is a trainable embedding matrix $E \in \mathbb{R}^{|V| \times d}$. Each row represents a token in the vocabulary set $V$ as a $d$-dimensional vector. We denote $E(X)$ as the embedding function that transforms the sequence $X$ by looking up the respective token index: $Z_{emb} = E(X) \in \mathbb{R}^{L_X \times d}$. We inject the positional attribute of each token as similarly adopted in (Vaswani et al., 2017). The positional encoding is denoted as $PE$. The final embedding is the element-wise summation of the token-embedded representations and the positional encoded representations, followed by layer normalization (Ba et al., 2016), i.e. $Z = \mathrm{LayerNorm}(Z_{emb} + PE(X))$.
The encoder outputs include representations of the dialogue context $Z_{ctx}$, current user utterance $Z_{utt}$, and target response $Z^{dl}_{res}$. We also encode the dialogue states of the previous turn and current turn and obtain $Z^{prev}_{st}$ and $Z^{curr}_{st}$ respectively. We encode $X_S$ and $X_D$ using only the token-level embedding layer: $Z_S = \mathrm{LayerNorm}(E(X_S))$ and $Z_D = \mathrm{LayerNorm}(E(X_D))$. During training, we shift the target response by one position to the left side to allow auto-regressive prediction in each generation step. We share the embedding matrix $E$ to encode all text tokens except for tokens of target responses. We use a separate embedding matrix $E_{res}$ to encode target tokens as the delexicalized outputs contain different semantic dynamics from the original source sequences.
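The encoding step described above (embedding lookup, positional encoding, layer normalization) can be sketched in plain Python. This is a minimal sketch under stated assumptions: the positional encoding is the sinusoidal form of Vaswani et al. (2017), the layer normalization has no learned gain or bias, and the embedding table is a small random stand-in for the trainable matrix; all function names are illustrative.

```python
import math
import random


def positional_encoding(length, d):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    pe = [[0.0] * d for _ in range(length)]
    for pos in range(length):
        for i in range(0, d, 2):
            angle = pos / (10000 ** (i / d))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d:
                pe[pos][i + 1] = math.cos(angle)
    return pe


def layer_norm(row, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) for x in row]


def encode(token_ids, E):
    """Look up embeddings, add positional encodings, then apply layer normalization."""
    d = len(E[0])
    pe = positional_encoding(len(token_ids), d)
    return [
        layer_norm([e + p for e, p in zip(E[tok], pe[pos])])
        for pos, tok in enumerate(token_ids)
    ]


random.seed(0)
vocab_size, d = 10, 8  # toy sizes; the experiments use d = 256
E = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(vocab_size)]
Z = encode([3, 1, 4, 1], E)  # one row per input token
```

Note that the two occurrences of token 1 receive different rows in `Z`, since their positions differ.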

Slot-level DST
We use a transformer-based neural network (Vaswani et al., 2017), consisting of dot-product attention from one representation to another together with skip connections, to integrate dialogue contextual information into each slot representation. We denote by $\mathrm{Att}(Z_1, Z_2)$ the attention from $Z_2$ on $Z_1$. The attention mechanism can be enhanced with a multi-head structure in which $h_{att}$ linear transformation layers are used to project each input representation to different feature spaces. The output representations of equation 5 from all heads are then concatenated as the final output.
Slot Self-Attention. We first enable models to process all slot representations together rather than separately as in previous DST models (Ramadan et al., 2018; Wu et al., 2019a). This strategy allows our models to explicitly learn dependencies between all pairs of slots. Many pairs of slots could exhibit correlation, such as a time-wise relation ("departure_time" and "arrival_time"). We obtain $Z^{dst}_{SS} = \mathrm{Att}(Z_S, Z_S) \in \mathbb{R}^{|S| \times d}$.
Slot→Dialogue Attention. We incorporate the dialogue information by learning dependencies between each slot representation and each token in the dialogue history. Traditional approaches consider all dialogue history as a single sequence, i.e. combining both $X_{ctx}$ and $X_{utt}$. However, we separate them into two inputs because the information in $X_{utt}$ is usually more important to generate responses while $X_{ctx}$ includes more background information. We then attend from the slot representations to $Z_{ctx}$ and $Z_{utt}$ in separate attention layers. Following (Lei et al., 2018), we incorporate the dialogue state of the previous turn $B_{t-1}$, which is a more compact representation of the dialogue context. Hence, we can replace the full dialogue context with only $R_{t-1}$, as the remaining part is represented in $B_{t-1}$. This approach avoids taking in all dialogue history and is scalable as the conversation grows longer. We then add an attention layer over the encoded previous-turn state $Z^{prev}_{st}$.
Using the dialogue state of the previous turn provides a more information-dense input yet requires less memory than processing a full-length dialogue context. We further improve the feature representations by repeating this stack of attention layers $N^{dst}_S$ times. At the end of each round, the representation $Z^{dst}_{S,utt}$ is used as $Z_2$ in equations 1 to 5 in the next attention block. We denote the final output $Z^{dst}_S$.
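To make the attention notation concrete, here is a minimal single-head sketch of Att(Z1, Z2). It reads Att(Z1, Z2) as Z2 supplying the queries and Z1 the keys and values, an interpretation inferred from the output shapes stated in the text (the output has one row per row of Z2). The learned projections, multi-head split, and any normalization are omitted, and the scaling by the square root of d follows Vaswani et al. (2017); this is an assumption-laden illustration, not the authors' exact layer.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def att(z1, z2):
    """Attention from z2 (queries) on z1 (keys and values) with a skip connection.

    z1: list of L1 vectors of size d; z2: list of L2 vectors of size d.
    Returns L2 vectors of size d.
    """
    d = len(z2[0])
    out = []
    for q in z2:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in z1]
        weights = softmax(scores)
        context = [sum(w * v[c] for w, v in zip(weights, z1)) for c in range(d)]
        out.append([qi + ci for qi, ci in zip(q, context)])  # skip connection
    return out


# Toy inputs: 3 slot vectors and 5 utterance-token vectors, d = 4
z_slots = [[0.1 * (i + j) for j in range(4)] for i in range(3)]
z_utt = [[0.2 * (i - j) for j in range(4)] for i in range(5)]

z_ss = att(z_slots, z_slots)  # slot self-attention, shape |S| x d
z_s_utt = att(z_utt, z_ss)    # slot-to-utterance attention, still |S| x d
```

Because the queries always come from the slot side, the slot dimension is preserved through every layer of the stack, which is what allows the rounds to be chained.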

Domain-level DST
We adopt a similar architecture to learn domain-level representations. The representations learned in this module can be seen as capturing global relationships, while slot-level representations contain local dependencies.
Domain Self-Attention. First, our DST can capture dependencies between all pairs of domains. For example, some domains such as "hotel", "attraction", and "taxi" are usually combined in a dialogue episode of travel planning. We then obtain $Z^{dst}_{DD} = \mathrm{Att}(Z_D, Z_D) \in \mathbb{R}^{|D| \times d}$.
Domain→Dialogue Attention. We allow models to capture dependencies between each domain representation and each token in the dialogue context and current user utterance. By segregating the dialogue context and the current utterance, our models can potentially detect changes of dialogue domains from past turns to the current turn. Especially in multi-domain dialogues, users can switch from one domain to another and the next system response should address the latest domain. We then obtain the context-attended and utterance-attended domain representations sequentially. Different from slot-level DST, we do not use the dialogue state of the previous turn as input because we expect global dependencies in domain representations to be easier to detect. Similar to the slot-level module, we refine the feature representations over $N^{dst}_D$ rounds and denote the final output as $Z^{dst}_D$.

Domain-Slot DST
We combine domain and slot representations by expanding the tensors to identical dimensions, i.e. $|D| \times |S| \times d$. We then apply the Hadamard product, resulting in domain-slot joint features $Z^{dst}_{DS} \in \mathbb{R}^{|D| \times |S| \times d}$. We then apply a self-attention layer to allow learning of dependencies between joint domain-slot features. In this attention layer, we mask the intermediate representations in positions of invalid domain-slot pairs. Compared to previous work such as (Wu et al., 2019a), we adopt a late fusion method whereby domain and slot representations are integrated in deeper layers.
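The late-fusion step can be illustrated as follows: each domain vector is multiplied element-wise with each slot vector, giving one joint feature per (domain, slot) pair. In this sketch, invalid pairs are simply zeroed out, which is a simplification of the attention masking described above; the function name `fuse`, the toy vectors, and the `valid` set are hypothetical.

```python
def fuse(z_dom, z_slot, valid):
    """Hadamard-product fusion of domain and slot representations.

    z_dom: |D| x d, z_slot: |S| x d, valid: set of (domain_idx, slot_idx) pairs.
    Returns a |D| x |S| x d nested list; invalid pairs are zero vectors
    (a stand-in for masking inside the subsequent self-attention).
    """
    d = len(z_dom[0])
    fused = []
    for i, zd in enumerate(z_dom):
        row = []
        for j, zs in enumerate(z_slot):
            if (i, j) in valid:
                row.append([a * b for a, b in zip(zd, zs)])
            else:
                row.append([0.0] * d)
        fused.append(row)
    return fused


z_dom = [[1.0, 2.0], [3.0, 4.0]]                # |D| = 2, d = 2
z_slot = [[0.5, 0.5], [2.0, 1.0], [1.0, 0.0]]   # |S| = 3
valid = {(0, 0), (0, 1), (1, 1), (1, 2)}        # which slots apply to which domain
z_ds = fuse(z_dom, z_slot, valid)
```

Each entry `z_ds[i][j]` plays the role of the feature vector later used to decode the value of the corresponding (domain, slot) pair.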

State Generator
The representations $Z^{dst}$ are used as context-aware representations to decode individual dialogue states. Given a domain index $i$ and slot index $j$, the feature vector $Z^{dst}[i, j, :] \in \mathbb{R}^d$ is used to generate the value of the corresponding (domain, slot) pair. The vector is used as an initial hidden state for an RNN decoder to decode an inform slot value. Given the $k$-th (domain, slot) pair and decoding step $l$, the output hidden state in each recurrent step $h_{kl}$ is passed through a linear transformation with softmax to obtain the output distribution over the vocabulary set $V$.
For the request slot of the $k$-th (domain, slot) pair, we pass the corresponding $Z^{dst}$ vector through a linear layer with sigmoid activation to predict a value of 0 or 1.
Optimization. The DST is optimized by the cross-entropy loss functions of inform and request slots.

Joint Dialogue Act and Response Generator
We propose a stacked-attention architecture that sequentially learns dependencies between each token in target responses and each dialogue component representation. First, we obtain $Z^{gen}_{res} = \mathrm{Att}(Z_{res}, Z_{res}) \in \mathbb{R}^{L_{res} \times d}$. This attention layer can learn semantics within the target response to construct a more semantically structured sequence. We then use attention to capture dependencies in background information contained in the dialogue context and user utterance. The outputs are $Z^{gen}_{ctx} = \mathrm{Att}(Z_{ctx}, Z^{gen}_{res}) \in \mathbb{R}^{L_{res} \times d}$ and $Z^{gen}_{utt} = \mathrm{Att}(Z_{utt}, Z^{gen}_{ctx}) \in \mathbb{R}^{L_{res} \times d}$ sequentially.
Response→State and DB Attention. To incorporate information of dialogue states and DB results, we apply attention steps to capture dependencies between each response token representation and the state or DB representation. Specifically, we first obtain $Z^{gen}_{dst} = \mathrm{Att}(Z^{dst}, Z^{gen}_{utt}) \in \mathbb{R}^{L_{res} \times d}$. In the context-to-text setting, as we directly use the ground-truth dialogue states, we simply replace $Z^{dst}$ with $Z^{curr}_{st}$. Then we obtain $Z^{gen}_{db} = \mathrm{Att}(Z_{db}, Z^{gen}_{dst}) \in \mathbb{R}^{L_{res} \times d}$. These attention layers capture the information needed to generate tokens that lead towards task completion and supplement the contextual cues obtained in previous attention layers. We let the model progressively capture these dependencies over $N^{gen}$ rounds and denote the final output as $Z^{gen}$. The final output is passed to a linear layer with softmax activation to decode system responses auto-regressively: $P_{res} = \mathrm{Softmax}(Z^{gen} W_{gen}) \in \mathbb{R}^{L_{res} \times |V_{res}|}$.
Dialogue Act Modeling. We couple response generation with dialogue act modeling by learning a latent variable $Z_{act} \in \mathbb{R}^d$. We place the vector in the first position of $Z_{res}$, resulting in $Z_{res+act} \in \mathbb{R}^{(L_{res}+1) \times d}$. We then pass this tensor to the same stacked attention layers as above.
By adding the latent variable in the first position, we allow our model to semantically condition all downstream tokens from the second position onwards, i.e. all tokens in the target response, on this latent variable. The output representation of the latent vector, i.e. the first row in $Z^{gen}$, incorporates contextual signals accumulated from all attention layers and is used to predict dialogue acts. We denote this representation as $Z^{gen}_{act}$ and pass it through a linear layer to obtain a multi-hot encoded tensor. We apply a sigmoid on this tensor to classify each dialogue act as 0 or 1.
Optimization. The response generator is jointly trained by the cross-entropy loss functions of generated responses and dialogue acts.

Experiments
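The dialogue act head can be sketched in a few lines: a latent vector is prepended to the response representations, and after the stacked attention layers its output row is mapped to one sigmoid score per act. In the sketch below the attention stack is replaced by the identity for brevity, and the weight layout (one weight vector per act), the 0.5 decision threshold, and all names and values are assumptions for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def prepend_act(z_act, z_res):
    """Place the latent act vector in the first position of the response sequence."""
    return [z_act] + z_res


def predict_acts(z_gen, w_act, threshold=0.5):
    """Multi-hot act prediction from the first row of the generator output.

    z_gen: (L_res + 1) x d output of the attention stack.
    w_act: |A| x d weights of a linear layer (one row per act, bias omitted).
    """
    z_gen_act = z_gen[0]
    logits = [sum(z * w for z, w in zip(z_gen_act, row)) for row in w_act]
    return [1 if sigmoid(logit) >= threshold else 0 for logit in logits]


z_act = [1.0, -1.0]                                  # latent act vector, d = 2
z_res = [[0.3, 0.1], [0.0, 0.2]]                     # two response-token vectors
z_gen = prepend_act(z_act, z_res)                    # identity stand-in for the stack
w_act = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.5]]        # three hypothetical acts
acts = predict_acts(z_gen, w_act)                    # multi-hot act vector
```

Rows 1 onwards of `z_gen` would feed the softmax layer over the response vocabulary, so acts and response tokens share the same attention stack.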

Dataset
We used the multi-domain dialogue corpus MultiWOZ 2.0 (Budzianowski et al., 2018) as well as MultiWOZ 2.1 (Eric et al., 2019), which includes corrected state labels for the DST task. From the dialogue state annotation of the training data, we identified all possible domains and slots. We identified $|D| = 7$ domains and $|S| = 30$ slots, including 19 inform slots and 11 request slots. We also identified $|A| = 32$ acts. The corpus includes 8,438 dialogues in the training set and 1,000 each in the validation and test sets. On average, each dialogue has 1.8 domains and extends over 13 turns. The benchmark also includes an entity DB which can be constructed as a self-contained SQL database engine. We detail additional information on data pre-processing procedures, domains, slots, and entity DBs in Appendix A.
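To illustrate how a tracked dialogue state could drive a lookup in a SQL-backed entity DB, the sketch below uses Python's built-in `sqlite3` module with an in-memory table. The table layout, the three rows, and the helper `count_matches` are hypothetical and not taken from the benchmark; the sketch also assumes that a slot value of "dontcare" places no constraint on the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE restaurant (name TEXT, area TEXT, food TEXT, pricerange TEXT)")
conn.executemany(
    "INSERT INTO restaurant VALUES (?, ?, ?, ?)",
    [
        ("a", "north", "indian", "cheap"),       # made-up rows for illustration
        ("b", "north", "chinese", "moderate"),
        ("c", "centre", "indian", "expensive"),
    ],
)


def count_matches(conn, table, inform_slots):
    """Count entities in a trusted table that satisfy the inform-slot constraints."""
    # Column names are interpolated into the SQL, so keep only known columns.
    columns = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    items = [(k, v) for k, v in inform_slots.items() if k in columns and v != "dontcare"]
    where = " AND ".join(f"{k} = ?" for k, _ in items) or "1"
    sql = f"SELECT COUNT(*) FROM {table} WHERE {where}"
    return conn.execute(sql, [v for _, v in items]).fetchone()[0]


n = count_matches(conn, "restaurant", {"area": "north", "food": "indian"})
print(n)  # number of restaurants matching the tracked state
```

Because each DB column corresponds to a tracked slot, the query cost does not grow with the number of memory tuples, in contrast to memory-based approaches.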

Experiment Setup
We select $d = 256$, $h_{att} = 8$, and $N^{dst}_S = N^{dst}_D = N^{gen} = 3$. We employ dropout (Srivastava et al., 2014) of 0.3 and label smoothing (Szegedy et al., 2016) on target system responses during training. We adopt a teacher-forcing training strategy by simply using the ground-truth inputs of the dialogue state of the previous turn and the gold DB representations. During inference in DST and end-to-end tasks, we decode system responses sequentially turn by turn, using the previously decoded state as input in the current turn, and at each turn, using the newly predicted state to query DBs. For the context-to-text generation task, ground-truth dialogue states and DBs are used during both training and inference. We train all networks with the Adam optimizer (Kingma and Ba, 2015) and a learning rate schedule similar to that of (Vaswani et al., 2017). We use batch size 32 and tune warmup_steps from 10K to 15K training steps. All models are trained for up to 30 epochs and the best models are selected based on validation loss. We use a greedy approach to decode all slots, and beam search with beam size 5 and a length penalty of 1.0 to decode responses. To evaluate the models, we use the following metrics: (1)

Results
DST. We test our state tracker (i.e. using only $L_{dst}$) and compare the performance with the baseline models in Table 2 (refer to Appendix B for a description of DST baselines). Our model achieves state-of-the-art performance on the MultiWOZ2.1 corpus. Our model outperforms fixed-vocabulary approaches such as HJST and FJST, showing the advantage of generating unique slot values rather than relying on a slot ontology with a fixed set of candidates. DST Reader does not perform as well, and we note that many slot values are not easily expressed as a text span in source text inputs. DST approaches that separate domain and slot representations, such as TRADE, show competitive performance. However, our approach has better performance as we adopt a late fusion strategy to explicitly obtain more fine-grained contextual dependencies in each domain and slot representation. In this aspect, our model is related to TSCP, which decodes the output state sequence auto-regressively. However, TSCP attempts to learn domain and slot dependencies implicitly, and the model is limited by the choice of maximum output state length (which can vary significantly in multi-domain dialogues).
Context-to-Text Generation. We compare with existing baselines in (Wu et al., 2019b), the restaurant domain alone has over 1,000 memory tuples of (Subject, Relation, Object).
Ablation. We experiment with several model variants in Table 5 and have the following observations: (1) Using a single-level DST (by considering $S = DS$ and $N^{dst}_D = 0$) performs worse than the Bi-level DST. Using the dual architecture also reduces the latency of each attention layer, as typically $|D| + |S| < |DS|$.
(2) Using $B_{t-1}$ and only the last user utterance as the dialogue context performs as well as using $B_{t-1}$ and a full-length dialogue history. Using only the last user utterance, however, reduces the training time as the number of tokens in a full-length dialogue history is much larger than that of a dialogue state (particularly as the conversation evolves over many turns). (3) We note that removing the loss function to learn the dialogue act latent variable can hurt the generation performance. This reveals the benefit of enforcing a semantic condition on each token of the target response to steer the conversation in the right direction for task completion. (4) In both the state tracker and response generator modules, we note that learning feature representations through deeper attention networks can improve the quality of predicted states and system responses. This is consistent with our DST performance as compared to baseline models with shallow networks. (5) Lastly, our model achieves better performance as the number of attention heads increases, by learning more high-resolution dependencies.

Domain-dependent Results
DST. For state tracking, the metrics are calculated for domain-specific slots of the corresponding domain at each dialogue turn. We also report the DST performance separately for multi-domain and single-domain dialogues to evaluate the challenges in multi-domain dialogues and the performance gap relative to single-domain dialogues. From Table 6, our DST performs consistently well in three domains: attraction, restaurant, and train.
However, the performance drops in the taxi and hotel domains, significantly so in the taxi domain. We note that dialogues involving the taxi domain are usually not single-domain but are typically entangled with other domains. Secondly, we observe a significant performance gap of about 10 absolute points between DST performance in single-domain and multi-domain dialogues. State tracking in multi-domain dialogues could hence be further improved to boost the overall performance. We conduct qualitative analysis and the insights can be seen in Appendix C.

A Data Pre-processing
First, we delexicalize each target system response sequence by replacing each matched entity attribute that appears in the sequence with the canonical tag domain_slot. For example, the original target response 'the train id is tr8259 departing from cambridge' is delexicalized into 'the train id is train_id departing from train_departure'. We use the provided entity databases (DBs) to match potential attributes in all target system responses. To construct the dialogue history, we keep the original version of all text, including system responses of previous turns, rather than the delexicalized form. We split all sequences of dialogue history, user utterances of the current turn, dialogue states, and delexicalized target responses into case-insensitive tokens. We share the embedding weights of all source sequences, including dialogue history, user utterance, and dialogue states, but use a separate embedding matrix to encode the target system responses.
We summarize the number of dialogues in each domain in Table 8. For each domain, a dialogue is selected as long as the whole dialogue (i.e. a single-domain dialogue) or part of the dialogue (i.e. in a multi-domain dialogue) involves the domain. For each domain, we also build a set of possible inform and request slots using the dialogue state annotation in the training data. The details of slots and databases in each domain can be seen in Table 9. The DBs of the three domains taxi, police, and hospital are not available as part of the benchmark.
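The delexicalization step described in this appendix can be sketched as a simple value-to-tag substitution. The mapping below contains only the two values from the example sentence in the text; in practice it would be built from the entity DBs. The function name `delexicalize`, the longest-value-first ordering, and the use of word-boundary matching are assumptions of this sketch.

```python
import re


def delexicalize(response, value_to_tag):
    """Replace known entity values in a response with their domain_slot tags."""
    text = response.lower()  # tokens are treated case-insensitively
    # Substitute longer values first so that multi-word values are not split.
    for value in sorted(value_to_tag, key=len, reverse=True):
        pattern = r"\b" + re.escape(value) + r"\b"
        text = re.sub(pattern, value_to_tag[value], text)
    return text


# Hypothetical mapping derived from an entity DB
value_to_tag = {"tr8259": "train_id", "cambridge": "train_departure"}
delex = delexicalize("the train id is tr8259 departing from cambridge", value_to_tag)
print(delex)  # the train id is train_id departing from train_departure
```

A pure string match of this kind cannot tell whether "cambridge" is a departure or a destination, so a real pipeline would need the dialogue annotation to pick the right tag.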

B Baselines
We describe our baseline models in DST, contextto-text generation, and end-to-end dialogue tasks.
B.1 DST

FJST and HJST (Eric et al., 2019). These models adopt a fixed-vocabulary DST approach. Both models include an encoder module (either a bidirectional LSTM or a hierarchical LSTM) to encode the dialogue history. The context hidden states are passed through separate linear transformations, one per slot, and each output vector is used to score every candidate in a predefined candidate set.

DST Reader (Gao et al., 2019). This model treats DST as a reading comprehension task and predicts each slot as a span over tokens within the dialogue history. DST Reader uses attention-based neural networks with additional modules to predict slot type and carryover probability.

TSCP (Lei et al., 2018). The model adopts a sequence-to-sequence framework with a pointer network to generate dialogue states. The source sequence is a combination of the previous system response, the dialogue state of the previous turn, and the current user utterance. To compare with TSCP in a multi-domain task-oriented dialogue setting, we adapt the model to multi-domain dialogues by formulating the dialogue state of the previous turn in the same way as in our models. We report the performance with the maximum length of the output dialogue state sequence L set to 20 tokens (the original default is 8 tokens, but we expect longer dialogue states in the MultiWOZ benchmark).

HyST (Goel et al., 2019). This model combines the advantages of the fixed-vocabulary and open-vocabulary approaches. In the open-vocabulary approach, the set of candidates for each slot is constructed from all word n-grams in the dialogue history. Both approaches are applied to all slots and, depending on their performance on the validation set, the better approach is used to predict each individual slot at test time.

TRADE (Wu et al., 2019a). The model adopts a sequence-to-sequence framework with a pointer network to generate each slot value token by token. The prediction is additionally supported by a slot gating component that decides whether the slot is "none", "dontcare", or "generate". When the gate of a slot is predicted as "generate", the model generates the value for that slot as a natural output sequence.

NADST (Le et al., 2020). This is the current state-of-the-art model on the MultiWOZ2.1 dataset. The model proposes a non-autoregressive approach for dialogue state tracking which enables learning dependencies between domain-level and slot-level representations as well as token-level representations of slot values.

B.2 Context-to-Text Generation

[...] and adapts to the context-to-text generation setting in task-oriented dialogues. All input components, including dialogue state and database state, are transformed into raw text format and concatenated as a single sequence. The sequence is used as input to a pre-trained GPT-2 model which is then fine-tuned with MultiWOZ data.

DAMD (Zhang et al., 2019). This is the current state-of-the-art model for the context-to-text generation task on MultiWOZ 2.1. This approach augments the training data with multiple responses for similar contexts. Each dialogue state is mapped to multiple valid dialogue acts to create additional state-act pairs.
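The state-to-act mapping described for DAMD can be illustrated with a minimal sketch. It is not the implementation of Zhang et al. (2019): the function name `augment_state_act_pairs`, the dictionary-based state format, and the string act labels are all assumptions made for illustration. The sketch groups training turns by dialogue state, collects every act observed with that state, and pairs each turn with all of them.

```python
from collections import defaultdict


def augment_state_act_pairs(turns):
    """Pair every turn's state with all acts observed for that same state.

    `turns` is a list of (state, act) tuples. The state format (dict of
    slot -> value) and the act labels are hypothetical, for illustration only.
    """
    acts_by_state = defaultdict(set)
    for state, act in turns:
        # frozenset of items gives a hashable, order-independent key
        acts_by_state[frozenset(state.items())].add(act)

    augmented = []
    for state, _ in turns:
        # sorted() keeps the output order deterministic
        for act in sorted(acts_by_state[frozenset(state.items())]):
            augmented.append((state, act))
    return augmented


turns = [
    ({"hotel-area": "north"}, "request-price"),
    ({"hotel-area": "north"}, "inform-name"),
    ({"taxi-leave": "17:00"}, "request-dest"),
]
pairs = augment_state_act_pairs(turns)
```

In this toy input, the two turns sharing the state `{"hotel-area": "north"}` each receive both observed acts, so three original pairs grow to five.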

B.3 End-to-End
TSCP (Lei et al., 2018). In addition to the DST task, we evaluate TSCP as an end-to-end dialogue system that performs both DST and NLG. We adapt the model to the multi-domain DST setting as described in Section B.1 and keep the original response decoder. Like the DST component, the response generator of TSCP adopts a pointer network and generates the tokens of the target system response by copying tokens from the source sequences. We test TSCP with two settings of the maximum length of the output dialogue state sequence: L = 8 and L = 20.

HRED-TS (Peng et al., 2019). This model adopts a teacher-student framework to address multi-domain task-oriented dialogues. Multiple teacher networks are trained for different domains, and their intermediate representations of dialogue acts and output responses are used to guide a universal student network. The student network uses these representations to generate responses directly from the dialogue context without predicting dialogue states.

C Qualitative Analysis
We examine an example dialogue from the test data and compare our predicted outputs with the baseline TSCP (L = 20) (Lei et al., 2018) and the ground truth. In Figure 4, both our predicted dialogue state and our system response are more accurate than those of the baseline. Specifically, our dialogue state captures the correct type slot in the attraction domain. Because the dialogue state is predicted correctly, the results queried from the DB are also more accurate, which leads to a better response with the right information (i.e. 'no attraction available').

In Figure 5, we visualize the domain-level and slot-level attention on the user utterance. Important tokens of the text sequence, i.e. 'entertainment' and 'close to', receive higher attention scores. In addition, the domain-level attention shows a potential additional signal from the token 'restaurant', which is also the domain of the previous dialogue turn. The attention also becomes more refined across the layers of the network. For example, in the domain-level processing, the attention in the 4th layer is more clustered around specific tokens of the user utterance than in the 2nd layer.

In Tables 10 and 11, we report the complete output for this example dialogue. Overall, our dialogue agent carries a proper dialogue with the user throughout the dialogue steps. Specifically, our model detects new domains at the dialogue steps where they are introduced, e.g. the attraction domain at the 5th turn and the taxi domain at the 8th turn. The dialogue agent can also detect some of the co-references among the domains. For example, at the 5th turn, the dialogue agent infers the slot area for the new domain attraction because the user mentioned 'close the restaurant'. At later dialogue steps such as the 6th turn, our decoded dialogue state is not correct, possibly due to the incorrectly decoded dialogue state in the previous turn, i.e. the 5th turn.
In Figures 2 and 3, we plot the Joint Goal Accuracy and BLEU metrics of our model by dialogue turn. As expected, the Joint Goal Accuracy tends to decrease as the dialogue history grows. The dialogue agent achieves its highest state tracking accuracy at the 1st turn, and the accuracy gradually drops to zero at later dialogue steps, i.e. the 15th to 18th turns. For response generation, the trend of the BLEU score is less clear. The dialogue agent obtains the highest BLEU score at the 3rd turn, and the score fluctuates between the 2nd and 13th turns.

Figure 4: Example dialogue with the input system response R_{t−1} and current user utterance U_t, and the output state BS_t and system response R_t. Compared with TSCP, our dialogue state and response are more accurate and closer to the ground truth.
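For reference, a per-turn Joint Goal Accuracy of the kind plotted by dialogue turn can be computed as in the following minimal sketch. It assumes the standard definition, in which a turn counts as correct only if the entire predicted dialogue state matches the ground truth. The function name `joint_goal_accuracy_by_turn` and the `(turn_index, predicted_state, gold_state)` input format are hypothetical and not taken from the released code.

```python
from collections import defaultdict


def joint_goal_accuracy_by_turn(examples):
    """Compute Joint Goal Accuracy separately for each turn index.

    `examples` is an iterable of (turn_index, predicted_state, gold_state),
    where each state is a dict of slot -> value (an assumed format).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for turn, predicted, gold in examples:
        total[turn] += 1
        # A turn is correct only if every slot-value pair matches exactly.
        if predicted == gold:
            correct[turn] += 1
    return {turn: correct[turn] / total[turn] for turn in sorted(total)}


examples = [
    (1, {"restaurant-food": "thai"}, {"restaurant-food": "thai"}),
    (1, {"hotel-area": "north"}, {"hotel-area": "north"}),
    (2, {"restaurant-food": "thai"},
        {"restaurant-food": "thai", "restaurant-area": "centre"}),
    (2, {"taxi-leave": "17:00"}, {"taxi-leave": "17:00"}),
]
accuracy = joint_goal_accuracy_by_turn(examples)
```

Because a single wrong or missing slot makes the whole turn incorrect, errors carried over from earlier turns accumulate, which is consistent with the downward trend described above.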