Parallel Interactive Networks for Multi-Domain Dialogue State Generation

Existing multi-domain dialogue state tracking (MDST) models do not fully consider the dependencies between system and user utterances within the same turn and across different turns. In this study, we argue that incorporating these dependencies is crucial for the design of an MDST model and propose Parallel Interactive Networks (PIN) to model them. Specifically, we integrate an interactive encoder to jointly model the in-turn dependencies and cross-turn dependencies. A slot-level context is introduced to extract more expressive features for different slots, and a distributed copy mechanism is utilized to selectively copy words from the historical system utterances or the historical user utterances. Empirical studies demonstrate the superiority of the proposed PIN model.


Introduction
A spoken dialogue system (SDS) is an application that helps users complete their goals efficiently. An SDS usually has a logic engine, called the dialogue manager, which involves two main sub-tasks for determining how the system responds to the user: dialogue state tracking and dialogue policy learning. The task we discuss in this paper is dialogue state tracking, which allows the system to maintain an internal representation of the state of the dialogue as the dialogue progresses.
Dialogue state tracking in a single domain has been extensively studied and has achieved much progress. As a more challenging task, multi-domain dialogue state tracking (MDST) was introduced in (Ramadan et al., 2018) and has attracted much attention in the research community.
Instead of only predicting (slot, value) pairs, in MDST a model is expected to predict a (domain, slot, value) triplet for each slot in each domain. This task is a great challenge not only because of the large ontology, involving 30 slots and more than 4500 values, but also because of the mixed-domain nature of the dialogues and some complex cases involving cross-turn inference. Several models have been proposed for the MDST task and proven successful (Mrksic et al., 2015; Goel et al., 2019; Eric et al., 2019; Lee et al., 2019). Among these models, TRADE achieves the state of the art on the MultiWOZ 2.0 dataset (one of the standard MDST datasets) by encoding the entire dialogue history using a bidirectional GRU and incorporating a soft-gated copy mechanism to generate the values. Inspired by TRADE, we propose to build a more accurate and robust state generator, PIN. The motivation for proposing PIN is two-fold.
One aspect is the interactive nature of dialogues. The interaction between the user and the system is often organized in a question-answering style. It is common in dialogue state tracking that a domain or slot is specified by one party (the user or the system) and the value is then answered by the other. For example, in the dialogue in Figure 1, the user specifies the Restaurant domain, and the system answers with a restaurant name, Curry Garden. As shown in Figure 1, there are two types of dependencies, in-turn dependencies and cross-turn dependencies, both of which contribute to discovering slot-value pairs. It is worth noting that some hard cases involving inference actually rely on cross-turn dependencies (e.g. the dependency between utterances s2 and u3 in Figure 1). Thus, correctly modelling these dependencies can improve slot-value extraction and cross-turn inference. In this work, we build an Interactive Encoder that fully accords with the dependencies expressed in Figure 1 to jointly model the in-turn dependencies and cross-turn dependencies.
The interactive nature of dialogues also implies that the value of a slot tends to be specified predominantly either by the system or by the user. For example, the values of slots involving names, such as Restaurant-name and Hotel-name, are likely to be provided by the system, while the values of slots like Hotel-stay (the number of days to stay) and Hotel-people (the number of people to book for) are usually provided by the user. This observation inspires our design of the distributed copy mechanism, which allows the state generator to copy words from either the historical system utterances or the historical user utterances.
The other aspect is the slot overlapping problem in MDST. Unlike single-domain DST, slot overlapping is common in MDST, and overlapping slots share similar values. For example, both the Restaurant and Hotel domains have a price range slot with the same set of values. Under this condition, a generator that does not consider slot-specific features may mistakenly extract the value of one slot as the value of another. To overcome the slot overlapping problem, we introduce a slot-level context in the state generator.
In summary, we propose a generation-based MDST model that takes into account the interactive nature of dialogues and the slot overlapping problem in MDST. The contributions of this work are as follows.
• We propose an interactive encoding method with two parallel hierarchical recurrent networks which jointly model the in-turn dependencies and cross-turn dependencies.
• We introduce the slot-level context into the state generator to accurately generate the values for overlapping slots.
• We present a distributed copy mechanism to selectively copy words from either the historical system utterances or the historical user utterances.

Problem Statement
In multi-domain dialogue state tracking, the state is usually expressed as a set of (domain, slot, value) triplets.
The domain refers to the topic of the dialogue; for example, the Restaurant domain indicates that the dialogue involves restaurant booking. The slot is an aspect of the user's goals, such as food, area and pricerange in restaurant-booking dialogues. The value is the user's specific interest; for example, the value chinese for the food slot indicates that the user is interested in Chinese food. The dialogue state is maintained so as to track the progress of the dialogue. At each turn, the system generates a system utterance in natural language, and the user responds with some sentences, referred to as the user utterance. The objective of multi-domain dialogue state tracking is to predict the value of each (domain, slot) pair at each turn, given the historical system utterances and user utterances. In this paper, multi-domain dialogue state tracking is treated as a sequence generation task, where each word of a value is generated by a state generator.
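As a concrete illustration (with hypothetical data), the state can be kept as a mapping from (domain, slot) pairs to values and overwritten as each turn is processed:

```python
# Hypothetical illustration: the dialogue state as a mapping from
# (domain, slot) pairs to values, updated as the dialogue progresses.
state = {
    ("restaurant", "food"): "chinese",
    ("restaurant", "area"): "centre",
    ("hotel", "pricerange"): "none",   # value not expressed yet
}

def update_state(state, turn_predictions):
    """Overwrite each (domain, slot) entry with the value predicted at this turn."""
    new_state = dict(state)
    new_state.update(turn_predictions)
    return new_state

# After a turn in which the user asks for an expensive hotel:
state = update_state(state, {("hotel", "pricerange"): "expensive"})
```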

Methodology
In this section, we introduce the proposed PIN model. The model consists of four components: Interactive Encoder, Slot-level Context, Value Generator and Slot Gate. We next describe each component in detail.

Interactive Encoder
Our design of the Interactive Encoder is inspired by the dependencies between the system and user utterances. Specifically, we wish to propose a novel network structure that completely represents the dependencies expressed in Figure 1. Hierarchical recurrent networks with a specific structure are used to construct the Interactive Encoder, as shown in Figure 2. The Interactive Encoder consists of two parallel hierarchical recurrent networks, one for historical system utterance encoding and the other for historical user utterance encoding. The lower layer of each hierarchical recurrent network allows each word to capture the cross-turn dependencies, and the higher layer allows each word to capture the in-turn dependencies. In this way, the cross-turn dependencies and in-turn dependencies are jointly modeled. We now present the details of the Interactive Encoder. Let A_l = {a_1, a_2, ..., a_m} denote the sequence of word embeddings of the l-th system utterance and U_l = {u_1, u_2, ..., u_n} the sequence of word embeddings of the l-th user utterance. Here m and n denote the number of words in the l-th system utterance and user utterance, respectively.
For later use, we introduce the notation GRE(X, h; W) to indicate a bi-directional GRU encoder (Chung et al., 2014) with input X (a sequence of vector representations, such as word embeddings), parameters W and initial hidden state h. The Interactive Encoder jointly models the cross-turn dependencies and in-turn dependencies through the following recurrent process.
The Interactive Encoder first lets the input word embedding sequences A_l and U_l interact with the historical context, allowing the words to capture cross-turn dependencies:

G^a_l, g^a_l = GRE(A_l, h^a_{l-1}; W_a),
G^u_l, g^u_l = GRE(U_l, h^u_{l-1}; W_u),

where W_a and W_u are the parameters of the GRUs, and the initial hidden states h^a_{l-1} and h^u_{l-1} are respectively the system context vector and the user context vector generated from the last turn. G^a_l and g^a_l denote the entire sequence of output vectors and the last output vector of the GRU, respectively (and likewise G^u_l and g^u_l).
The outputs of the lower-layer GRUs are then fed into the higher-layer GRUs to interact with the current context, capturing in-turn dependencies:

D^a_l, h^a_l = GRE(G^a_l, g^u_l; M_a),
D^u_l, h^u_l = GRE(G^u_l, g^a_l; M_u),

where M_a and M_u are the parameters of the higher-layer GRUs, D^a_l and D^u_l are the output sequences, and h^a_l and h^u_l are the generated system context vector and user context vector of the current turn. Initializing each higher-layer GRU with the last output of the other channel lets the system and user utterances of the same turn interact. h^a_l and h^u_l are then fed into the lower-layer GRUs as the initial hidden states of the next turn.
With this recurrent architecture, the Interactive Encoder captures the dependencies of the entire dialogue history by rolling from the first turn to the current turn. At the beginning of a dialogue, we simply set the initial hidden states to zero vectors, that is, h^a_0 = h^u_0 = 0. The per-turn output sequences are then concatenated into the system context sequence D^a = [D^a_1; D^a_2; ...; D^a_L] and the user context sequence D^u = [D^u_1; D^u_2; ...; D^u_L], where L is the number of turns. Here M and N denote the total number of words in the historical system utterances and the historical user utterances, respectively, so D^a contains M output vectors and D^u contains N.
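The recurrence above can be sketched as follows. This is a structural sketch rather than the paper's implementation: a plain tanh recurrence stands in for the bi-directional GRU (a real model would use e.g. torch.nn.GRU), and the cross-channel initialization of the higher layer is an assumption about the in-turn wiring.

```python
import numpy as np

def gre(X, h0, W):
    """Toy stand-in for the encoder GRE(X, h; W): returns the full output
    sequence and the last output vector."""
    outputs, h = [], h0
    for x in X:
        h = np.tanh(W @ np.concatenate([x, h]))
        outputs.append(h)
    return np.stack(outputs), h

def interactive_encode(system_turns, user_turns, d, rng):
    """Two parallel hierarchical recurrent channels: W_a/W_u parameterize the
    lower layer and M_a/M_u the higher layer (all randomly initialized here)."""
    W_a, W_u, M_a, M_u = (rng.standard_normal((d, 2 * d)) * 0.1 for _ in range(4))
    h_a = h_u = np.zeros(d)                      # h^a_0 = h^u_0 = 0
    sys_ctx, usr_ctx = [], []
    for A_l, U_l in zip(system_turns, user_turns):
        # Lower layer: initialized with the previous turn's context vectors,
        # so each word can see the dialogue history (cross-turn dependencies).
        G_a, g_a = gre(A_l, h_a, W_a)
        G_u, g_u = gre(U_l, h_u, W_u)
        # Higher layer: each channel re-reads its words starting from the other
        # channel's current-turn summary (in-turn dependencies; assumed wiring).
        D_a, h_a = gre(G_a, g_u, M_a)
        D_u, h_u = gre(G_u, g_a, M_u)
        sys_ctx.append(D_a)
        usr_ctx.append(D_u)
    # Concatenate the per-turn outputs into the full context sequences.
    return np.concatenate(sys_ctx), np.concatenate(usr_ctx)
```

The returned sequences have one row per word of the historical system and user utterances, matching the M- and N-word context sequences described above.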

Slot-level Context
The purpose of applying the slot-level context here is to strengthen the context representation with slot-specific features and deal with the slot overlapping problem. We simply employ the attention mechanism to construct the slot-level context. Specifically, for each (domain, slot) pair, we introduce an embedding vector v_s. The slot-level system context c^a_s and the slot-level user context c^u_s are computed by attending over the system context sequence D^a and the user context sequence D^u produced by the Interactive Encoder:

c^a_s = Σ_i softmax_i(D^a v_s) D^a_i, c^u_s = Σ_j softmax_j(D^u v_s) D^u_j.

The slot-level context of the entire dialogue history is then simply the summation of the slot-level system context and the slot-level user context,

c_s = c^a_s + c^u_s.

The slot-level context is then fed into the Value Generator as the initial hidden state of the decoder GRU.
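The attention step can be sketched as follows; dot-product scoring is an assumption here, since the exact scoring function is not reproduced in this section:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def slot_level_context(v_s, D_a, D_u):
    """Slot-level context for one (domain, slot) pair with embedding v_s.
    D_a, D_u: the system and user context sequences (one row per word).
    Dot-product attention is an assumed scoring function."""
    c_a = softmax(D_a @ v_s) @ D_a   # slot-level system context
    c_u = softmax(D_u @ v_s) @ D_u   # slot-level user context
    return c_a + c_u                 # summed context c_s, initializes the decoder
```

Because each channel's attention weights sum to one, the result is a convex combination of context vectors from each channel, summed across the two channels.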

Value Generator
The Value Generator takes the slot-level context as input and uses a GRU decoder to generate the value sequence for each (domain, slot) pair. Different from the copy mechanism applied in TRADE, which copies words from the entire dialogue history, we propose a distributed copy mechanism that allows the state generator to copy words from different sequences. The architecture of the Value Generator is shown in Figure 3. We now describe it in detail.
We use the abbreviation GRD to denote the GRU decoder. At the t-th decoding step, the hidden state of the GRU decoder for each (domain, slot) pair s is

o^t_s = GRD(x^t_s, o^{t-1}_s; W_d),

where x^t_s is the input at the t-th step, o^t_s is the hidden state at the t-th step and W_d denotes the parameters of the GRU decoder. The hidden state of the GRD for each slot is initialized with the corresponding slot-level context c_s, and the first input x^0_s is set as the summation of the corresponding domain embedding and slot embedding.
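A single decoding step can be sketched with a toy recurrent cell standing in for the GRU (a real implementation would use e.g. torch.nn.GRUCell; all names below are illustrative):

```python
import numpy as np

def grd_step(x_t, o_prev, W_d):
    """One decoder step o_t = GRD(x_t, o_{t-1}; W_d); a tanh cell stands in
    for the GRU to keep the sketch light."""
    return np.tanh(W_d @ np.concatenate([x_t, o_prev]))

def init_decoder(c_s, domain_emb, slot_emb):
    """The hidden state starts from the slot-level context c_s and the first
    input is the sum of the domain and slot embeddings."""
    return c_s, domain_emb + slot_emb
```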
We then introduce three distributions over the vocabulary, P^v_{s,t}, P^a_{s,t} and P^u_{s,t}, for applying the distributed copy mechanism. The three distributions represent the probabilities of generating a word from the vocabulary, copying a word from the historical system utterances and copying a word from the historical user utterances, respectively. Let e_i be the embedding of the i-th word in the vocabulary, E the matrix stacking these embeddings and |V| the vocabulary size. We use P_{s,t}[i] to denote the i-th element of P_{s,t}. The three distributions are computed by

P^v_{s,t} = softmax(E o^t_s),
P^a_{s,t} = f(softmax(D^a o^t_s)),
P^u_{s,t} = f(softmax(D^u o^t_s)),

where D^a and D^u are the system and user context sequences, and the function f maps a distribution over the dialogue-history positions to the corresponding distribution over the vocabulary by accumulating the probability of each position onto the word occurring at that position. The three distributions P^v_{s,t}, P^a_{s,t} and P^u_{s,t} are then combined by learnable weights. We define α_{s,t} as the weight of generating from the vocabulary and β_{s,t} as the weight of choosing to copy a word from the system utterances. For calculating the weights α_{s,t} and β_{s,t}, we first generate new feature vectors

q^a_{s,t} = softmax(D^a o^t_s)^T D^a, q^u_{s,t} = softmax(D^u o^t_s)^T D^u.

The weights α_{s,t} and β_{s,t} are then computed by

α_{s,t} = σ(W_v [o^t_s; x^t_s; q^a_{s,t}; q^u_{s,t}]), β_{s,t} = σ(W_c [o^t_s; x^t_s; q^a_{s,t}; q^u_{s,t}]),

where W_v and W_c are the parameters of the linear functions, [;] denotes concatenation and σ denotes the logistic function. The final distribution P_{s,t} is calculated as the weighted sum of the distributions P^v_{s,t}, P^a_{s,t} and P^u_{s,t}:

P_{s,t} = α_{s,t} P^v_{s,t} + (1 - α_{s,t}) (β_{s,t} P^a_{s,t} + (1 - β_{s,t}) P^u_{s,t}).

The t-th word of the value for the (domain, slot) pair s is then generated from the distribution P_{s,t}. The embedding of the generated word is used as the next input of the GRU decoder. This generation procedure allows the state generator to generate words from the vocabulary or to copy words from either the historical system utterances or the historical user utterances.
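The final mixture can be sketched directly. The closed form below follows the weighting described above, with α weighting generation against copying and β choosing the system channel over the user channel; this reading of the combination is an assumption.

```python
import numpy as np

def combine_distributions(P_v, P_a, P_u, alpha, beta):
    """Distributed copy mixture over the vocabulary:
    P = alpha * P_v + (1 - alpha) * (beta * P_a + (1 - beta) * P_u),
    where P_v generates from the vocabulary and P_a / P_u copy from the
    historical system / user utterances (already mapped onto the vocabulary)."""
    return alpha * P_v + (1 - alpha) * (beta * P_a + (1 - beta) * P_u)
```

Because the three inputs are probability distributions and the weights form a convex combination, the result is again a distribution over the vocabulary.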

Slot Gate
Following TRADE, we introduce the slot gate to predict the special values none (the value of the slot has not been expressed yet) and dontcare (the user does not care about the slot) for each (domain, slot) pair. Specifically, the slot gate is a three-class classifier that identifies whether none, dontcare or some other value is expressed in the context, through a softmax classifier:

P^g_s = softmax(W_s c_s),

where W_s is the parameter of the softmax classifier and c_s is the slot-level context. For a (domain, slot) pair, if the output of the slot gate is none or dontcare, the word sequence produced by the state generator is ignored and the predicted class of the slot gate is chosen as the value. Otherwise, the generated word sequence is the predicted value for the (domain, slot) pair.
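A minimal sketch of the gate, assuming a plain linear-softmax head over the slot-level context:

```python
import numpy as np

GATE_CLASSES = ("none", "dontcare", "other")

def slot_gate(W_s, c_s):
    """Three-way classifier over the slot-level context c_s (sketch)."""
    logits = W_s @ c_s
    e = np.exp(logits - logits.max())
    probs = e / e.sum()
    return GATE_CLASSES[int(np.argmax(probs))], probs
```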

Loss Function and Optimization
The cross-entropy loss is used to optimize the Value Generator and the Slot Gate simultaneously. Let S be the set of all (domain, slot) pairs and T_s the number of words in the value for slot s ∈ S. We define y^c_s as the ground-truth one-hot label vector of the slot gate and y^v_{s,t} as the one-hot representation of the t-th word in the value of s. The loss function is then defined as

L = Σ_{s∈S} ( -y^c_s · log P^g_s + Σ_{t=1}^{T_s} -y^v_{s,t} · log P_{s,t} ),

where P^g_s and P_{s,t} are the slot-gate distribution and the t-th word distribution of the Value Generator, respectively. The loss function can be optimized by the stochastic gradient descent (SGD) method.
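The summed cross-entropy can be sketched with plain Python, representing distributions as lists and labels as class indices (all data here is illustrative):

```python
import math

def joint_loss(gate_probs, gate_labels, word_probs, word_labels):
    """Summed cross-entropy over the slot gate and each decoded word,
    accumulated across all (domain, slot) pairs s (sketch)."""
    loss = 0.0
    for s in gate_probs:
        loss += -math.log(gate_probs[s][gate_labels[s]])        # slot-gate term
        for P_t, y_t in zip(word_probs[s], word_labels[s]):     # value terms
            loss += -math.log(P_t[y_t])
    return loss
```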

Implementation Details
The proposed model is implemented using the PyTorch deep learning framework. All the word embeddings in the vocabulary are initialized with the concatenation of the pre-trained GloVe embeddings (Pennington et al., 2014) and character n-gram embeddings (Hashimoto et al., 2017). We tune the hyper-parameters of the models by grid search on the validation set. The batch size is set to 32. The dimensions of the hidden states in all GRUs are set to 400. Embedding dropout is used in the Interactive Encoder with a dropout rate of 0.3. Following (Bowman et al., 2016), we also adopt word dropout in the Interactive Encoder to improve the model generalization; its rate is also set to 0.3. At training time, the Value Generator uses teacher forcing (Williams and Zipser, 1989) with a probability of 0.5. Greedy search (Vinyals and Le, 2015) is used in the decoding process. We use the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 0.001 to optimize the model.

Evaluation Metrics
The standard metrics joint goal accuracy and goal accuracy are used to evaluate the performance of multi-domain dialogue state tracking. The joint goal accuracy is the proportion of dialogue turns in which the values of all the (domain, slot) pairs are correctly predicted, while the goal accuracy is the proportion of correctly predicted (domain, slot, value) triplets over the whole test set.
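The two metrics can be computed as follows, with predicted and gold states represented as per-turn dictionaries (a sketch with hypothetical data):

```python
def joint_goal_accuracy(pred_states, gold_states):
    """Fraction of turns where every (domain, slot) value matches exactly."""
    correct = sum(p == g for p, g in zip(pred_states, gold_states))
    return correct / len(gold_states)

def goal_accuracy(pred_states, gold_states):
    """Fraction of individual (domain, slot, value) triplets predicted correctly."""
    total = correct = 0
    for p, g in zip(pred_states, gold_states):
        for key, value in g.items():
            total += 1
            correct += (p.get(key) == value)
    return correct / total
```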

Experimental Results
Evaluation on the MultiWOZ 2.0 dataset. The evaluation results on the MultiWOZ 2.0 dataset are shown in Table 1. We observe from the table that most of the models building classifiers on a predefined ontology, as well as the models generating the dialogue states purely through a copy mechanism, are inferior to the models utilizing both classifiers and a copy system. As mentioned in (Eric et al., 2019), models built upon a copy system have an advantage in extracting values from the dialogue history but struggle to predict values that do not exist in the dialogue history. Thus it is reasonable that models combining a copy system with state classifiers achieve better performance. Compared with the state-of-the-art model TRADE, PIN achieves a significant 3.82% performance gain. This fact demonstrates that the modeling of the interaction dependencies, the slot-level context and the distributed copy mechanism are beneficial for improving state generation. On the MultiWOZ 2.1 dataset, the PIN model outperforms the previous models by a significant margin except for the DST-Picklist model, which indicates the effectiveness of the model design. Although DST-Picklist achieves the state of the art, it takes considerable human effort to divide the slots into span-based and picklist-based slots.

Evaluation on the Overlapping Slots
In multi-domain dialogue state tracking, domains may have overlapping slots. One of the motivations for building the PIN model is to handle the slot overlapping problem with the slot-level context. We therefore report the goal accuracy on each overlapping slot in Table 3 for further analysis of PIN. Table 3 shows that slot overlapping usually appears among similar domains, such as (Restaurant, Hotel, Attraction) and (Train, Taxi). The PIN model achieves higher goal accuracy than TRADE on all overlapping slots. This result demonstrates the effectiveness of the slot-level context in extracting distinctive features for each slot, so that the values of overlapping slots are correctly predicted.

In order to study the function of the Interactive Encoder in handling the in-turn and cross-turn dependencies in MDST, we use a subset of the test data for error analysis. We first sample 100 dialogue turns from the test set of the MultiWOZ 2.1 dataset. The wrongly predicted (domain, slot, value) triplets of the TRADE and PIN models are then selected, and each triplet is marked according to the dependencies (in-turn or cross-turn) it involves in the dialogue. The statistics of these prediction errors are shown in Figure 4. From the figure, we observe that, in dialogue turns involving either in-turn or cross-turn dependencies, the PIN model makes far fewer prediction errors than the TRADE model, especially in the hotel and restaurant domains. These results demonstrate the effectiveness of the Interactive Encoder in capturing the in-turn and cross-turn dependencies.

The Function of the Distributed Copy Mechanism
Unlike the traditional copy mechanism, which only copies words from a single sequence, the distributed copy mechanism in the PIN model can copy words from two separate sequences, reflecting the interactive nature of dialogues. The example in Figure 5 shows a case where the traditional copy mechanism makes a wrong prediction while the distributed copy mechanism predicts correctly. The dialogue in Figure 5 is sampled from the Restaurant domain in the test set. In this example, we want to predict the value of the food slot. Since the weight α = 0.37, the generator has a higher probability of copying a word from the dialogue history than of generating one from the vocabulary. If we ignore the weight β, which determines whether to copy from the historical system utterances or from the historical user utterances, the generator copies the wrong word reservation from the entire dialogue history, because reservation has a higher copy probability (0.668) than european (0.507). This wrong prediction would occur in a traditional copy-based model. In PIN, however, the word to be copied also depends on the sequence-selection weight β. With a probability of 0.967 of copying the word from the historical user utterances, the correct value european is copied according to Equation 9. This case demonstrates the effectiveness of the distributed copy mechanism.
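The arithmetic of this decision can be checked directly. The sketch below assumes the mixture P = α·P_v + (1-α)·(β·P_a + (1-β)·P_u) and reads the reported 0.967 as the weight 1-β of copying from the user utterances:

```python
# Reproducing the Figure 5 decision with the probabilities reported above.
alpha = 0.37                  # weight of generating from the vocabulary
beta = 1 - 0.967              # weight of copying from the system utterances

p_reservation = (1 - alpha) * beta * 0.668        # system-channel copy mass
p_european = (1 - alpha) * (1 - beta) * 0.507     # user-channel copy mass
```

Under this weighting, the user-channel mass on "european" dominates the system-channel mass on "reservation" despite the latter's higher raw copy probability.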

Related Works
The dialogue state tracking (DST) problem has attracted the research community for years. Traditional single-domain dialogue state tracking, which focuses on predicting dialogue states in a specific domain, has been studied intensively and has achieved remarkable success (Wang and Lemon, 2013; Liu and Perez, 2017; Jang et al., 2016; Shi et al., 2016; Vodolán et al., 2017; Yu et al., 2015; Henderson et al., 2014; Zilka and Jurcícek, 2015; Mrksic et al., 2017; Xu and Hu, 2018; Zhong et al., 2018; Ren et al., 2018). Some of these models solve the DST problem by incorporating a natural language understanding (NLU) module (Wang and Lemon, 2013) or by jointly modeling NLU and DST (Henderson et al., 2014; Zilka and Jurcícek, 2015). These models rely on hand-crafted features or delexicalisation features, which make them difficult to scale to realistic applications. A representation learning approach was then employed in NBT (Mrksic et al., 2017). Following this approach, GLAD (Zhong et al., 2018) utilizes recurrent neural networks (RNNs) and self-attention to handle the rare-slot-value problem, and StateNet (Ren et al., 2018) proposes multi-scale receptors to extract semantic features and uses an LSTM for state tracking. The generation-based approach to dialogue state tracking was first adopted in PtrNet (Xu and Hu, 2018) to handle the unknown-value problem; it utilizes pointer networks (Vinyals et al., 2015) to predict the indexes of the value sequence. For multi-domain state tracking, the first work involving state tracking in mixed domains is (Mrksic et al., 2015), which proposes a pre-training procedure to improve the performance on a new domain. The work of (Rastogi et al., 2017) uses a bi-directional GRU to extract features and predicts values with a candidate scoring model. The MDBT model (Ramadan et al., 2018) applies multiple bi-directional LSTMs to jointly track the domain and the states.
It adopts semantic similarity between the ontology and the utterances and allows parameter sharing across domains. The HyST model (Goel et al., 2019) combines a classification-based system and an n-gram copy-based system to deal with the multi-domain dialogue state tracking problem. The FJST and HJST models presented in (Eric et al., 2019) employ a flat-structured LSTM and a hierarchical-structured LSTM, respectively, to encode the dialogue history. The TRADE model combines a soft-gated copy mechanism for generating states with a slot gate for classifying the special values of each slot.

Conclusion
This paper studies the problem of state generation for multi-domain dialogues. Existing generation-based models fail to model the dialogue dependencies and ignore the slot-overlapping problem in MDST. To overcome these limitations, we present the novel Parallel Interactive Networks (PIN) for more accurate and robust dialogue state generation. The design of the PIN model is inspired by the interactive nature of dialogues and the overlapping slots in the ontology. The cross-turn dependencies and the in-turn dependencies are characterized by the Interactive Encoder. The slot-overlapping problem is addressed by introducing the slot-level context. Furthermore, a distributed copy mechanism is introduced to perform a selective copy from either the historical system utterances or the historical user utterances. Empirical studies on two benchmark datasets demonstrate the effectiveness of the PIN model.

Figure 5: An example dialogue and the prediction of PIN. The red color represents the copy probability of each word: the copy probability of the word reservation in the system utterances is 0.668, and the copy probability of the word european in the user utterances is 0.507.