SAS: Dialogue State Tracking via Slot Attention and Slot Information Sharing

A dialogue state tracker is responsible for inferring user intentions from the dialogue history. Previous methods have difficulty handling dialogues with long interaction contexts because of the excessive information they contain. We propose a Dialogue State Tracker with Slot Attention and Slot Information Sharing (SAS) to reduce the interference of redundant information and to improve tracking over long dialogue contexts. Specifically, we first apply a slot attention to learn a set of slot-specific features from the original dialogue, and then integrate them using a slot information sharing module. Our model yields significantly improved performance compared to previous state-of-the-art models on the MultiWOZ dataset.


Introduction
The recent global adoption of personal assistants such as Alexa and Siri has made dialogue systems a popular research topic. The major difference between dialogue systems and question answering is that dialogue systems need to track the dialogue history effectively, so a dialogue state tracking component is normally used to track the user's intention throughout the conversation. In a task-oriented dialogue, a dialogue state is typically composed of a set of slot-value pairs, such as "hotel-internet-yes", which means the slot "hotel-internet" has the value "yes".
Early dialogue state tracking models need a predefined ontology in which the values of every slot are enumerated in advance (Henderson et al., 2014; Mrkšić et al., 2017; Zhong et al., 2018; Sharma et al., 2019). Such a practice is inefficient and costly: the large number of possible slot-value pairs makes deploying these models in real-life applications difficult (Rastogi et al., 2017). This difficulty is further amplified in multi-domain dialogue state tracking, where a dialogue involves more than one task and the manual effort grows exponentially with the complexity of the dialogues. Wu et al. (2019) introduced a transferable dialogue state generator (TRADE), which can generate dialogue states from utterances using a copy mechanism. This generative model achieved relatively good performance, but it still has trouble extracting relevant information from the original dialogues. For example, a user may tell the agent in one turn that he/she needs a taxi, while the taxi's departure location was implicitly mentioned several turns earlier. Inspired by (Chen et al., 2017; Chen, 2018), Chen et al. (2019) studied using an attention mechanism to deal with the long-distance slot carryover problem. In their work, they first fused the slot, its corresponding value, and the dialogue distance into a single vector, and then computed the attention between this vector and the concatenation of dialogue and intent information. We simplify this attention method and introduce it into the dialogue state tracking task.
Moreover, it is common sense that slots involving the same domain or the same attribute are related. For example, people tend to have a meal near the attraction they visit, so the slots "attraction-area" and "restaurant-area" have the same value most of the time. For slots with a common or related value, if a slot never or seldom appears in the training set, sharing the features learned for a data-sufficient slot may improve the model's tracking ability on these rare or unknown slots.
We therefore propose SAS, a new multi-domain dialogue state tracking model, to resolve this issue to some extent. Specifically, we use a slot attention to localize the key features in the original, information-excessive dialogue and a slot information sharing module to improve the model's ability to deduce values from related slots. The processed information provided by the slot attention and the sharing module makes the generator more sensitive to the location of the values in the dialogue history, and thus helps it generate correct slot values. Experiments on the multi-domain MultiWOZ dataset show that SAS achieves 51.03% joint goal accuracy, outperforming the previous state-of-the-art model by 2.41%. On the single-domain dataset that only contains the restaurant domain, we achieve 67.34% joint goal accuracy, outperforming the prior best by 1.99%. In addition, we analyze the experimental results to evaluate the quality of the values generated by our model.

Related Work
Early research on DST focused on the pipelined approach, which involves a special module named Spoken Language Understanding (SLU) before the DST module (Wang and Lemon, 2013; Williams, 2014; Perez and Liu, 2017). But it is clearly not ideal to train SLU and DST separately, since the error accumulated in SLU may be passed on to the DST. To alleviate this problem, later studies focused on joint training methods (Henderson et al., 2014; Zilka and Jurcicek, 2015). Although their higher performance shows the effectiveness of models without SLU, some shortcomings remain. For example, these models typically rely on semantic dictionaries that list the potential rephrasings of all slots and values in advance, and making such a list is costly. Fortunately, recent developments in deep learning and representation learning free DST from this problem. (Mrkšić et al., 2017) proposed a novel Neural Belief Tracking (NBT) framework that learns distributed representations of the dialogue context over pre-trained word vectors, while (Dernoncourt et al., 2017) described a tracking method that uses elaborate string matching and coreference resolution to detect values explicitly presented in the utterance. These models greatly improve the performance of DST, but they are not good at handling rare and unknown slot-value pairs that seldom or never appear in the training set.
There have been many efforts to exploit features shared between rare slot-value pairs and common ones. (Zhong et al., 2018) proposed GLAD, a model with global modules that share parameters between the estimators for different slots and local modules that learn slot-specific features. (Nouri and Hosseini-Asl, 2018) improved GLAD by reducing its training and inference latency while preserving its strong tracking performance. But as dialogues become increasingly complex, the performance of these models on multi-domain dialogues is not as satisfying as on a single domain. Because of their dependency on the dialogue ontology, they have difficulty scaling up with the number of domains: once the number of domains increases, the number of slot-value pairs explodes. With its copy mechanism, the sequence-to-sequence model TRADE (Wu et al., 2019) successfully got rid of any predefined slot-value pairs and generates dialogue states directly from conversation utterances.
However, we find that several crucial limitations remain unsolved on multi-domain dialogues. First, these models rely on the long dialogue history to identify values belonging to various domains and slots. Sometimes the information contained in the dialogue history is too rich for them to utilize efficiently, and the redundant information tends to interfere with value identification or value generation. Second, the related information among similar slots is wasted. To alleviate these problems, we propose a slot attention and a slot information sharing module. The former isolates the most valuable information for each slot, while the latter integrates the information kept by all of a slot's similar slots and improves the model's ability to deduce values from related slots.

Task Definition
Dialogue state tracking models take the interaction context as input and extract the slot-value pairs explicitly or implicitly presented in the conversation. The combination of these slot-value pairs represents the user's goal. In this paper, we denote the dialogue history as X = {(u_1, r_1), · · · , (u_T, r_T)}, where u_1, · · · , u_T and r_1, · · · , r_T are respectively the user utterances and the system responses over T turns. The dialogue state at turn t is marked as ST_t = {(slot: s_j, value: y_j^value)}, where s_j is the j-th slot and y_j^value is the ground-truth value sequence for this slot. All the slots in the ontology are obtained by preprocessing the original MultiWOZ dataset with delexicalization. Moreover, we extend the definition of a slot to include the domain name for convenience; for instance, a slot in this paper is "hotel-star" rather than "star". Our primary goal is to learn a generative dialogue state tracker M : X × O → ST that can efficiently capture the user's intentions in dialogues spanning multiple domains. Unlike most previous models, the ontology O in this paper only contains the predefined slots and excludes their values.

Figure 1 shows the architecture of SAS. SAS is a sequence-to-sequence model augmented with slot attention and slot information sharing: slot attention enables better feature representation, and slot information sharing helps the model understand less-seen slots. We describe each component of SAS below.

Encoder
We use a 1-layer bidirectional gated recurrent unit (GRU) (Chung et al., 2014) to encode the dialogue history. As in TRADE (Wu et al., 2019), the input to our model is the concatenation of all words in the recent l-turn dialogue history X_t = [u_{t−l+1}, r_{t−l+1}, · · · , u_t, r_t] ∈ R^{|X_t|×d_emb}, where d_emb is the embedding size. First, each word in the dialogue history X_t is mapped to a distributed embedding vector. Then the GRU is used to obtain the hidden state corresponding to each word in the text, and we denote these hidden states as the history hidden states H_t = [h^enc_1, · · · , h^enc_{|X_t|}].
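As a concrete sketch of this encoding step, the following NumPy code implements a forward-only bidirectional GRU over pre-computed word embeddings. The weight initialization and the choice of summing the two directions' states (rather than concatenating them) are our own assumptions for illustration, not details given in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal forward-only GRU cell (Chung et al., 2014)."""
    def __init__(self, d_in, d_hid, seed=0):
        rng = np.random.default_rng(seed)
        s = 0.1
        # update gate z, reset gate r, candidate state: input (W) and recurrent (U) weights
        self.Wz, self.Uz = rng.normal(0, s, (d_hid, d_in)), rng.normal(0, s, (d_hid, d_hid))
        self.Wr, self.Ur = rng.normal(0, s, (d_hid, d_in)), rng.normal(0, s, (d_hid, d_hid))
        self.Wh, self.Uh = rng.normal(0, s, (d_hid, d_in)), rng.normal(0, s, (d_hid, d_hid))

    def step(self, x, h):
        z = sigmoid(self.Wz @ x + self.Uz @ h)
        r = sigmoid(self.Wr @ x + self.Ur @ h)
        h_tilde = np.tanh(self.Wh @ x + self.Uh @ (r * h))
        return (1 - z) * h + z * h_tilde

def encode(embeddings, fwd, bwd):
    """Bidirectional encoding: run a GRU in each direction and sum the states."""
    d = fwd.Wz.shape[0]
    hf, hb = np.zeros(d), np.zeros(d)
    H_f, H_b = [], []
    for x in embeddings:                  # left-to-right pass
        hf = fwd.step(x, hf)
        H_f.append(hf)
    for x in reversed(embeddings):        # right-to-left pass
        hb = bwd.step(x, hb)
        H_b.append(hb)
    H_b.reverse()
    # H_t: one d_hdd-dimensional state per word in the dialogue history
    return np.stack([f + b for f, b in zip(H_f, H_b)])
```

The same `GRUCell` can be reused to encode slot-name phrases, whose last hidden state serves as the slot representation in the next section.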

Slot Attention
To isolate the key features from the noisy dialogue history, we build the slot attention. Multi-domain dialogues are usually complex and contain rich features, which challenges the model's ability to cope with excessively rich information.
Specifically, in one dialogue a user can mention various pieces of information, such as wanting to book a restaurant for a meal and then planning to see an attraction after the meal by ordering a taxi. In total, 10 slots are mentioned, spanning the restaurant, attraction and taxi domains. Information from one domain may be useless for another domain and can even cause confusion; for example, both the restaurant and taxi domains mention time and people.
So we propose the slot attention to extract only the history information that is useful for each slot. More concretely, for a particular slot s_j, we first encode its slot name into slot hidden states SH_j = [sh^enc_{j,1}, · · · , sh^enc_{j,|N|}], where |N| is the maximum size of the slot name phrase. Since the last hidden state sh^enc_{j,|N|} provided by the GRU contains the context information of the entire phrase, we pick it as the representation of slot s_j.
After that, we calculate the attention between the slot representation sh^enc_{j,|N|} and the hidden states of the dialogue history H_t = [h^enc_1, · · · , h^enc_{|X_t|}] to obtain the context vector c_j:

sc^t_j = sh^enc_{j,|N|} · h^enc_t,    a_j = softmax(sc_j),    c_j = Σ_t a^t_j h^enc_t

Here, the score sc^t_j indicates the relevance between slot s_j and the t-th position of the dialogue history. The context vector c_j ∈ R^{d_hdd} denotes the slot-specific information extracted from the entire dialogue history. Finally, we obtain the context vectors c = [c_1, c_2, · · · , c_J] ∈ R^{d_hdd×J} for all J slots.

Slot Information Sharing
The core of the slot information sharing module is a special matrix called the slot similarity matrix, which controls the information flow among similar slots. We introduce two sharing methods that differ in how they compute the slot similarity matrix: fix combination sharing and k-means sharing. We compare the effectiveness of the two methods in Section 6.

Fix Combination Method
We calculate the similarity between every pair of slots to construct the switch matrix. We first compute the cosine similarity over the two slot names and then the similarity over the slot value types. Specifically, the slot value types can be divided into several categories such as "date" and "location". For example, for the two slots "restaurant-area" and "restaurant-book day", the name similarity may be high since the two slot names share the common word "restaurant", while the type similarity is quite low: "restaurant-area" has a value of type "location", whereas "restaurant-book day" has a value of type "date". Next, the two calculated similarities sname and vtype are integrated with a hyperparameter α ∈ [0, 1] to give a matrix sim ∈ R^{J×J}:

sim_{ij} = α · sname_{ij} + (1 − α) · vtype_{ij}
Here, the integration ratio α controls the final similarity of the slots. In Table 2, we show that different choices of this ratio impact the model's tracking performance.
After that, the matrix sim is transformed into the slot similarity matrix M by a mask mechanism:

M_{ij} = 1 if sim_{ij} ≥ β, and M_{ij} = 0 otherwise

Here, the hyperparameter β acts as a threshold deciding whether two slots are similar enough to trigger the sharing switch and open the information path between them.
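The fix combination method can be sketched as follows. The vector representations of slot names and value types are placeholders here (in practice they would come from embeddings); the combination and thresholding follow the equations above.

```python
import numpy as np

def cosine(u, v):
    den = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / den) if den else 0.0

def similarity_matrix(name_vecs, type_vecs, alpha=0.8, beta=0.8):
    """
    name_vecs / type_vecs: (J, d) vector representations of slot names
    and value types (illustrative placeholders).
    sim_ij = alpha * sname_ij + (1 - alpha) * vtype_ij
    M_ij   = 1 if sim_ij >= beta else 0
    """
    J = len(name_vecs)
    M = np.zeros((J, J))
    for i in range(J):
        for j in range(J):
            sname = cosine(name_vecs[i], name_vecs[j])
            vtype = cosine(type_vecs[i], type_vecs[j])
            sim = alpha * sname + (1 - alpha) * vtype
            M[i, j] = 1.0 if sim >= beta else 0.0
    return M
```

With α = β = 0.8 (the best setting reported for MultiWOZ), two slots that share a name direction open the switch even when their value types differ, since 0.8 · 1 + 0.2 · 0 = 0.8 ≥ β.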

K-means Sharing Method
Since the fix combination method requires manual effort to search for the best hyperparameters, we propose another method, the k-means sharing method, which requires no hyperparameter tuning and achieves good performance on average. In this method, we also compute the slot name similarity sname_{ij} and the value type similarity vtype_{ij} between slots s_i and s_j, in the same way as in the fix combination method. Then we treat the pairs (sname_{ij}, vtype_{ij}) as points in a 2-D space and divide them into two groups with the k-means clustering algorithm. One group stands for "s_i and s_j are similar enough", the other for "not similar". The element M_{ij} is 1 if the pair falls in the similar group, and 0 otherwise.
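In the experiments, this clustering is done with sklearn's defaults; the following dependency-light sketch shows the idea with a minimal, deterministic 2-means over the (sname, vtype) pairs. The initialization scheme is our own choice for reproducibility.

```python
import numpy as np

def two_means(points, iters=20):
    """Minimal k-means with k=2 over (sname_ij, vtype_ij) pairs.
    Returns one 0/1 label per point, where 1 marks the 'similar' group
    (the cluster whose centroid has the larger total similarity).
    Deterministic init: the points with the smallest and largest
    coordinate sums seed the two centroids."""
    pts = np.asarray(points, dtype=float)
    sums = pts.sum(axis=1)
    centroids = np.stack([pts[sums.argmin()], pts[sums.argmax()]])
    labels = np.zeros(len(pts), dtype=int)
    for _ in range(iters):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centroids from current assignments
        for k in (0, 1):
            if (labels == k).any():
                centroids[k] = pts[labels == k].mean(axis=0)
    similar = int(centroids.sum(axis=1).argmax())
    return (labels == similar).astype(int)
```

Pairs with high name and type similarity land in the "similar" cluster and switch on the corresponding M_{ij} entries, with no α or β to tune.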
After obtaining the slot similarity matrix, whose entries are either 1 or 0, we multiply the context vectors c = [c_1, c_2, · · · , c_J] ∈ R^{d_hdd×J} by the slot similarity matrix M ∈ R^{J×J} to get the integrated vectors int = [int_1, int_2, · · · , int_J] ∈ R^{d_hdd×J}, which keep more expressive information for every slot. Specifically, int_j is calculated as:

int_j = Σ_{i=1}^{J} M_{ij} c_i

As this equation shows, int_j is the integration of all related context vectors c_i in c, guided by the slot similarity matrix M. The matrix M plays the role of a switch that controls the information flow between slots and provides selective integration. For example, this integration lets the data-insufficient slot "attraction-type" receive information from its related, data-sufficient slot "attraction-name", and helps our model deduce the related value for data-insufficient slots.
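Under this construction, the sharing step itself is just a masked matrix product. A tiny illustration (slot count and dimensions invented):

```python
import numpy as np

def share(c, M):
    """c: (d_hdd, J) slot-specific context vectors; M: (J, J) 0/1 switch matrix.
    Column j of the result is int_j = sum_i M[i, j] * c_i."""
    return c @ M

# two slots with 3-dimensional contexts; slot 1's switch lets it read slot 0's context
c = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [3.0, 0.0]])
M = np.array([[1.0, 1.0],   # M[0, 1] = 1: slot 0's context flows into slot 1
              [0.0, 1.0]])
int_vecs = share(c, M)
```

Here slot 1 has an all-zero context of its own (a stand-in for a data-insufficient slot), yet its integrated vector inherits slot 0's context through the open switch.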

Decoder
The value prediction process of our decoder can be divided into two steps: first, predicting whether the value of a certain slot is constrained by the user; and then extracting the value if the constraint is mentioned in the dialogue.
In the first step, a three-way classifier called the slot gate maps a vector taken from the encoded hidden states H_t to a probability distribution over the labels "ptr", "none" and "dontcare". Once the slot gate predicts "ptr", the decoder fills the slot with the value extracted from the dialogue; otherwise, it fills the slot with "not-mentioned" or "does not care".
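A minimal sketch of such a gate is a linear layer followed by a softmax over the three labels; the specific weights and input vector below are invented for illustration.

```python
import numpy as np

LABELS = ("ptr", "none", "dontcare")

def slot_gate(context, W, b):
    """Three-way linear classifier over a slot's context vector.
    W: (3, d_hdd), b: (3,). Returns the predicted label."""
    logits = W @ context + b
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return LABELS[int(probs.argmax())]
```

Only when the gate outputs "ptr" does the second decoding step below run; the other two labels short-circuit to fixed values.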
In the second step, another GRU is used as the decoder. During the decoding of the j-th slot, given a sequence of word embeddings [w_{j,1}, w_{j,2}, · · · , w_{j,|N|}], the GRU transforms them into decoded hidden states [h^dec_{j,1}, h^dec_{j,2}, · · · , h^dec_{j,|N|}] starting from the slot's integrated vector int_j:

h^dec_{j,k} = GRU(w_{j,k}, h^dec_{j,k−1}),    with h^dec_{j,0} = int_j

Here, |N| is the length of the slot sequence and int_j serves as the initial hidden state h^dec_{j,0}. The integrated vector int_j makes the decoded hidden states contain more information about the dialogue history, so they are more sensitive to whether the value of slot j is mentioned in the dialogue and where it is located. With the decoded hidden state h^dec_{j,k}, the generator computes P^gen_{jk}, the probability of the value being generated from the vocabulary list E ∈ R^{|V|×d_hdd}, and P^copy_{jk}, the probability of it being copied from the interaction history, where |V| is the vocabulary size and d_hdd is the dimension of the hidden state. In the end, we combine P^gen_{jk} and P^copy_{jk} to yield the final prediction P_{jk}:

P_{jk} = g_{jk} × P^gen_{jk} + (1 − g_{jk}) × P^copy_{jk}    (13)

Here, g_{jk} is a scalar that controls the model's behaviour: it determines whether to generate values from the vocabulary list or to copy words from the historical context.
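The soft gate of Equation (13) can be sketched as follows. In the model g_{jk} is itself predicted from the decoder state; here it is passed in as a constant, and the dot-product scoring for both distributions is an assumption.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def final_distribution(h_dec, E, H_t, history_tokens, vocab, g):
    """
    h_dec:          (d_hdd,)        current decoder hidden state
    E:              (|V|, d_hdd)    vocabulary embedding matrix
    H_t:            (|X_t|, d_hdd)  encoder hidden states
    history_tokens: the |X_t| words of the dialogue history
    vocab:          word -> index mapping
    g:              generate-vs-copy gate in [0, 1]
    """
    p_gen = softmax(E @ h_dec)             # P_gen: distribution over the vocabulary
    attn = softmax(H_t @ h_dec)            # attention over history positions
    p_copy = np.zeros(len(vocab))
    for pos, tok in enumerate(history_tokens):
        p_copy[vocab[tok]] += attn[pos]    # scatter copy mass back onto words
    return g * p_gen + (1.0 - g) * p_copy  # P_jk, as in eq. (13)
```

Because both P_gen and P_copy are probability distributions over the vocabulary, their convex combination is too, so the decoder can pick the argmax of P_jk directly as the next value token.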

Experiments
In this section, we first introduce the dataset and the evaluation metrics. We then describe our model's implementation details. Finally, we show our baseline models.

Datasets and Metrics
MultiWOZ (Budzianowski et al., 2018) is a fully-labelled collection of human-human written conversations spanning multiple domains and topics. It contains 7,032 multi-domain dialogues consisting of 2-5 domains; because these dialogues involve multiple tasks, the long dialogue history makes state tracking more difficult. Since there are no dialogues from the hospital and police domains in the validation and test sets of MultiWOZ, we follow TRADE (Wu et al., 2019) and use five of the seven domains for training, validation and testing: restaurant, hotel, attraction, taxi and train. These domains involve 30 slots.
We also test our model on a subset of MultiWOZ which only contains the dialogues from the restaurant domain to verify whether our model still works for single-task dialogues.
We evaluate all models using two metrics, slot accuracy and joint goal accuracy, following (Nouri and Hosseini-Asl, 2018):

• Slot accuracy. Slot accuracy checks whether each individual slot in the ground-truth dialogue state is correct; the metric only considers whether the requested slot is correct.
• Joint goal accuracy. Joint goal accuracy evaluates whether the user's goal in each turn is captured. The joint goal is achieved only when every slot in the ground-truth dialogue state is considered and has the correct value. It is the most important metric in the dialogue state tracking task.
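The two metrics can be computed as follows, a sketch using our own data layout in which each turn's state is a dict from slot to value (slots absent from a state are treated as "none"):

```python
def joint_goal_accuracy(preds, golds):
    """preds/golds: list of per-turn dicts {slot: value}.
    A turn counts only if every slot-value pair matches exactly."""
    hits = sum(1 for p, g in zip(preds, golds) if p == g)
    return hits / len(golds)

def slot_accuracy(preds, golds, all_slots):
    """Fraction of (turn, slot) decisions that are correct, counting
    an implicit 'none' for slots absent from a state."""
    correct = total = 0
    for p, g in zip(preds, golds):
        for s in all_slots:
            total += 1
            if p.get(s, "none") == g.get(s, "none"):
                correct += 1
    return correct / total
```

The gap between the two metrics explains the results below: a model can score near 97% on slot accuracy while one wrong slot per turn drags joint goal accuracy far lower.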

Implementation Details
We use the concatenation of GloVe embeddings (Pennington et al., 2014) and character-wise embeddings (Hashimoto et al., 2017) in the experiment. The model is trained with the ADAM optimizer (Kingma and Ba, 2014) and a batch size of 32. Both the encoder and the decoder use 400 hidden dimensions. The learning rate is initially set to 0.001; once the joint goal accuracy stops rising during training, the network automatically decreases its learning rate. We apply dropout with a rate of 0.2 for regularization (Srivastava et al., 2014). Besides that, a word dropout technique is used in the way proposed by (Bowman et al., 2015), which simulates the out-of-vocabulary setting. Our k-means clustering algorithm is implemented with the sklearn module, and we keep all the k-means hyperparameters at their defaults.

Table 2: Results evaluated on the MultiWOZ (except hotel) dataset. "RT shr" means the fix combination sharing method, "KM shr" the k-means sharing method, and "HM shr" the human evaluated sharing method. The two numbers after "-" represent the integration ratio α and the threshold β respectively.

Baseline Methods
We compare SAS with several previous methods: MDBT, GLAD, GCE, SpanPtr and TRADE. Based on the classical NBT model, MDBT (Ramadan et al., 2018) extends the task to multiple domains; it makes full use of the semantic similarities between the dialogue and the slot ontology to jointly track the domain and the value of each slot. GLAD relies on global modules to learn general information and local modules to catch slot-specific information (Zhong et al., 2018). GCE simplifies and improves GLAD while keeping its excellent performance (Nouri and Hosseini-Asl, 2018). SpanPtr first introduced the pointer network into dialogue state tracking to extract unknown slot values (Xu and Hu, 2018); that paper also applies an effective dropout technique for training. TRADE directly generates slot values from the dialogues using the copy mechanism and gets rid of the predefined value list (Wu et al., 2019); it achieves the previous state-of-the-art performance. In Table 1, we use the fix combination version of SAS with an integration ratio α of 0.8 and a threshold β of 0.8, the best hyperparameters we found for MultiWOZ.

Results
In this section, we first present the results of our model on the MultiWOZ dataset, then on the MultiWOZ (restaurant) and MultiWOZ (except hotel) datasets. We then report ablation experiments quantifying the improvements brought by the slot attention and the slot information sharing.
Our model achieves the best performance on the most important metric, joint goal accuracy, outperforming the previous state-of-the-art model, TRADE, by 2.41% absolute. We only observe a slight increase in slot accuracy over TRADE; we suspect this is because TRADE already achieves nearly 97% slot accuracy, which is close to the upper bound for this task. After carefully checking the error cases, we found that the remaining errors mainly come from the difficulty of generating name phrases.
To test SAS's ability on single-domain dialogue tasks, we also evaluate our model on a subset of MultiWOZ that contains only the restaurant search task. As displayed in Table 1, SAS achieves a 1.99% improvement over TRADE on joint goal accuracy as well, suggesting that SAS's good performance generalizes to single-domain tasks. Table 2 also shows how different choices of the hyperparameters influence the final results. On MultiWOZ, an integration ratio of 0.8 and a threshold of 0.8 are the best hyperparameters, but as illustrated in Table 2, the best integration ratio is no longer 0.8 on MultiWOZ (except hotel). The best values of the integration ratio and the threshold vary with the ontology.
We also perform an ablation study to quantify each module's contribution. We observe in Table 3 that adding the slot attention improves the state tracking results by 1.37% on MultiWOZ, suggesting that attending to the key information of the history is useful. The slot information sharing further enhances the performance by 1.04%; the likely reason is that sharing information between related slots lets data-insufficient slots receive more information, which handles the rare or unknown slot-value problem to some extent. As illustrated in Table 3, a model with the fix combination sharing method performs better than one with the k-means sharing method, but the fix combination method has an obvious shortcoming: it is difficult to generalize to a new ontology, since we must search for the hyperparameters of every new ontology, which is usually costly and time-consuming. The results in Table 2 and Table 3 indicate that the k-means algorithm provides a model that is more robust to different parameters.

To investigate whether the slot similarity matrices used by the two sharing methods really reflect the similarity among slots, we also compare them with a human-constructed similarity matrix. We invited three volunteers to carefully rate (1 or 0) the relationship between every two slots, yielding the slot similarity matrix used in the human evaluated method. As shown in Table 2 and Table 3, the performance of the k-means sharing method is close to that of the human-constructed method, indicating that human knowledge cannot further improve this task. Besides that, we also notice that the fix combination model usually outperforms the human-constructed method, demonstrating that it can automatically discover some hidden relationships among slots that humans cannot capture.

Error Analysis
To better understand why our model improves the performance, we investigated some dialogue examples, shown in Table 4.
In the first dialogue, by asking "Could you also find me a hotel with a moderate price that offers internet?", the user has briefly informed the agent that he/she is looking for a hotel with internet. The previous model misses "hotel-internet" in the tracked slots because it is misled by the long interaction history, whereas our model learns to focus on the important information through the slot attention and tracks the internet slot correctly.
In the second dialogue, although the previous model manages to capture the value "21:30", it still confuses "arriveby" with "leaveat", while SAS can distinguish them. We suspect this is because our model learns the differences between these slots by training on isolated key features per slot, without seeing irrelevant information.
In the third example, the user agrees to visit an attraction named "Christ's College" among the many college-type choices the agent suggests. The previous model fetches the wrong message and fills the slot "attraction-name" with "Clare College". In contrast, SAS captures the correct attraction name and also deduces that the attraction type is college. As in the first dialogue, the slot attention gives the model cleaner information with which to detect slot values accurately, and by sharing the information fetched for the slot "attraction-name" with the slot "attraction-type", our model is more sensitive to the value "college".
We also investigate the limitations of our model by analyzing the state tracking errors, and notice two types. First, SAS cannot effectively identify the value "dontcare" for most slots. For example, when the agent asks the user about his/her requirement on the hotel rating and he/she answers "that is not really important for me", the model fails to fill "dontcare" into the slot "hotel-star". We believe this is because "dontcare" can be expressed in many different ways, so it is much harder for the model to learn the semantics

of "dontcare" than other slots. Besides that, we also notice that the tracking errors of departure or destination location are still common. The reason may be that location name words are usually rich in variations and have few grammatical feature.

Conclusions and Future Work
We present SAS, an effective DST model that successfully extracts the key features from the original, information-excessive dialogue. The slot attention of SAS enables it to isolate the key information for each slot, while the slot information sharing enhances the expressiveness of the information passed to each slot by integrating the information from similar slots. This sharing allows SAS to generalize to rare slot-value pairs with few training data. Our model reaches state-of-the-art performance compared with previous models. We believe that SAS offers promising extensions, such as adapting our model to other tasks troubled by excessive information. Besides that, we also notice that it is hard for SAS to correctly extract names of hotels or attractions, which have rich variations; designing a new model to address these problems is future work.