Slot Attention with Value Normalization for Multi-domain Dialogue State Tracking

Incompleteness of the domain ontology and unavailability of some values are two inevitable problems in dialogue state tracking (DST). Existing approaches generally fall into two extremes: choosing models without ontology, or embedding ontology in models, which leads to over-dependence. In this paper, we propose a new architecture that exploits ontology flexibly. It consists of Slot Attention (SA) and Value Normalization (VN) and is referred to as SAVN. Moreover, we supplement MultiWOZ 2.1 with annotations of the supporting span, which is the shortest span in the utterances that supports the labeled value. SA shares knowledge between slots and utterances and needs only a simple structure to predict the supporting span. VN is designed specifically for the use of ontology and converts supporting spans to values. Empirical results demonstrate that SAVN achieves state-of-the-art joint accuracy of 54.52% on MultiWOZ 2.0 and 54.86% on MultiWOZ 2.1. We also evaluate VN with incomplete ontology. The results show that VN still contributes to our model even if only 30% of the ontology is used.


Introduction
Dialogue state tracking (DST) is a core component of pipeline-based task-oriented dialogue systems. The goal of DST is to extract the dialogue states, which are represented by a set of (domain, slot, value) triples, during a conversation. A (domain, slot, value) triple indicates that the previous conversation involves the slot of the domain and that its specific content is the value. For example, as shown in Figure 1, the triple (restaurant, price, expensive) means that the user wants to reserve an expensive restaurant. A high-quality DST model plays a significant role in the dialogue system, because the dialogue states determine the next system action (Chen et al., 2017).
Traditional DST approaches generally rely on a predefined ontology, in which all slots and their possible values are given. With a predefined ontology, DST is simplified to a classification problem: the goal is to choose the most appropriate value from the ontology for each slot (Zhong et al., 2018). However, in practical applications, a complete ontology is almost impossible to define in advance. To overcome this drawback, span-based (Xu and Hu, 2018) and generation approaches have emerged. The second problem is that some values required by DST cannot be found in the utterances because of the diverse descriptions used during a conversation. As shown in Figure 1, the value expensive was expressed as high end in the first turn. Span-based approaches are powerless against this problem. Recently, Zhang et al. (2019) proposed a dual strategy that combines the advantages of both the picklist-based and span-based methods. They use ontology in span-based approaches to deal with the problem and achieve SOTA performance, which also shows that the ontology is powerful.
A large-scale multi-turn dialogue dataset (MultiWOZ) spanning several domains and topics has also been introduced. As shown in Figure 1, the user initially wants to make a restaurant reservation, then requests information about attractions close to the restaurant, and finally books a taxi. During the conversation, DST models should determine whether each (domain, slot) pair has a value in each turn to obtain the most relevant (domain, slot, value) triples. However, unlike single-domain DST problems, in which only a few slots need to be tracked, such as the four slots in WOZ (Wen et al., 2017), MultiWOZ contains a total of 30 (domain, slot) pairs across five domains, and practical applications can involve even more. DST models therefore need to determine the slots efficiently.
To tackle these challenges, we emphasize that DST models should optimize the structure of slot determination and utilize ontology more flexibly rather than abandon it. In this paper, we propose to divide the DST model into Slot Attention (SA) and Value Normalization (VN). Simple and efficient processing of slots and flexible use of ontology are the main advantages of SAVN. The contributions of this work are summarized as follows†:
• SA shares knowledge between slots and utterances and is able to optimize the determination of all slots jointly. Compared to the span-based approach in DS-DST (Zhang et al., 2019), SA improves efficiency by nearly count(slots) times in determining the slots.
• Considering that the number of possible slot values in the ontology could be large in real scenarios, VN is designed as a simple, flexible, and effective model for using an ontology, which needs only 8 minutes of training on a V100 GPU. VN can choose to directly output the supporting span from SA or select a value in the ontology.
• We supplement the annotation of supporting span for labeled values unavailable in utterances on MultiWOZ 2.1, which can help the span-based model learn semantics more fully and help the ontology be better utilized.
• We fully evaluate VN with incomplete ontology. The results show that VN improves the performance of SAVN as long as at least 30% of the ontology is available. As expected, the more complete the ontology is, the more VN can rely on it.
† The code will be released at https://github.com/wyxlzsq/savn.

SAVN model
The overview of the model is shown in Figure 2.

Slot Attention
As shown in Figure 3, Slot Attention (SA) accepts two inputs: the text of the previous conversation and a list of (domain, slot) pairs. Similar to the DS-DST model (Zhang et al., 2019), we employ BERT (Devlin et al., 2019) as the encoder for utterances. The difference is that we encode slots and utterances separately and share knowledge between them. We then directly use the inner product to predict the span in the utterances and use an attention module between slots and utterances for classification. Benefiting from this structure, our model can determine the slots in parallel, optimize the determination jointly, and needs to encode the utterances only once per turn, whereas DS-DST needs to encode them count(slots) times. Additionally, so that SA can output some special words, we prepend some fixed candidate values, such as yes and no, to the utterances.
Let us define $X = \{(u_1, r_1), \ldots, (u_n, r_n)\}$ as the set of user utterances and system responses in a conversation with $n$ turns, $C = [a_1, a_2, \ldots, a_k]$ as the list of $k$ fixed candidate values, and $S = [s_1, s_2, \ldots, s_j]$ as the list of $j$ (domain, slot) pairs. Due to the limit on the maximum sequence length of BERT, it is sometimes not possible to encode all utterances. Therefore, we set a parameter $m$ to limit the number of turns that are fed in. The input utterances for turn $t$ should be: where $U_1$ represents $u_1 \oplus r_1$ and $\oplus$ denotes the concatenation of the utterances $u_1$ and $r_1$. $r_t$ is the system response of turn $t$, so $r_t \notin X^m_t$. Then, by encoding the utterances of turn $t$ with BERT and embedding the slots with the embedding module of BERT, we get the hidden states of utterances and slots as follows: where $H^u_t \in \mathbb{R}^{p \times h}$ and $H^s_t \in \mathbb{R}^{q \times h}$. Here $p$ is the sequence length of $I$, $q$ is the number of (domain, slot) pairs, and $h$ is the dimension of the BERT hidden state.

Slot Gate Classification
As introduced in Section 1, there are many (domain, slot) pairs in the multi-domain DST problem, which makes it more challenging than the single-domain DST problem. Similar to the TRADE model, we design a classification module with none, dontcare and span as a slot gate. For each (domain, slot) pair, if the slot gate predicts none or dontcare, we ignore the span predicted from the utterances and fill the pair with "none" or "do not care".
The module to classify slots is similar to a Transformer (Vaswani et al., 2017) block. We employ "Scaled Dot-Product Attention" to get an utterance-aware slot representation $A^s_u$: where $A^s_u \in \mathbb{R}^{q \times h}$. Then, in order to better integrate the states of slots and utterances, we add $A^s_u$ and $H^s_t$ to get $H^c$ as the features to classify: where $G_t$ is the slot gates of all (domain, slot) pairs at turn $t$.
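The gate computation described above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the slot states act as queries and the utterance states as keys and values, the attended states are added to the slot states to form the classification features, and a hypothetical linear layer (`weight`, `bias`) followed by an argmax stands in for the classifier, whose exact form is not given in the text.

```python
import math

GATES = ("none", "dontcare", "span")


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(h_slots, h_utt):
    """Scaled dot-product attention: slots are queries, tokens are keys and values."""
    h = len(h_utt[0])
    attended = []
    for query in h_slots:
        weights = softmax([dot(query, key) / math.sqrt(h) for key in h_utt])
        attended.append(
            [sum(w * tok[d] for w, tok in zip(weights, h_utt)) for d in range(h)]
        )
    return attended


def gate_features(h_slots, h_utt):
    """Add the utterance-aware slot states to the slot states (residual sum)."""
    return [
        [a + s for a, s in zip(att_row, slot_row)]
        for att_row, slot_row in zip(attend(h_slots, h_utt), h_slots)
    ]


def predict_gates(features, weight, bias):
    """Hypothetical linear classifier: one gate label per (domain, slot) pair."""
    labels = []
    for row in features:
        logits = [dot(w_row, row) + b for w_row, b in zip(weight, bias)]
        labels.append(GATES[logits.index(max(logits))])
    return labels
```

Because every slot is a row of the same query matrix, the utterance states are computed once and reused for all (domain, slot) pairs, which is the source of the efficiency gain the section describes.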

Span-Based Value Prediction
For each (domain, slot) pair, span-based methods obtain the value by predicting a span with a start and an end position in the utterances. In order to make the slot determination more efficient, we simplify the structure of span-based prediction. We can get the span predictions by: where $P^s_t$ and $P^e_t$ are the start position distributions and end position distributions of all (domain, slot) pairs at turn $t$, respectively.
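As an illustration of span prediction by inner product, the following sketch scores every token against a slot vector and then decodes a span from the start and end distributions. Several details are assumptions made for the example rather than taken from the text: the helper names, the use of separate slot vectors for start and end scoring, the constraint that the end does not precede the start, and the `max_len` cap.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def position_distribution(slot_vec, token_vecs):
    """Inner product of one slot state with every token state, then softmax."""
    return softmax([sum(s * t for s, t in zip(slot_vec, tok)) for tok in token_vecs])


def decode_span(p_start, p_end, max_len=10):
    """Pick the (start, end) pair with the highest joint probability, start <= end."""
    best, best_score = (0, 0), -1.0
    for i, ps in enumerate(p_start):
        for j in range(i, min(i + max_len, len(p_end))):
            if ps * p_end[j] > best_score:
                best, best_score = (i, j), ps * p_end[j]
    return best


tokens = ["a", "high", "end", "place"]
start, end = decode_span([0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1])
supporting_span = " ".join(tokens[start:end + 1])  # "high end"
```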

Optimization
We can optimize the determination of all slots jointly. The joint losses at turn $t$ are as follows: where $L_g$ is the loss of the slot gate predictions, and $L_s$ and $L_e$ are the losses of the start and end position predictions, respectively. $Q$ is the number of (domain, slot) pairs and $y$ is the true one-hot label. The end loss $L_e$ is obtained in the same way as $L_s$. Then we optimize the weighted sum of these three loss functions using hyper-parameters $\alpha$ and $\beta$.
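A small numeric sketch of the joint objective follows. With one-hot labels, each cross-entropy term reduces to the negative log-probability of the correct class. Two points are assumptions, since the text does not spell them out: the losses are averaged over the $Q$ pairs, and the weighted sum takes the form $L_g + \alpha L_s + \beta L_e$.

```python
import math


def mean_cross_entropy(probs, labels):
    """Average negative log-likelihood over the Q (domain, slot) pairs."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def joint_loss(gate, start, end, alpha=1.0, beta=1.0):
    """Assumed combination: L = L_g + alpha * L_s + beta * L_e.

    Each argument is a (probabilities, label indices) pair for all slots.
    """
    l_g = mean_cross_entropy(*gate)
    l_s = mean_cross_entropy(*start)
    l_e = mean_cross_entropy(*end)
    return l_g + alpha * l_s + beta * l_e
```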

Value Normalization
Value Normalization is a flexible module for utilizing ontology, which can also be combined with other DST models. Considering that there are numerous possible values in the ontology and little data for training, we design a simple and effective model for VN, which can also benefit from the pre-trained BERT model. As shown in Figure 4, VN is designed with one layer of the transformer block, which we call VN$_1$. By analogy, we can get VN$_4$ and VN$_{12}$ (i.e., using the BERT-base model as the encoder). The model loads its parameters from the corresponding layers of BERT. In Section 4.2, the experimental results show that VN$_1$ already performs well enough on the MultiWOZ dataset.
Let $T_0$ be the hidden state of the first token ([CLS]) after the transformer. We use $T^s_0 \in \mathbb{R}^h$ as the hidden state of the supporting span after encoding and $T^o_0 \in \mathbb{R}^{n \times h}$ as the hidden states of the $n$ possible values in the ontology for the corresponding (domain, slot) pair. We use the inner products of the supporting span and the possible values as the matching scores, which are defined as: Then we can get the max matching value. In addition, we employ the cosine similarity as the value gate, since the ontology may be incomplete. The final output is: where $V_m$ is the max matching value, $SP$ is the supporting span and $\theta$ is a hyper-parameter. Our loss function for optimizing VN is defined as follows: where $y_v$ is the true one-hot label and $r = 1$ means that the value for the supporting span is in the ontology. However, without preprocessing, $r$ will always be equal to 1 in training, because the ontology is invariably complete for the training set. In our experiments, we employ the full training set to train VN with incomplete ontology in order to get dispersed vector representations for values.
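The inference step of VN can be sketched as follows. This is an illustrative sketch, not the released code: the vectors stand for the encoded [CLS] states, the function and argument names are hypothetical, and the choice of `>=` for the comparison against $\theta$ is an assumption. The candidate with the highest inner product is selected, and the cosine similarity then decides whether to trust it or to fall back to the supporting span.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def normalize_value(span_text, span_vec, ontology_vecs, theta):
    """Return the best-matching ontology value, or the span itself if the match is weak.

    ontology_vecs maps each candidate value of the slot to its encoded vector.
    """
    if not ontology_vecs:
        return span_text
    # Max matching value by inner product.
    best_value = max(ontology_vecs, key=lambda v: dot(span_vec, ontology_vecs[v]))
    # Value gate: only trust the ontology when the cosine similarity is high enough.
    if cosine(span_vec, ontology_vecs[best_value]) >= theta:
        return best_value
    return span_text
```

A lower `theta` makes the gate accept more ontology matches, while a higher `theta` keeps more supporting spans unchanged, which matters when the true value may be missing from the ontology.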

Annotation for Supporting Span
Our annotation work is based on the MultiWOZ 2.1 dataset (Eric et al., 2019), which is a corrected version of the MultiWOZ 2.0 dataset. MultiWOZ 2.1 is a large-scale collection of human-human written conversations over multiple domains and topics, with 63,662 labeled (conversation, domain, slot, value) quadruples (excluding the "none" value) in the training set. The annotation of supporting spans mainly addresses the problem that some labeled values cannot be found in the conversations. The causes of this problem can be divided into three categories: varied expressions, spelling mistakes, and annotation errors.
The criterion of annotations is to find the shortest span in the conversations, which can help us get the labeled value. Based on the criterion, we annotate 936 (supporting span, value) pairs on MultiWOZ 2.1 training set, in which varied expressions account for 637 (68%), spelling mistakes account for 123 (13%), and annotation errors account for 176 (19%). Table 1 shows some examples of supporting span annotation.
After annotation, we can change the (domain, slot, value) triples in the training set to (domain, slot, supporting span, value) quadruples, where the supporting span is equal to the value if the value can be found in the conversation. Then we employ (domain, slot, supporting span) triples to train SA and (supporting span, value) pairs to train VN. Specifically, we exclude the pairs that stem from annotation errors when training VN, because VN should not learn to convert, for example, Saturday to Sunday.
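The conversion described above can be sketched as a small data-preparation step. The `annotations` lookup, its category labels, and the case-insensitive substring check are hypothetical choices made for this example; the actual annotation file format is not described in the text.

```python
def to_quadruple(triple, conversation, annotations):
    """Turn (domain, slot, value) into (domain, slot, supporting span, value).

    annotations is a hypothetical lookup from (domain, slot, value) to a
    (supporting_span, category) pair for values absent from the conversation.
    """
    domain, slot, value = triple
    if value.lower() in conversation.lower():
        # The value appears verbatim, so it is its own supporting span.
        return (domain, slot, value, value)
    span, _category = annotations[triple]
    return (domain, slot, span, value)


def annotated_vn_pairs(annotations):
    """(supporting span, value) pairs from the annotations, skipping annotation errors."""
    return [
        (span, value)
        for (_domain, _slot, value), (span, category) in annotations.items()
        if category != "annotation error"
    ]
```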

Experiments
We evaluate our model on two publicly available datasets, MultiWOZ 2.0 and MultiWOZ 2.1. Both are fully labeled task-oriented corpora of human-human written conversations and contain 8,438 multi-turn dialogues in the training set, with 13.68 turns per dialogue on average. The difference between the two is that MultiWOZ 2.1 has changed more than 32% of the state annotations across 40% of the dialogue turns to fix the noisy state annotations in MultiWOZ 2.0 (Eric et al., 2019).
Following previous work, only five domains (i.e., restaurant, hotel, attraction, taxi, and train) are employed in our experiments, because the dialogues that belong to the other two domains (i.e., hospital and police) are rare in the training set and do not appear in the test set. As introduced in Section 3, we get a new training set by using the supporting span annotations, which we call the SP training set. The test set and the dev set are left unchanged.

Training Details
We use the pre-trained BERT-base-uncased model as the utterance encoder in SA, which has 12 hidden layers with 768 units. Because of the limit on the maximum sequence length, we set $m$ (in Equation 1) to 9. If the current conversation turn exceeds $m$, we combine the predicted dialogue states with the previous dialogue states to complete the dialogue states for the current turn.
In our experiments, SA and VN are both trained with the Adam optimizer (Kingma and Ba, 2014), in which the learning rate linearly decreases from 5e-5 and 1e-4, respectively. We train SA for 3 epochs and VN for 5 epochs on both MultiWOZ 2.0 and MultiWOZ 2.1. Specifically for VN, we train VN$_1$ and VN$_{12}$ (introduced in Section 2.2) to compare their performance.
Our results can be reproduced with a 16 GB V100 GPU in 2 hours (8 minutes for VN$_1$).
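One way to combine the states of a truncated history with earlier states is sketched below. The merge rule is an assumption made for illustration: values predicted from the current window override earlier ones, and a "none" prediction leaves an earlier value in place. The text does not specify how such conflicts are resolved.

```python
def complete_state(previous_state, predicted_state):
    """Carry over earlier slot values and let new non-"none" predictions override them."""
    merged = dict(previous_state)
    for slot, value in predicted_state.items():
        if value != "none":
            merged[slot] = value
    return merged
```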

Results
Two standard metrics, joint accuracy and slot accuracy, are employed to evaluate the performance of our model. Joint accuracy is the accuracy of dialogue states, which requires that all (domain, slot, value) triples in the dialogue state are predicted correctly. Slot accuracy is the accuracy of individual (domain, slot, value) triples, which requires only that the value of a given (domain, slot) pair is predicted correctly. Joint accuracy is the more challenging metric, because there is a considerable number of (domain, slot) pairs in the dialogue states.
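The two metrics can be made concrete with a short sketch. It assumes each dialogue state is a dictionary from a (domain, slot) name to its value and that a missing slot is equivalent to "none"; the function names are illustrative.

```python
def _filled(state):
    """Drop explicit "none" entries so they compare equal to missing slots."""
    return {slot: value for slot, value in state.items() if value != "none"}


def joint_accuracy(predicted, gold):
    """Fraction of turns whose full predicted state equals the gold state."""
    return sum(_filled(p) == _filled(g) for p, g in zip(predicted, gold)) / len(gold)


def slot_accuracy(predicted, gold, slots):
    """Fraction of (domain, slot) pairs with the correct value; missing means "none"."""
    correct = sum(
        p.get(s, "none") == g.get(s, "none")
        for p, g in zip(predicted, gold)
        for s in slots
    )
    return correct / (len(gold) * len(slots))
```

A single wrong slot makes the whole turn count as incorrect for joint accuracy, while slot accuracy still credits the other pairs, so slot accuracy is always at least as high as joint accuracy.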
To better evaluate the role of supporting spans, we train two versions of SA: SA$_{raw}$, which uses the original training set, and SA$_{sp}$, which uses the SP training set. We compare them with the following existing models:
• GLAD (Zhong et al., 2018) shares parameters among slots by virtue of global modules and applies local modules to learn slot-specific features.
• Neural Reading formulates DST as a reading comprehension task. The model encodes the word tokens with a pre-trained BERT model, then obtains the contextual representation with an LSTM.
• SUMBT learns the slot and utterance representations by fine-tuning a pre-trained BERT model. It then computes the similarity between possible values and utterances via a slot-utterance matching module.
• TRADE employs an encoder-decoder architecture to generate the values for slots from the vocabulary and the dialogue history.
• DSTQA (Zhou and Small, 2019) models DST as a question answering problem, which generates a question to ask for the value of the slot at each turn.
• DS-DST (Zhang et al., 2019) proposes a Dual Strategy to combine the advantages of the picklist-based and span-based methods, which has been evaluated individually as DST-Picklist and DST-Span.
On MultiWOZ 2.0, as shown in Table 2, our model achieves the highest performance, a joint accuracy of 54.52%, of which VN contributes a 6.08% absolute improvement. On MultiWOZ 2.1, as shown in Table 3, our model also achieves the highest performance, a joint accuracy of 54.86%, of which VN contributes a 9.12% absolute improvement. Combining the results from the two tables, we show that the performance of SA is similar to that of TRADE and that SA has about 5% higher absolute performance than DST-Span, which is also a span-based method using BERT.
Comparing SA$_{raw}$ with SA$_{sp}$, we find that their performance is similar without VN, while VN improves SA$_{sp}$ clearly more than it improves SA$_{raw}$. This shows that SA$_{sp}$ has learned more about semantics, so that it can output the supporting span and combine better with VN. Furthermore, by comparing VN$_1$ with VN$_{12}$, we show that a single transformer layer gives VN sufficient performance on the MultiWOZ dataset.

Incomplete Ontology
The experimental results in Table 2 and Table 3 show that ontology is a powerful resource for DST. However, it is impractical to obtain a full ontology in advance when the DST model is intended for practical applications, which leads some models, such as TRADE, to abandon ontology. In this section, we choose SA$_{sp}$ on MultiWOZ 2.1 as the base model to evaluate the performance of VN$_1$ with incomplete ontology.
There are many slots in the ontology. We can divide them into two categories, common-value slots and special-value slots. Common-value slots, such as hotel-price and hotel-type, can cover all possible values with only a few given values. For special-value slots, such as restaurant-name and taxi-departure, it is difficult to cover all possible values with predefined values. In our experiment, we only drop values in special-value slots and always keep all values of common-value slots. The results are shown in Table 4. Even if only 30% of the ontology is available, VN can improve performance as long as $\theta$ is appropriate. The results show that the performance of VN improves steadily as the usage rate of the ontology increases. Furthermore, the more complete the ontology is, the smaller $\theta$ can be, which means VN can depend more on the ontology.
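The incomplete-ontology setting can be simulated with a sketch like the following. It is an assumed procedure, not the authors' script: values are sampled uniformly at random per special-value slot with a fixed seed, and `keep_ratio` plays the role of the ontology usage rate.

```python
import random


def drop_ontology(ontology, special_slots, keep_ratio, seed=0):
    """Keep every value of common-value slots and a random share of special-value slots."""
    rng = random.Random(seed)
    reduced = {}
    for slot, values in ontology.items():
        if slot in special_slots:
            keep = round(len(values) * keep_ratio)
            reduced[slot] = sorted(rng.sample(values, keep))
        else:
            reduced[slot] = list(values)
    return reduced
```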

Error Analysis
An error analysis of SA$_{sp}$ with VN$_1$ on MultiWOZ 2.1 is shown in Figure 5. The three slots with the highest error rates are restaurant-name with 6.37%, hotel-type with 6.08% and attraction-name with 5.93%. Through a detailed analysis of the error samples, we observe that many labeled states do not include a name that appears only in the system response. These states are similar to the example in Figure 1 with the restaurant-name and attraction-name removed. Once such a difference occurs, it leads to errors in the subsequent dialogue states, resulting in high error rates. The labels of hotel-type are also found to be confusing. For instance, for the sentence "I am looking for a hotel with ...", the label of hotel-type is sometimes hotel and sometimes none.
Compared with SA, SAVN has significantly lower error rates on attraction-type and attraction-name. The improvement on attraction-name is mainly due to the repair of spelling mistakes, and the improvement on attraction-type mainly benefits from the normalization of varied expressions. It is worth mentioning that VN cannot improve the accuracy of slots that only need to be filled with yes or no, such as hotel-internet and hotel-parking.

Related Work
Traditional dialogue state tracking models extract utterance semantics with hand-crafted features and complex domain-specific lexicons (Wang and Lemon, 2013; Williams, 2014; Henderson et al., 2014) to predict the dialogue states, which makes them hard to adapt to new domains. To overcome this drawback, the Neural Belief Tracking (NBT) framework was proposed, which learns n-gram representations of utterances with a convolutional neural network and achieves better performance. At the same time, models for multi-domain DST have been proposed. Rastogi et al. (2017) build a multi-domain DST model with a two-layer bi-GRU, and Ramadan et al. (2018) track the domain and the dialogue states jointly through multiple bi-LSTMs. They employ semantic similarity between utterances and the values in the ontology and allow knowledge to be shared across domains. To transfer knowledge between slots, Zhong et al. (2018) propose a global-local architecture to share parameters among slots, and Ren et al. (2018) propose StateNet, which shares all parameters among slots and fixes the word embeddings during training to handle new slots.
After the pre-trained BERT model showed superior performance, encoding with BERT became the mainstream. One line of work encodes the slots and utterances with BERT and then computes the similarity between possible values and utterances after a multi-head attention layer. Zhang et al. (2019) also employ BERT to encode the utterances. The difference is that they combine the picklist-based and span-based methods and obtain higher performance. In order to eliminate the dependence on ontology, an encoder-decoder architecture with a pointer network has been proposed to generate the value for each slot. Zhou and Small (2019) formulate multi-domain DST as a question answering problem and learn relationships between slots with a dynamically evolving knowledge graph. Most recently, Heck et al. (2020) propose to use copy mechanisms to fill slots with values, combining span-based methods with memory methods to avoid the use of value picklists.

Conclusion
We introduce a new architecture that separates the prediction of slots from the use of ontology. SA shares parameters not only among all slots but also between slots and utterances. VN handles ontology flexibly with a simple and effective structure and is able to work with incomplete ontology. Combining SA with VN, SAVN achieves state-of-the-art joint accuracy on both MultiWOZ 2.0 and MultiWOZ 2.1. We also introduce the annotation of supporting span. In future work, the supporting span annotation can be added to the datasets of task-oriented dialogue systems, because the supporting span serves as a bridge between the diverse descriptions of users and the normative values in the system. Furthermore, DST models with supporting span allow for a fairer comparison regardless of whether the ontology is used.