Cost-Sensitive Active Learning for Dialogue State Tracking

Dialogue state tracking (DST), when formulated as a supervised learning problem, relies on labelled data. Since dialogue state annotation usually requires labelling every turn of a dialogue while taking context into account, annotating all available unlabelled data is very expensive. In this paper, a novel cost-sensitive active learning framework is proposed based on a set of new dialogue-level query strategies. This is the first attempt to apply active learning to dialogue state tracking. Experiments on DSTC2 show that active learning with mixed data query strategies can achieve the same DST performance with significantly less data annotation than traditional training approaches.


Introduction
The dialogue state tracker, an important component of a spoken dialogue system, tracks the internal belief state of the system based on the history of the dialogue. For each turn, the tracker outputs a distribution over possible dialogue states, on which the dialogue system relies to take proper actions to interact with users. Various approaches have been proposed for dialogue state tracking, including hand-crafted rules (Wang and Lemon, 2013; Sun et al., 2014), generative models and discriminative models (Ren et al., 2013; Lee and Eskenazi, 2013; Williams, 2014). For discriminative models, recent studies on data-driven approaches have shown promising performance, especially with Recurrent Neural Networks (RNNs) (Henderson et al., 2013, 2014c). As for datasets, the Dialog State Tracking Challenge (DSTC) series (Williams et al., 2016) has provided common testbeds for this task.
Though data-driven approaches have achieved promising performance, they require large amounts of labelled data, which are costly to annotate in full. Moreover, labelling even a single dialogue is difficult: for every dialogue turn, experts need to label all the semantic slots, and to label a turn accurately they typically need to consider the context rather than the current turn alone. Active learning (AL) (Settles, 2010) can be applied to select valuable samples to label. With AL, fewer labelled samples are needed to train the model to the same or even better performance than with traditional training approaches.
Although it is often assumed that the labelling costs are the same for all samples in some tasks (González-Rubio and Casacuberta, 2014;Sivaraman and Trivedi, 2014), it is appropriate to consider different labelling costs for the dialogue state tracking task where different dialogues vary in the number of turns. In this paper, we define the labelling cost for each dialogue sample with respect to its number of dialogue turns. Then we provide a new AL query criterion called diversity, and finally propose a novel cost-sensitive active learning approach based on three dimensions: cost, uncertainty, and diversity. The results of experiments on the DSTC2 dataset (Henderson et al., 2014a) demonstrate that our approaches are more effective compared to traditional training methods.
In the next section, we will present the proposed cost-sensitive active learning framework for dialogue state tracking. Then in Section 3 we will describe the experimental setup and show the results of experiments on the DSTC2 dataset, followed by our conclusions and future work in Section 4.

Cost-Sensitive Active Learning
A complete work cycle of active learning for dialogue state tracking includes three steps: (1) train the tracker with the labelled dialogue samples; (2) post a query, using the query strategy to select valuable unlabelled dialogues and asking experts for their labels; (3) merge the newly labelled dialogues with all previously labelled dialogue samples and return to (1). The tracker and the query strategies are introduced in Sections 2.1 and 2.2 respectively.
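The three-step cycle above can be sketched as a short loop. This is a minimal illustration rather than the authors' implementation: `train` and `score` are hypothetical callables standing in for the tracker and the query strategy, dialogues are assumed to be lists of turns, expert labelling is assumed to happen when a dialogue is moved into the labelled pool, and the stopping rule (a budget counted in turns) is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

Dialogue = List[str]  # hypothetical representation: one string per turn


def active_learning_loop(
    labelled: Sequence[Dialogue],
    unlabelled: Sequence[Dialogue],
    train: Callable[[Sequence[Dialogue]], object],
    score: Callable[[object, Dialogue], float],
    turn_budget: int,
    batch_size: int = 2,
) -> Tuple[object, List[Dialogue], int]:
    """Repeat train -> query -> merge until the turn budget is spent."""
    labelled, pool = list(labelled), list(unlabelled)
    spent = 0
    model = train(labelled)  # step (1): train on the labelled pool
    while pool and spent < turn_budget:
        # Step (2): rank the unlabelled pool, highest score first.
        order = sorted(range(len(pool)),
                       key=lambda i: score(model, pool[i]), reverse=True)
        picked = set(order[:batch_size])
        chosen = [pool[i] for i in sorted(picked)]
        pool = [d for i, d in enumerate(pool) if i not in picked]
        # Labelling cost is counted in turns (experts are assumed to label here).
        spent += sum(len(d) for d in chosen)
        # Step (3): merge the newly labelled dialogues, then retrain.
        labelled.extend(chosen)
        model = train(labelled)
    return model, labelled, spent
```

Passing the query strategy as a scoring function keeps the loop independent of both the tracker and the criterion used to rank dialogues.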

Dialogue State Tracker
Our proposed active learning workflow is independent of the tracker type. Here we use the LecTrack model (Zilka and Jurcicek, 2015) as a word-based tracker. For each turn t in a dialogue, the tracker takes as input the concatenation of all history words in the dialogue (together with their confidence scores from ASR) and outputs a prediction. The general model structure (at turn t) is shown in Figure 1. The notation $w \oplus u$ denotes the concatenation of two vectors, w (the word) and u (the confidence score). FC refers to the fully connected layer. The output of FC is then encoded by the LSTM encoder Enc, whose last output is sent to a Softmax layer to make a prediction $p_t^s \in \mathbb{R}^{N_s}$ over all $N_s$ possible values for a given slot s at turn t.

[Figure 1: General model structure of the tracker for each slot s at turn t, with output $p_t^s$.]
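Read as a composition of the layers named above, the prediction can be written compactly. This is only a sketch inferred from the prose description: the index n (the number of history words available at turn t) and the subscripts on w and u are notation assumed here, not taken from the original.

```latex
p_t^s = \mathrm{Softmax}\Bigl(\mathrm{Enc}\bigl(\mathrm{FC}(w_1 \oplus u_1), \dots, \mathrm{FC}(w_n \oplus u_n)\bigr)\Bigr) \in \mathbb{R}^{N_s}
```

Here Enc is understood to return only its last output, as stated in the text.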

Cost-Sensitive Active Learning Methods
Given that different dialogues vary in the number of turns, we assume that the smallest query unit should be a whole dialogue, and that the cost for labelling a dialogue is directly proportional to the number of dialogue turns.
Since all the unlabelled data can be collected at once, the DST task can be regarded as pool-based sampling (Settles, 2010), which assumes that a small pool of labelled data L and a large pool of unlabelled data U are available. This allows us to query samples greedily according to measurement criteria that are evaluated on all samples in the unlabelled pool.
We propose four novel query strategies. Each of the first three uses a single measurement criterion, and the last one is a mixture of the first three. For convenience, a dialogue sample is denoted as x.

Cost Strategy (CS)
This strategy prefers the dialogue samples that have the minimum number of turns. For each dialogue sample, its labelling cost, denoted as C(x), can be defined as the number of turns.

Uncertainty Strategy (US)
This strategy prefers the dialogue samples whose predictions the DST model is most uncertain about. In this paper, we use entropy (Shannon, 2001) as the uncertainty measurement criterion. The dialogue uncertainty on slot s, $U_s(x)$, is the average of the entropy values over all dialogue turns:
$$U_s(x) = -\frac{1}{C(x)} \sum_{t=1}^{C(x)} \sum_{i=1}^{N_s} p_t^s[i] \log p_t^s[i],$$
where $p_t^s[i]$ is the i-th component of $p_t^s$, which can be directly obtained from the DST model described in Section 2.1.
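As a worked illustration of the uncertainty criterion, the sketch below averages the Shannon entropy of the per-turn distributions of one dialogue for a fixed slot. The function names are hypothetical, the natural logarithm is an assumed choice of base, and the inputs are assumed to be already normalized probability vectors produced by the tracker.

```python
import math
from typing import Sequence


def turn_entropy(p: Sequence[float]) -> float:
    """Shannon entropy (natural log) of one turn's distribution over slot values."""
    # Zero-probability entries contribute nothing to the entropy.
    return -sum(q * math.log(q) for q in p if q > 0.0)


def dialogue_uncertainty(turn_distributions: Sequence[Sequence[float]]) -> float:
    """Mean per-turn entropy of one dialogue for a fixed slot."""
    if not turn_distributions:
        raise ValueError("a dialogue needs at least one turn")
    total = sum(turn_entropy(p) for p in turn_distributions)
    return total / len(turn_distributions)
```

A dialogue whose turns all receive near one-hot predictions scores close to zero, while a dialogue with uniform predictions over $N_s$ values scores $\log N_s$.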

Diversity Strategy (DS)
This strategy prefers the dialogue samples that differ most from the dialogues currently in the labelled pool L. As the training and querying process goes on, the diversity of the dialogue samples selected for labelling gradually decreases, which results in a biased training process. To handle this problem, we design a novel method based on Spherical k-Means Clustering (MacQueen et al., 1967) to evaluate the diversity of dialogue samples and select the most diverse ones in the unlabelled pool U to label, so that the diversity of the dialogue samples in the labelled pool L is maintained.
Different dialogues have varying lengths so an embedding function to map each dialogue into a fixed-dimensional continuous space is needed. We utilize the method of unsupervised dialogue embeddings (Su et al., 2016) to extract a dialogue feature, which is used to calculate the diversity.
We choose the bag-of-words (BOW) representation as the turn-level feature $f_t$ at turn t, which is fed into a bidirectional LSTM (BLSTM) (Graves et al., 2013) encoder to obtain the forward and backward hidden sequences, $h^f_{1:T}$ and $h^b_{1:T}$, with T the number of turns in the dialogue. Then, the dialogue feature vector v is calculated as the average over all hidden states, i.e., $v = \frac{1}{T} \sum_{t=1}^{T} h^f_t \oplus h^b_t$, where the notation $h^f_t \oplus h^b_t$ denotes the concatenation of the two vectors $h^f_t$ and $h^b_t$. Next, the dialogue feature vector is used as the input of a forward LSTM decoder at each turn t, which ultimately outputs the feature sequence $f'_{1:T}$. The model's training target is to minimize the mean squared error (MSE) between $f_{1:T}$ and $f'_{1:T}$.
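The averaging step that produces the dialogue feature vector can be illustrated on its own. The sketch below assumes the forward and backward hidden states have already been computed by some encoder and are given as plain lists of floats; the function name is hypothetical, and the encoder and decoder themselves are not modelled.

```python
from typing import List, Sequence

Vector = List[float]


def dialogue_feature(h_fwd: Sequence[Sequence[float]],
                     h_bwd: Sequence[Sequence[float]]) -> Vector:
    """Average, over turns, of the concatenated forward/backward hidden states."""
    if not h_fwd or len(h_fwd) != len(h_bwd):
        raise ValueError("need one forward and one backward state per turn")
    # Concatenate the two directions turn by turn.
    concat = [list(f) + list(b) for f, b in zip(h_fwd, h_bwd)]
    n_turns, dim = len(concat), len(concat[0])
    # Component-wise mean over all turns gives a fixed-size vector.
    return [sum(row[k] for row in concat) / n_turns for k in range(dim)]
```

Because the mean is taken over turns, dialogues of any length map to vectors of the same dimension, which is what the clustering step requires.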
The feature vectors of all dialogues in both L and U can be obtained with this pre-trained model. Define $V_L = \{v^l_1, v^l_2, \cdots\}$ as the feature vector set of L and $V_U = \{v^u_1, v^u_2, \cdots\}$ as the feature vector set of U.
We fit a Spherical k-Means model with $N_c$ clusters to the set $V_L$, obtaining a substitute set of feature vectors $\tilde{V}_L = \{\tilde{v}^l_1, \tilde{v}^l_2, \cdots, \tilde{v}^l_{N_c}\}$, which is composed of $N_c$ representative feature vectors (cluster centres) for the vectors in the original set $V_L$. Then, for each vector $v^u_i$ in $V_U$, we calculate its cosine similarity to each of the $N_c$ vectors in $\tilde{V}_L$ and take the maximum of the $N_c$ similarity values as its similarity to the whole labelled set, since the cluster of maximum similarity is the most representative of the original vectors in the labelled set. Therefore, the diversity measure D(x) of a dialogue x with feature vector $v^u_i$ can be defined as the negative of this similarity:
$$D(x) = -\max_{1 \le j \le N_c} \cos(v^u_i, \tilde{v}^l_j).$$
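The clustering-based diversity measure can be sketched in pure Python. This is an illustrative implementation under stated assumptions, not the authors' code: the random initialisation, the fixed iteration count and the handling of empty clusters are choices made here, and all function names are hypothetical.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]


def normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length so that dot products are cosines."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        raise ValueError("the zero vector has no direction")
    return [a / norm for a in v]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(vectors: Sequence[Sequence[float]], n_clusters: int,
                     n_iter: int = 20, seed: int = 0) -> List[Vector]:
    """Cluster unit vectors by cosine similarity; return unit-length centroids."""
    rng = random.Random(seed)
    points = [normalize(v) for v in vectors]
    centroids = rng.sample(points, n_clusters)  # assumed initialisation
    for _ in range(n_iter):
        groups: List[List[Vector]] = [[] for _ in range(n_clusters)]
        for p in points:
            # Assign each point to the centroid with the largest cosine.
            best = max(range(n_clusters), key=lambda j: dot(p, centroids[j]))
            groups[best].append(p)
        for j, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous centroid
                total = [sum(col) for col in zip(*members)]
                if any(total):
                    centroids[j] = normalize(total)
    return centroids


def diversity(v: Sequence[float], centroids: Sequence[Vector]) -> float:
    """Negative of the maximum cosine similarity to any labelled-pool centroid."""
    u = normalize(v)
    return -max(dot(u, c) for c in centroids)
```

Dialogues whose feature vectors point away from every centroid of the labelled pool receive the highest diversity values and would be queried first under DS.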

Mixed Strategy
In practice, we usually need different query strategies at different learning stages (Settles, 2010). Based on the strategies presented above, we propose a new query strategy called Cost-Uncertainty-Diversity Strategy (CUDS), which originates from the idea of combining multiple strategies. This strategy takes three measurement criteria into consideration, namely cost, uncertainty and diversity, so that the unlabelled samples can be evaluated more comprehensively. Specifically, we want to select samples with low cost, high uncertainty and high diversity. Based on this, we propose a new measurement criterion, denoted as M(x). The goal of CUDS is to pick out the dialogue samples that have the maximum measurement value M(x):
$$M(x) = -\alpha\, C(x) + \beta\, U_s(x) + \gamma\, D(x),$$
where α, β and γ are positive weighting parameters that can be tuned to find a good trade-off among the three measurement criteria. For the weighting to be meaningful, C(x), $U_s(x)$ and D(x) should have the same scale. C(x) ranges from 1 to $C_{max}$, the maximum number of a single dialogue's turns, and therefore we replace
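A minimal sketch of how such a combined criterion could be computed is given below. It assumes that M(x) is a weighted sum in which cost enters negatively and uncertainty and diversity enter positively, as the stated goal of low cost, high uncertainty and high diversity suggests. The rescaling used here (cost divided by an assumed maximum `c_max`, entropy divided by its upper bound `log(n_values)`, diversity left in [-1, 1]) is an assumption of this example, not the paper's normalization, and all names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# (cost in turns, uncertainty, diversity) for one unlabelled dialogue
Candidate = Tuple[int, float, float]


def mixed_score(cost: int, uncertainty: float, diversity: float,
                c_max: int, n_values: int,
                alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """Weighted combination: low cost, high uncertainty, high diversity score best."""
    if c_max < 1 or n_values < 2:
        raise ValueError("need c_max >= 1 and at least two slot values")
    cost_scaled = cost / c_max                             # assumed rescaling to (0, 1]
    uncertainty_scaled = uncertainty / math.log(n_values)  # assumed rescaling to [0, 1]
    return -alpha * cost_scaled + beta * uncertainty_scaled + gamma * diversity


def select_top_k(pool: Sequence[Candidate], k: int, c_max: int, n_values: int,
                 **weights: float) -> List[int]:
    """Indices of the k candidates with the largest mixed score."""
    scores = [mixed_score(c, u, d, c_max, n_values, **weights) for c, u, d in pool]
    return sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)[:k]
```

Setting two of the three weights to zero recovers a ranking driven by a single criterion, which is a convenient way to compare the mixed strategy against its components.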

Experimental Setup
Experiments are conducted to assess the performance of different query strategies on a single slot and on the joint goal respectively. The dataset we use is the second Dialogue State Tracking Challenge (DSTC2) dataset (Henderson et al., 2014a), which focuses on the restaurant information domain and contains 7 slots, of which 4 are informable and all 7 are requestable. We implement the dialogue state tracker as described in Section 2.1. The model is trained using Stochastic Gradient Descent (SGD), combined with a gradient clipping heuristic (Pascanu et al., 2012) to avoid the exploding gradient problem.

Results on Single Slot
In this section, five different query strategies are compared on a single slot. Besides the four query strategies presented in Section 2.2, we choose Random Strategy (RS) as our baseline query strategy. RS means that we randomly select dialogues to annotate. Although it may seem simple, this naive strategy performs reasonably well in practice. We attribute this to the fact that its query process follows the underlying distribution of the original dataset. A property of AL called sampling bias (Dasgupta and Hsu, 2008) can be considered the main cause: the training set may gradually diverge from the real data distribution as the training and querying process continues. RS is not affected by this effect, which makes it a strong baseline to compare with.
At each query step, the model selects 2 dialogue samples according to the current strategy. Figure 2 displays the training accuracy curves of the food slot (the most difficult slot) using different query strategies. Here we use the number of labelled dialogue turns as the x-axis, which can be regarded as the labelling cost. The three query strategies that are each based on a single measurement criterion (CS/US/DS) perform better than the baseline strategy (RS). DS does not perform very well at the beginning because the diverse but very scarce data is not sufficient to train an effective model. Our proposed mixed strategy CUDS achieves the best performance among all the query strategies, which supports the effectiveness of our strategy mixture methodology. Considering the training cost, although the DSTC2 training set is composed of 11677 turns (1612 dialogues

Results on Joint Goal
Figure 3 displays the training accuracy curves of the joint goal using five different query strategies. The best-performing query strategy differs across learning stages. US rises more quickly at the beginning while DS converges earlier.
The reasons are as follows: in order to finally reach convergence, the tracker needs to see samples of great diversity, which allows it to take several semantic slots into account; however, samples of large entropy give the tracker more concrete information on the most controversial cases, which helps it to learn rapidly from scratch. Our mixed strategy CUDS, combining the advantages of those two while minimizing the cost at the same time, obtains a performance improvement. Although the final convergence level is not very high due to the limitations of the LecTrack model, this does not diminish the effectiveness of the proposed AL query strategies.

Conclusions and Future Work
In this paper, a novel cost-sensitive active learning technique is presented for dialogue state tracking, which assumes that each dialogue sample has a non-uniform labelling cost related to its number of dialogue turns. Besides cost, we also provide two further measurement criteria, uncertainty and diversity. Our mixed query strategy considers these three criteria jointly in order to make more appropriate queries. Experimental results demonstrate that our proposed approaches can achieve promising tracking performance at a lower labelling cost than traditional methods.
Our future work includes two parts. One is to deploy our proposed AL methods on other dialogue tasks such as DSTC3 (Henderson et al., 2014b) to verify the results presented in this paper. The other is to apply our approaches to DST models of better performance (Mrkšić et al., 2017), because the model's tracking ability inevitably influences the whole active learning process.