Hybrid Dialog State Tracker with ASR Features

This paper presents a hybrid dialog state tracker enhanced by trainable Spoken Language Understanding (SLU) for slot-filling dialog systems. Our architecture is inspired by previously proposed neural-network-based belief-tracking systems. In addition, we extended some parts of our modular architecture with differentiable rules to allow end-to-end training. We hypothesize that these rules allow our tracker to generalize better than pure machine-learning-based systems. For evaluation, we used the Dialog State Tracking Challenge (DSTC) 2 dataset, a popular belief-tracking testbed with dialogs from a restaurant information system. To our knowledge, our hybrid tracker sets a new state-of-the-art result in three out of four categories within DSTC2.


Introduction
A belief-state tracker is an important component of dialog systems; its responsibility is to predict the user's goals based on the history of the dialog. Belief-state tracking was extensively studied in the Dialog State Tracking Challenge (DSTC) series (Williams et al., 2016), which provides a shared testbed for various tracking approaches. The DSTC abstracts away the subsystems of end-to-end spoken dialog systems, focusing only on dialog state tracking. It does so by providing datasets of Automatic Speech Recognition (ASR) and Spoken Language Understanding (SLU) outputs with reference transcriptions, together with annotation on the level of dialog acts and user goals on slot-filling tasks, where the dialog system tries to fill predefined slots with values from a known ontology (e.g. the value moderate for the pricerange slot).
In this work we improve state-of-the-art results on DSTC2 (Henderson et al., 2014a) by combining two central ideas previously proposed in different successful models: 1) a machine-learning core with hand-coded rules, an idea already explored by Yu et al. (2015) and Vodolán et al. (2015), with 2) a complex neural-network-based architecture that processes ASR features, proposed by Henderson et al. (2014b). Their network consists of two main units. One unit handles generic behaviour that is independent of the actual slot value, and the other depends on the slot value and can account for common confusions.
Compared to the model of Henderson et al. (2014b), which inspired our work: 1) our model does not require auto-encoder pre-training or shared initial training on all slots, which makes training easier; 2) our approach combines a rule-based core of the tracker with RNNs, while their model used only RNNs; 3) we use a different NN architecture to process SLU features.
In the next section we describe the structure of our model; after that, we detail how we evaluated the model on the DSTC2 dataset. We close the paper with a section on the lessons we learned.

Hybrid dialog state tracker model
The tracker operates separately on the probability distribution for each slot. Each turn, the tracker generates these distributions to reflect the user's goals based on the last action of the machine, the observed user actions, the probability distributions from the previous turn and an internal hidden state. The probability distribution $h_t^s[v]$ is a distribution over all possible values $v$ from the domain of slot $s$ at dialog turn $t$. The joint belief state is represented by a probability distribution over the Cartesian product of the individual slot domains.
Figure 1: The structure of the Hybrid tracker at turn $t$. It is a recurrent model that uses the probability distribution $h_{t-1}^s$ and hidden state $l_{t-1}^s$ from the previous turn (recurrent information flow is depicted by dashed blue lines). Inputs of the machine-learned part of the model (represented by functions G and F based on recurrent L) are the turn and value features $f_t$, $f_v$ and the hidden state. The features are used to produce transition coefficients $a$ for the R function, which transforms the output of the SLU $u_t^s$ into belief $h_t^s$.
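To make the joint belief state over the Cartesian product of slot domains concrete, the following is a minimal sketch. It assumes that the per-slot distributions are combined as independent marginals (a product of per-slot probabilities); the text does not state how the joint distribution is obtained, so this is an illustrative assumption, and the function name `joint_belief` and the example values are hypothetical.

```python
from itertools import product


def joint_belief(slot_beliefs):
    """Combine per-slot distributions into one joint distribution.

    slot_beliefs maps a slot name to a dict {value: probability}.
    ASSUMPTION: slots are treated as independent, so the probability of a
    combination of values is the product of the per-slot probabilities.
    """
    slots = sorted(slot_beliefs)
    joint = {}
    # Iterate over the Cartesian product of the individual slot domains.
    for combo in product(*(slot_beliefs[s].items() for s in slots)):
        assignment = tuple((s, value) for s, (value, _) in zip(slots, combo))
        prob = 1.0
        for _, p in combo:
            prob *= p
        joint[assignment] = prob
    return joint


# Hypothetical per-slot beliefs for two slots.
beliefs = {
    "food": {"None": 0.2, "italian": 0.8},
    "area": {"None": 0.5, "north": 0.5},
}
joint = joint_belief(beliefs)
```

Each key of `joint` is a tuple of (slot, value) pairs, one pair per slot, so the joint distribution has one entry per element of the Cartesian product.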
In the following notation $i_t^s$ denotes a user action pre-processed into a probability distribution of informed values for the slot $s$ and turn $t$. During the pre-processing, every Affirm() from the SLU is transformed into Inform(s=v) depending on the machine action of the turn. The $f_t$ denotes turn features consisting of unigrams, bigrams, and trigrams extracted from the ASR hypotheses N-best list. They are weighted by the probability of the corresponding hypothesis on the N-best list. The same approach is used in Henderson et al. (2014b). To make our system comparable to the best-performing tracker (Williams, 2014) we also included features from batch ASR (recognition hypotheses and the unigram word-confusion matrix). The batch ASR hypotheses are encoded in the same way as hypotheses from the regular ASR. The confusion matrix information is encoded as weighted unigrams. The last part of the turn features encodes machine-action dialog acts. We use a trigram-like encoding dialogact-slot-value with weight 1.0. The other features are value features $f_{v_i}$, created from those turn features that contain an occurrence of $v_i$ by replacing occurrences of the value $v_i$ and the slot name $s$ with a common tag (inform-food-italian → inform-<slot>-<value>). This technique is called delexicalization by Henderson et al. (2014b).
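The feature extraction described above can be illustrated with a short sketch. It assumes whitespace tokenization, single-token slot values, and that weights of an n-gram occurring in several hypotheses are summed; none of these details are specified in the text, and the function names `ngram_features` and `delexicalize` are hypothetical.

```python
from collections import defaultdict


def ngram_features(nbest, max_n=3):
    """Weighted n-gram features from an ASR N-best list.

    nbest is a list of (hypothesis_text, probability) pairs. Each distinct
    n-gram of a hypothesis contributes that hypothesis' probability once.
    ASSUMPTION: contributions from different hypotheses are summed.
    """
    feats = defaultdict(float)
    for text, prob in nbest:
        tokens = text.split()
        ngrams = set()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                ngrams.add(" ".join(tokens[i:i + n]))
        for ngram in ngrams:
            feats[ngram] += prob
    return dict(feats)


def delexicalize(feats, slot, value):
    """Value features: keep n-grams that mention the value and replace the
    value and the slot name by the common tags <value> and <slot>."""
    out = defaultdict(float)
    for ngram, weight in feats.items():
        tokens = ngram.split()
        if value not in tokens:
            continue
        tagged = ["<value>" if t == value else "<slot>" if t == slot else t
                  for t in tokens]
        out[" ".join(tagged)] += weight
    return dict(out)


# Hypothetical two-best list for a single user turn.
nbest = [("italian food", 0.7), ("italian foot", 0.3)]
turn_feats = ngram_features(nbest)
value_feats = delexicalize(turn_feats, slot="food", value="italian")
```

In this toy example the bigram "italian food" becomes the value feature "<value> <slot>", which is shared by every value of the food slot and therefore lets the model generalize across values.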
From a high-level perspective, our model consists of a rule-based core represented by a function R that specifies how the belief state evolves based on new observations. The rules R depend on the output of the machine-learned SLU and on transition coefficients $a_{v_i,v_j}$ that specify how easy it would be to override a previously internalized slot value $v_j$ with a new value $v_i$ in the given situation. The $a_{v_i,v_j}$ transition coefficients are computed as a sum of functions F and G, where F accounts for generic value-independent behavior, which can, however, be corrected by the value-dependent function G. The structure of the tracker is shown in Figure 1.
In the next subsection, we will describe the rule-based component of the Hybrid tracker. Afterwards, in Section 2.2, we will describe the machine-learned part of the tracker followed by the description of the trainable SLU in Section 2.3.

Rule-based part
The rule-based part of our tracker, inspired by Vodolán et al. (2015), is specified by a function R(h s t−1 , u s

Machine-learned part
The machine-learned part modulates the behavior of the rule-based part R by transition coefficients (Vodolán et al., 2015). However, our computation of the coefficients involves two different functions, F and G, whose sum gives the coefficient $a_{v_i,v_j}$. The function F controls the generic behavior of the tracker and does not take into account any features of $v_i$ or $v_j$. The function G, on the other hand, provides value-dependent corrections to the generic behavior described by F.
Value Independent Model. F is specified as in Equation (4), where F takes the values of $c_{new}$ and $c_{override}$ from a function L. The function $c_{new}, c_{override}, l_t = L(l_{t-1}, f_t)$ is a recurrent function that takes its hidden state vector $l_{t-1}$ from the previous turn and the turn features $f_t$ as input and outputs two scalars $c_{new}$, $c_{override}$ and a new hidden state $l_t$. These scalar values can be interpreted as follows:
• $c_{new}$ describes how easy it would be to change the belief from hypothesis None to an instantiated slot value,
• $c_{override}$ models a goal change, that is, how easy it would be to override the current belief with a new observation.
In our implementation, L is formed by 5 LSTM (Hochreiter and Schmidhuber, 1997) cells with tanh activation. We use a recurrent network for L since it can learn to output different values of the $c$ parameters for different parts of the dialog (e.g., it is more likely that a new hypothesis will arise at the beginning of a dialog). This way, the recurrent network influences the rule-based component of the tracker. The function L uses the turn features $f_t$, which encode information from the ASR, machine actions and the currently tracked slot.
Value Dependent Model.
The function $G(f_t, v_i, v_j)$ corrects the generic behavior of F. G is implemented as a multi-layer perceptron (MLP) with linear activations. The MLP uses turn features $f_t$ together with delexicalized features $f_{v_i}$ for slot value $v_i$. In our implementation the MLP computes a whole vector with values for each $v_k$ at once; in this notation, however, we use just the value corresponding to $v_j$. To stress this we use the restriction operator $|_{v_j}$. The features are processed by a bidirectional LSTM B (with 10 tanh-activated cells), which enables the model to compare the likelihoods of the values in the user utterance. Even though this is not a standard usage of the LSTM, it has proved crucial, especially for estimating the None value, which means that no value from the ontology was mentioned. The other benefit of this architecture is that it can weight its output $u_1$ according to how many ontology values have been detected during turn $t$.

Spoken Language Understanding part
However, not all ontology values can be replaced by tags, because of speech-recognition errors or simply because the ontology representation is not the same as the representation in natural language (e.g. dontcare ~ it does not matter). For this purpose, the model uses a second unit that maps untagged features directly into a value vector $u_2$. Because of its architecture, the unit is able to work only with ontology values seen during training. At the end, the outputs $u_1$, $u_2$ of the two units are summed together and turned into a probability distribution $u$ via softmax. Since all parts of our model (R, F, G, SLU) are differentiable, all parameters of the model can be trained jointly by gradient-descent methods.
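The final combination step of the SLU, summing the two unit outputs and applying a softmax, can be sketched as follows. The score vectors are made-up numbers, and the function names `softmax` and `slu_output` are hypothetical; the units that produce the scores are not modeled here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of real-valued scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def slu_output(u1, u2):
    """Sum the outputs of the two SLU units element-wise and normalize
    the result into a probability distribution over slot values."""
    return softmax([a + b for a, b in zip(u1, u2)])


# Hypothetical scores for the values [None, italian, chinese].
u1 = [0.1, 2.0, -1.0]   # unit working on delexicalized (tagged) features
u2 = [0.0, 0.5, 0.3]    # unit working on untagged features
u = slu_output(u1, u2)
```

Because the two outputs are added before normalization, either unit can raise the score of a value on its own, which is what allows the second unit to cover wordings that the tagged features miss.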

Evaluation
Method. From each dialog in the dstc2_train data (1612 dialogs) we extracted training samples for the slots food, pricerange and area and used all of them to train each tracker. The development data dstc2_dev (506 dialogs) were used to select the $f_t$ and $f_v$ features. We took the 2000 most frequent $f_t$ features and the 100 most frequent $f_v$ features.
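Selecting the most frequent features can be sketched as below. The sketch assumes that frequency means the number of samples in which a feature occurs, which the text does not specify; the function name `select_features` and the toy samples are hypothetical.

```python
from collections import Counter


def select_features(samples, k):
    """Return the k feature names that occur in the most samples.

    samples is an iterable of dicts {feature_name: weight}.
    ASSUMPTION: a feature is counted once per sample, ignoring its weight.
    """
    counts = Counter()
    for feats in samples:
        counts.update(feats.keys())
    return [name for name, _ in counts.most_common(k)]


# Hypothetical turn-feature dicts; the paper keeps 2000 turn features
# and 100 value features, here k is tiny for illustration.
samples = [
    {"italian": 1.0, "food": 0.7},
    {"italian": 0.9},
    {"italian": 0.4, "north": 0.6},
    {"north": 1.0},
]
kept = select_features(samples, k=2)
```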
The cost that we optimized consists of a tracking cost, which is computed as a cross-entropy between a belief state $h_t^s$ and a goal annotation, and of an SLU cost, which is a cross-entropy between the output of the SLU $u_t^s$ and a semantic annotation. We did not use any regularization on model parameters. We trained the model for 30 epochs by SGD with the AdaDelta (Zeiler, 2012) weight-update rule and batch size 16 on fully unrolled dialogs. We use the model from the best iteration according to the error rate on dstc2_dev. The evaluated model was an ensemble of the 10 best trackers (according to the tracking accuracy on dstc2_dev) selected from 62 trained trackers. All trackers used the same training settings and differed only in their initial parameter weights. Our tracker did not track the name slot because there are no training data available for it. Therefore, we always set the value of the name slot to None.
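The two-part cost can be written down for a single turn and slot as in the sketch below. It assumes that the tracking cost and the SLU cost are added with equal weight and that each annotation is a single target value; the text only says that the cost consists of both terms. The names `cross_entropy` and `turn_cost` are hypothetical.

```python
import math


def cross_entropy(predicted, target_value, eps=1e-12):
    """Cross-entropy between a predicted distribution {value: prob} and a
    one-hot annotation given by target_value."""
    return -math.log(max(predicted.get(target_value, 0.0), eps))


def turn_cost(belief, goal, slu_out, semantic_label):
    """Tracking cost plus SLU cost for one slot and one turn.

    ASSUMPTION: the two terms are summed with equal weight.
    """
    return cross_entropy(belief, goal) + cross_entropy(slu_out, semantic_label)


# Hypothetical belief state and SLU output for the food slot.
belief = {"None": 0.5, "italian": 0.5}
slu_out = {"None": 0.75, "italian": 0.25}
cost = turn_cost(belief, "italian", slu_out, "italian")
```

Summing this cost over all turns of a fully unrolled dialog gives a quantity that can be minimized by a gradient-based optimizer, since every term depends on the model parameters only through differentiable operations.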
Results. This section briefly summarizes the results of our tracker on dstc2_test (1117 dialogs) in all DSTC2 categories, as can be seen in Table 1. We also evaluate the tracker without specific components to measure their contribution to the overall accuracy.
In the standard categories using Batch ASR and ASR features, we set new state-of-the-art results. In the category without ASR features (SLU only) our tracker is slightly behind the best tracker (Lee and Stent, 2016).
For completeness, we also evaluated our tracker in the "non-standard" category that involves trackers using test data for validation. This setup was proposed in Henderson et al. (2014a), where an ensemble was trained from all DSTC2 submissions. However, this methodology precludes a direct comparison with the other categories since it can overfit to the test data. Our tracker in this category is a weighted averaging ensemble of trackers trained for the categories with ASR and batch ASR.
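A weighted averaging ensemble of belief distributions can be sketched as follows. How the weights were chosen is not preserved in the text, so the weights below are purely illustrative, and the function name `ensemble_belief` is hypothetical.

```python
def ensemble_belief(beliefs, weights=None):
    """Weighted average of several belief distributions over one slot.

    beliefs is a list of dicts {value: probability}, one per tracker.
    weights defaults to uniform weights (a plain average).
    """
    if weights is None:
        weights = [1.0] * len(beliefs)
    total = sum(weights)
    values = set().union(*beliefs)  # all values seen by any tracker
    return {
        v: sum(w * b.get(v, 0.0) for w, b in zip(weights, beliefs)) / total
        for v in values
    }


# Hypothetical outputs of two trackers with illustrative weights.
tracker_a = {"italian": 0.9, "None": 0.1}
tracker_b = {"italian": 0.5, "None": 0.5}
combined = ensemble_belief([tracker_a, tracker_b], weights=[3.0, 1.0])
```

Dividing by the sum of the weights keeps the result a probability distribution whenever each input is one.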
We also tested the contribution of the specialization components G and M by training new ensembles of models without those components. The accuracy of these ensembles can be seen in Table 1. The results show that removing either of the components hurts the performance in a similar way.
In the last part of the evaluation we studied the importance of the bidirectional LSTM layer B by ensembling models that use a linear layer instead. The table shows a significant drop in accuracy, indicating that B is a crucial part of our model.

Lessons learned
Originally we designed the special SLU unit M with a sigmoid activation inspired by the architecture of Henderson et al. (2014b). However, we found it difficult to train because gradients propagated poorly through that layer, causing its output to resemble priors of ontology values rather than probabilities of informing some ontology value based on the corresponding ASR hypotheses, as suggested by the network hierarchy. The problem resulted in an inability to learn alternative wordings of ontology values, which are often present in the training data. One such example is "asian food", which appears 16 times in the training data as part of the best ASR hypothesis, and in 13 of those cases it really informs about the "asian oriental" ontology value. Measurements on dstc2_dev have shown that the SLU was never able to recognize this alias. We solved this training issue by replacing the sigmoid activation of the special SLU unit with a linear activation. The resulting SLU is able to recognize common alternative wordings such as "asian food", which appears more than 10 times in the training data, as well as rare alternatives like "anywhere" (meaning area:dontcare), which appears only 5 times in the training data.
Table 1 (excerpt; only the rows below survive, each with two numeric result columns, and the values of the last row are missing):
Henderson et al. (2014b)  .737  .406
Knowledge-based tracker (Kadlec et al., 2014)  .737  .429
√ Sun et al. (2014)  .735  .433
Smith (2014)  .729  .452
Lee et al. (2014)  .726  .427
YARBUS (Fix and Frezza-buet, 2015)

Conclusion
We have presented an end-to-end trainable belief tracker with a modular architecture enhanced by differentiable rules. Our tracker outperforms other approaches in almost all standard DSTC categories without large modifications, which makes it successful in a wide range of input-feature settings.