Using phone features to improve dialogue state tracking generalisation to unseen states

The generalisation of dialogue state tracking to unseen dialogue states can be very challenging. In a slot-based dialogue system, dialogue states lie in discrete space where distances between states cannot be computed. Therefore, the model parameters to track states unseen in the training data can only be estimated from more general statistics, under the assumption that every dialogue state will have the same underlying state tracking behaviour. However, this assumption is not valid. For example, two values, whose associated concepts have different ASR accuracy, may have different state tracking performance. Therefore, if the ASR performance of the concepts related to each value can be estimated, such estimates can be used as general features. The features will help to relate unseen dialogue states to states seen in the training data with similar ASR performance. Furthermore, if two phonetically similar concepts have similar ASR performance, the features extracted from the phonetic structure of the concepts can be used to improve generalisation. In this paper, ASR and phonetic structure-related features are used to improve the dialogue state tracking generalisation to unseen states of an environmental control system developed for dysarthric speakers.


Introduction
Dialogue state tracking (DST) (Thomson and Young, 2010) is a key component of spoken interfaces for electronic devices. It maps the dialogue history up to the current dialogue turn (spoken language understanding (SLU) output, actions taken by the device, etc.) to a probabilistic representation over the set of dialogue states, called the belief state (Young et al., 2013). This representation is the input later used by the dialogue policy to decide the next action to take (Williams and Young, 2007; Gašić and Young, 2014; Geist and Pietquin, 2011). In the Dialogue State Tracking Challenges (DSTC), it was shown that data-driven discriminative models for DST outperform generative models in the context of a slot-based dialogue system. However, generalisation to unseen dialogue states (e.g. changing the dialogue domain or extending it) remains an issue. The 3rd DSTC (Henderson et al., 2014b) evaluated state trackers in extended domains by including dialogue states not seen in the training data in the evaluation data. This challenge showed how difficult it is for data-driven approaches to generalise to unseen states, as several machine-learned trackers were outperformed by the rule-based baseline. Data-driven state trackers with slot-specific models cannot handle unseen states. Therefore, general state trackers track each value independently using general value-specific features (Henderson et al., 2014c; Mrksic et al., 2015). However, dialogue states are by definition in a discrete space where similarities cannot be computed. Thus, a general state tracker has to include a general value-tracking model that can combine the statistics of all dialogue states. This strategy assumes that different dialogue states have the same state tracking behaviour, but this assumption is rarely true. For example, two values whose associated concepts have different ASR accuracy have different state tracking performance.
A general feature able to define similarities between dialogue states would improve state tracking generalisation to unseen states, as the new values could be tracked using statistics learned from the most similar states seen in the training data.
Dialogue management was shown to improve the performance of spoken control interfaces personalised to dysarthric speakers (Casanueva et al., 2015). For this type of interface (e.g. homeService), the user interacts with the system using single-word commands (see footnote 2). Each slot-value in the system has its associated command. It is a reasonable assumption that two dialogue states or values associated with commands of similar ASR accuracy will also have similar DST performance. If the ASR performance of commands can be estimated (e.g. on a held-out set of recordings), the measure can be used as a general feature to help the state tracker relate unseen dialogue states to similar states seen in the training data.
However, a held-out set of recordings can be costly to obtain. If it is assumed that phonetically similar commands will have similar recognition rates, general features extracted from the phonetic structure of the commands can be used instead. For example, the ASR can find "problematic phones", i.e. phones or phone sequences that are consistently misrecognised. The state tracker can then learn to detect such problematic phones and adapt its dialogue state inference to their presence. If an unseen dialogue state that contains these phone patterns is tracked, the state tracker can infer the probability of that state more accurately. Using the phonetic structure of the command as an additional feature for state tracking can be interpreted as moving from state tracking in the "command space", where similarities between dialogue states cannot be computed, to state tracking in the "phone space", where those similarities can be estimated.
In this paper, we propose a method to use ASR and phone-related general features to improve the generalisation of a Recurrent Neural Network (RNN) based dialogue state tracker to unseen states. In the next section, state-of-the-art methods for generalised state tracking are described. The following section describes the proposed ASR and phone-related features as well as different approaches to encode variable length phone sequences into fixed length vectors. Section 4 describes the experimental set-up. Sections 5 and 6 present results and conclusions.
Footnote 2: Severe dysarthric speakers cannot articulate complete sentences.

Generalised dialogue state tracking
In slot-based dialogue state tracking, the ontology defines the set of slots $S$ and the set of possible values $V_s$ for each slot. A dialogue state tracker is hence a classifier, where classes correspond to the joint dialogue states. However, slot-based trackers often factorise the joint dialogue state into slots and therefore use a classifier to track each slot independently (Lee, 2013). The classes are then the set of values $V_s$ for that slot. The joint dialogue state is computed by multiplication and renormalisation of the individual probabilities for each slot. Even if the factorisation of the dialogue state helps to generalise by reducing the number of effective dialogue states or values to track, state trackers trained for specific slots are not able to generalise to unseen values, as they learn the specific statistics of each slot and value. State trackers able to generalise to unseen values track the probability of each value independently using value-specific general features, such as the confidence score of the concept associated with that value in the SLU output (Henderson et al., 2014d).

Rule based state tracking
Rule-based state trackers (Wang and Lemon, 2013; Sun et al., 2014b) use slot-value independent rules to infer the probability of each dialogue state. An example of such a rule is the sum of the confidence scores of the concept related to that value, or of the answers confirming that the value is correct. Rule-based methods show competitive performance when evaluated in new or extended domains, as was demonstrated in the 3rd DSTC. However, their low adaptability can reduce the performance in domains that are challenging for ASR.

Slot-value independent data-driven state tracking
In the first two DSTCs, most of the data-driven approaches to dialogue state tracking learned specific statistics for each slot and value (Lee, 2013; Williams, 2014). However, in some cases (Lee and Eskenazi, 2013), parameter tying was used across slot models, thereby assuming that the statistics of two slots can be similar. The 3rd DSTC addressed domain extension, so state trackers able to generalise to unseen dialogue states had to be developed. One of the most successful approaches (Henderson et al., 2014d) combined the output of two RNN trackers: one represented slot-specific statistics and the other modelled slot-value independent general statistics. Later, Mrksic et al. (2015) modified this model to be able to track the dialogue state in completely different domains by using only the general part of the model of Henderson et al. (2014d). The slot-value independent model (shown in Fig. 1) comprises a set of binary classifiers or value filters, one for each slot-value pair, with parameters shared across all filters. These filters track each value independently, and the output distribution of slot $s$ in each turn is obtained by concatenating the outputs $g_v^t$ of the value filters for all $v \in V_s$ and applying a softmax function. The filters differ from each other in only two aspects: the input, composed of value-specific general features (also called delexicalized features), and the label used during training. An RNN-based general state tracker updates the probability $p_v^t$ of each value in each turn $t$ according to Eq. (1), where $h_v^t$ is the hidden state of each filter, $x_v^t$ are the value-specific inputs and $W_x$, $W_h$, $b_h$, $w_g$ and $b_g$ are the parameters of the model.
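The parameter sharing across value filters can be illustrated with a minimal NumPy sketch. It assumes a plain sigmoid recurrent layer followed by a linear output and a softmax over the values of the slot; these activation choices, the function name `track_slot` and the toy dimensions are assumptions made for illustration, not the exact model of the cited work.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(g):
    e = np.exp(g - np.max(g))
    return e / e.sum()

def track_slot(x_seq, W_x, W_h, b_h, w_g, b_g):
    """Run one filter per value with shared parameters (assumed recurrence).

    x_seq has shape (turns, n_values, n_features): one delexicalised
    feature vector per value and turn. Returns one distribution over
    the values of the slot per turn, shape (turns, n_values).
    """
    n_turns, n_values, _ = x_seq.shape
    h = np.zeros((n_values, W_h.shape[0]))  # one hidden state per filter
    beliefs = []
    for t in range(n_turns):
        # The same W_x, W_h, b_h are applied to every value filter.
        h = sigmoid(x_seq[t] @ W_x.T + h @ W_h.T + b_h)
        g = h @ w_g + b_g            # one scalar output per filter
        beliefs.append(softmax(g))   # normalise across the slot's values
    return np.array(beliefs)

rng = np.random.default_rng(0)
n_features, n_hidden = 4, 8          # toy sizes, chosen arbitrarily
params = dict(
    W_x=rng.normal(size=(n_hidden, n_features)),
    W_h=rng.normal(size=(n_hidden, n_hidden)),
    b_h=np.zeros(n_hidden),
    w_g=rng.normal(size=n_hidden),
    b_g=0.0,
)
beliefs = track_slot(rng.random((3, 5, n_features)), **params)
# The same parameters can track a slot with a different number of values.
beliefs_new_slot = track_slot(rng.random((2, 9, n_features)), **params)
```

Because no parameter depends on the identity or the number of values, the same filter can be applied to values that never appeared in training, which is the property the general tracker relies on.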

ASR and phone-related general features
The model explained in section 2.2 works with value-specific general features $x_v^t$ (e.g. the confidence score seen for that particular value in that turn). These features do not help to relate dialogue states with similar state tracking performance, so the model has to learn the mean statistics from all the states. However, different values have different state tracking performance. Features that give information about the ASR performance, or that can be used to relate the state tracking performance of values seen in the training data to unseen states, should allow the tracker to generalise to new dialogue states. In the following sections, we introduce various features that can improve generalisation.

ASR features
In a command-based environmental control system, if recordings of the commands related to the unseen dialogue states are available, they can be used to estimate the ASR performance for the new commands. Then, the value-specific features for each filter can be extended by concatenating the ASR accuracy of that specific value. When the tracker faces a value not seen in the training data, it can improve the estimation of the probability of that value by using the statistics learnt from values with similar ASR performance.
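As a rough illustration of this idea, the sketch below estimates a per-command accuracy from held-out (reference, top hypothesis) pairs and appends it to a filter's feature vector. The function names, the toy recordings and the use of top-1 accuracy are assumptions made for the example.

```python
from collections import defaultdict

def per_command_accuracy(pairs):
    """Estimate top-1 ASR accuracy per command.

    pairs is an iterable of (reference command, top ASR hypothesis).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ref, hyp in pairs:
        total[ref] += 1
        correct[ref] += int(ref == hyp)
    return {cmd: correct[cmd] / total[cmd] for cmd in total}

def extend_features(x, command, accuracy):
    """Concatenate the accuracy estimate of a value's command to its features."""
    return list(x) + [accuracy[command]]

# Toy held-out recognition results (hypothetical).
held_out = [
    ("radio", "radio"), ("radio", "video"),
    ("light", "light"), ("light", "light"),
]
accuracy = per_command_accuracy(held_out)
features = extend_features([0.7, 0.1], "radio", accuracy)
```

The extended vector carries the same value for a given command in every turn, so the tracker can condition its behaviour on how reliable the recogniser is for that command.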

Phone related features
In the previous section, accuracy estimates were proposed to improve general state tracking accuracy. However, these features would have to be inferred from a held-out set of word recordings, which may not always be available. In order to avoid this requirement, the phonetic structure of the commands can be used to find similarities between dialogue states with similar ASR performance. The phonetic structure of the commands can be seen as a space composed of subunits of the commands, where similarities between states can be computed.
Phone-related features can be extracted in several ways. A deep neural network trained jointly with the ASR can be used to extract a sequence of phone posterior features, one vector per speech frame (Christensen et al., 2013b). Another way is to use a pronunciation dictionary to decompose the output of the ASR into sequences of phones. The latter method can also be used to extract a "phonetic fingerprint" of the associated value for each filter. For example, a filter which is tracking the value "RADIO" would have the sequence of phones "r-ey-d-iy-ow" as its phonetic fingerprint.
In each dialogue turn, these features are based on sequences of different length. In the case of the ASR phone posteriors, the sequence length is equal to the number of speech frames. When using a pronunciation dictionary, the length is equal to the number of phonemes in the command. However, in each dialogue turn, a fixed length vector has to be provided as input to the tracker. Thus, a method to transform these sequences into fixed length vectors is needed. A straightforward method is to compute the mean vector of the sequence, thereby losing the phone order information. In addition, the number of phones in the sequence would affect the value of each phone in the mean vector. To compress these sequences into fixed length vectors while maintaining the ordering and the phone length of the sequence, we propose to use an RNN encoder. We propose two ways to train this encoder: jointly with the model, and with a large pronunciation dictionary.
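The two drawbacks of the mean vector can be checked on a toy example. The sketch below uses a five-phone inventory built from the "RADIO" fingerprint (the paper works with 45 phones); the reordered and shortened sequences are made up for illustration.

```python
import numpy as np

PHONES = ["r", "ey", "d", "iy", "ow"]  # toy inventory for illustration

def one_hot_sequence(phones):
    """Turn a list of phone symbols into a (length, n_phones) one-hot matrix."""
    seq = np.zeros((len(phones), len(PHONES)))
    for i, p in enumerate(phones):
        seq[i, PHONES.index(p)] = 1.0
    return seq

radio = one_hot_sequence(["r", "ey", "d", "iy", "ow"])
shuffled = one_hot_sequence(["ow", "d", "r", "iy", "ey"])  # same phones, new order
short = one_hot_sequence(["r", "ey"])                      # shorter sequence

mean_radio = radio.mean(axis=0)
mean_shuffled = shuffled.mean(axis=0)
mean_short = short.mean(axis=0)

# Order is lost: both orderings collapse to the same mean vector.
same_mean = np.allclose(mean_radio, mean_shuffled)
# Length leaks into the values: "r" weighs 0.2 in one case and 0.5 in the other.
r_weights = (mean_radio[0], mean_short[0])
```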

Joint RNN phone encoder
The state of an RNN is a vector representation of all the previous sequence inputs seen by the model. Therefore, the final state after processing a sequence can be seen as a fixed length encoding of the sequence. If this encoding is fed to the filters of the state tracker (Fig. 2), the tracker and the encoder can be trained jointly using backpropagation. We propose to concatenate the encoding of the phonetic sequence in each turn with the value-specific features $x_v^t$ for each filter, as shown in Fig. 2. This defines a structure with two stacked RNNs, one encoding the phonetic sequences per turn and the other processing the sequence of dialogue turns.
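The forward pass of such an encoder can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions: a plain tanh recurrence with random, untrained weights, an encoding size of 15 and hypothetical phone indices; the joint training by backpropagation is not shown.

```python
import numpy as np

def encode_sequence(seq, U_x, U_h, b):
    """Return the final state of a simple tanh RNN run over seq."""
    h = np.zeros(U_h.shape[0])
    for frame in seq:
        h = np.tanh(U_x @ frame + U_h @ h + b)
    return h

def filter_input(x_value, phone_seq, enc_params):
    """Concatenate value-specific features with the sequence encoding."""
    z = encode_sequence(phone_seq, *enc_params)
    return np.concatenate([x_value, z])

rng = np.random.default_rng(0)
n_phones, enc_size = 45, 15  # encoding size is an illustrative choice
enc_params = (
    rng.normal(scale=0.1, size=(enc_size, n_phones)),
    rng.normal(scale=0.1, size=(enc_size, enc_size)),
    np.zeros(enc_size),
)

# Two one-hot phone sequences of different length (hypothetical phone ids).
short_seq = np.eye(n_phones)[[3, 7]]
long_seq = np.eye(n_phones)[[3, 7, 7, 12, 40]]
x_value = np.array([0.6, 0.1, 0.0])

inp_short = filter_input(x_value, short_seq, enc_params)
inp_long = filter_input(x_value, long_seq, enc_params)
# Unlike the mean vector, the final state depends on the phone order.
enc_forward = encode_sequence(np.eye(n_phones)[[3, 7]], *enc_params)
enc_reversed = encode_sequence(np.eye(n_phones)[[7, 3]], *enc_params)
```

Both inputs have the same length regardless of the sequence length, so they can be passed to the turn-level tracker in every turn.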

Seq2seq phone encoder
The need to encode phone sequences into fixed length "dense" representations that allow similarities to be computed resembles the computation of word embeddings (Mikolov et al., 2013). The difference is that word embeddings transform one-hot encodings of words into dense vectors, while in this work we transform sequences of one-hot encodings of phones into dense vectors. Sequence to sequence models (a.k.a. seq2seq models or RNN encoder-decoders) can be used to perform such a task. These models consist of two RNNs: an encoder, which processes the input sequence into a fixed length vector (the final RNN state), and a decoder, which "unrolls" the encoded state into an output sequence (Fig. 3). These models have shown state-of-the-art performance in machine translation tasks, and have been applied to text-based dialogue management with promising results (Lowe et al., 2015; Wen et al., 2016). For the task of generating dense representations of phone sequences, the seq2seq model is trained in a similar way to auto-encoders (Vincent et al., 2008), where input and target sequences are the same, forcing the model to learn to reconstruct the input sequence.
The final state of the encoder RNN (the two-line block in Fig. 3) is taken as the dense representation of the phone sequence. For this task, the Combilex pronunciation dictionary (Richmond et al., 2010) is used to train the model. An RNN composed of two layers of 20 LSTM units is able to reconstruct 95% of the phone sequences in an independent evaluation set. This means compressing sequences of one-hot vectors of size 45 (the number of phones in US English) into a vector of size 20. In Fig. 4, the cosine distance between the dense phone representations of two sets of words of the UASpeech database (see sec. 4.1.1) is plotted, illustrating that these encodings are able to effectively relate words with similar phone composition.
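A pairwise cosine distance matrix of the kind plotted in such a figure could be computed as follows. The three-dimensional vectors are made-up stand-ins for the 20-dimensional encodings, and the helper names are hypothetical.

```python
import numpy as np

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def distance_matrix(encodings):
    """Pairwise cosine distances between named encodings."""
    names = list(encodings)
    matrix = np.array(
        [[cosine_distance(encodings[m], encodings[n]) for n in names]
         for m in names]
    )
    return names, matrix

# Made-up encodings; real ones would come from the trained encoder.
toy_encodings = {
    "radio": [0.9, 0.1, 0.3],
    "video": [0.8, 0.2, 0.4],
    "light": [-0.5, 0.9, 0.0],
}
names, matrix = distance_matrix(toy_encodings)
```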

Experimental setup
The experiments are performed within the context of a voice-enabled control system designed to help speakers with dysarthria to interact with their home devices (Casanueva et al., 2016). The user can interact with the system in a mixed-initiative way, speaking single-word commands from a total set of 36. As the ASR is configured to recognise single words (Christensen et al., 2012), the SLU performs a direct mapping from the ASR output, an N-best list of words, to an N-best list of commands. The dialogue state of the system is factorised into three slots, with the values of the first slot representing the devices to control (TV, light, bluray...), the second slot their functionalities (channel, volume...) and the third slot the actions that these functionalities can perform (up, two, off...). The slots have 4, 17 and 15 values respectively, and the combination of the values of the three slots composes the joint dialogue state or goal (e.g. TV-channel-five, bluray-volume-up). The set of valid joint goals $J$ has a cardinality of 63, and the belief state for each joint goal $j$ is obtained by multiplying the slot probabilities of each of the individual slot values and normalising:

$$P(j) = \frac{\prod_{x=1}^{3} P_{s_x}(j_x)}{\sum_{h \in J} \prod_{x=1}^{3} P_{s_x}(h_x)}$$

where $P_{s_x}(j_x)$ is the probability of the value $j_x$ in slot $s_x$ and $j = (j_1, j_2, j_3)$.
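The multiplication and normalisation over the valid joint goals can be written as a short function. The slot distributions and the three valid goals below are toy values chosen for the example (the value "power" is hypothetical); the real system has 63 valid goals.

```python
def joint_belief(slot_beliefs, valid_goals):
    """Multiply per-slot probabilities and renormalise over valid goals.

    slot_beliefs is a list of dicts (one per slot) mapping value -> probability.
    valid_goals is a list of tuples with one value per slot.
    """
    scores = {}
    for goal in valid_goals:
        p = 1.0
        for slot_dist, value in zip(slot_beliefs, goal):
            p *= slot_dist[value]
        scores[goal] = p
    total = sum(scores.values())
    return {goal: score / total for goal, score in scores.items()}

# Toy slot distributions (devices, functionalities, actions).
slot_beliefs = [
    {"TV": 0.7, "light": 0.3},
    {"channel": 0.6, "volume": 0.3, "power": 0.1},
    {"five": 0.5, "up": 0.3, "off": 0.2},
]
valid_goals = [
    ("TV", "channel", "five"),
    ("TV", "volume", "up"),
    ("light", "power", "off"),
]
belief = joint_belief(slot_beliefs, valid_goals)
top_goal = max(belief, key=belief.get)
```

Normalising only over the valid goals removes the probability mass that the slot product would otherwise assign to invalid combinations.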

Dialogue corpus
One of the main problems in dialogue management research is the lack of annotated dialogue corpora. The corpora released for the first three DSTCs aimed to mitigate this problem. However, these corpora do not include acoustic data. Hence, features extracted from the acoustics, such as phone posteriors, cannot be used. A large part of dialogue management research relies on simulated users (SU) (Georgila et al., 2006; Schatzmann et al., 2007; Thomson et al., 2012) to collect the data needed. The dialogue corpus used in the following experiments has been generated with simulated users interacting with a rule-based dialogue manager. To simulate data collected from dysarthric speakers, a set of 6 SUs with dysarthria has been created. To simulate data in two different domains, two environmental control systems are simulated, each controlled with a different vocabulary of 36 commands. 72 commands are selected from the set of 155 more frequent words in the UASpeech database (Kim et al., 2008) and split into 2 groups, which are named domain A and domain B. 1000 dialogues are collected in each domain. To ensure that the methods work independently of the set of commands selected, 3 different vocabularies of 72 words are randomly selected, and the results presented in the following section are the mean results for the 3 vocabularies.

Simulated dysarthric users
Each SU is composed of a behaviour simulator and an ASR simulator. The behaviour simulator decides on the commands uttered by the SU in each turn. It is rule-based and depending on the machine action, it chooses a command corresponding to the value of a slot or answers a confirmation question. To simulate confusions by the user, it uses a probability of producing a different command, or of providing a value for a different slot than the requested one. The probabilities of confusion vary to simulate different expertise levels with the system. Three different levels are used to generate the corpus to increase its variability.
The ASR simulator generates ASR N-best outputs. These N-best lists are sampled from ASR outputs of commands uttered by dysarthric speakers from the UASpeech database, using the ASR model presented in . To increase the variability of the data generated, the time scale of each recording is modified to 10% and 20% slower and 10% and 20% faster, generating more ASR outputs to sample from. Phone posterior features are generated as described in (Christensen et al., 2013b) without the principal component analysis (PCA) dimensionality reduction. Six different SUs, corresponding to low- and mid-intelligible speakers, are created from the UASpeech database. ASR accuracy on these users ranges from 32% to 60%.

Rule-based state tracker
One of the trackers used as a baseline in the DSTCs (Wang and Lemon, 2013) has been used to collect the corpus. This baseline tracker performed competitively in the three DSTCs, proving its capability to generalise to unseen states. The state tracking accuracy of this tracker is also used as the baseline in the following experiments.

Rule-based dialogue policy
The dialogue policy used to collect the corpus follows simple rules to decide the action to take in each turn. For each slot, if the maximum belief of that slot is below a threshold, the system will ask for that slot's value. If the belief is above that threshold but below a second one, it will confirm the value. If the maximum beliefs of all slots are above the second threshold, it will take the action corresponding to the joint goal with the highest probability. The threshold values are optimised by grid search to maximise the dialogue reward. In addition, the policy implements a stochastic behaviour to induce variability in the collected data, choosing a different action with probability p and requesting the values of the slots in a different order. The corpus is collected using two different policy parameter sets.
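The deterministic part of these rules can be sketched as below. The threshold values (0.4 and 0.8), the action tuples and the choice to handle slots in a fixed order are assumptions made for the example; the stochastic behaviour and the grid search over thresholds are left out.

```python
def choose_action(slot_beliefs, joint_beliefs, ask_threshold, confirm_threshold):
    """Pick the next system action from the slot and joint beliefs.

    slot_beliefs maps slot name -> {value: probability}.
    joint_beliefs maps joint goal -> probability.
    """
    for slot, dist in slot_beliefs.items():
        value, belief = max(dist.items(), key=lambda kv: kv[1])
        if belief < ask_threshold:
            return ("request", slot)          # too uncertain: ask for the slot
        if belief < confirm_threshold:
            return ("confirm", slot, value)   # fairly sure: confirm the value
    # Every slot is above the second threshold: act on the best joint goal.
    goal = max(joint_beliefs, key=joint_beliefs.get)
    return ("execute", goal)

thresholds = dict(ask_threshold=0.4, confirm_threshold=0.8)  # hypothetical values
joint = {("TV", "channel", "five"): 0.9, ("TV", "volume", "up"): 0.1}

action_request = choose_action(
    {"device": {"TV": 0.9, "light": 0.1},
     "function": {"channel": 0.35, "volume": 0.35, "power": 0.3}},
    joint, **thresholds)
action_confirm = choose_action(
    {"device": {"TV": 0.6, "light": 0.4}},
    joint, **thresholds)
action_execute = choose_action(
    {"device": {"TV": 0.9, "light": 0.1},
     "function": {"channel": 0.85, "volume": 0.15}},
    joint, **thresholds)
```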

General LSTM-based state tracker
A general dialogue state tracker, based on the model described in section 2.2, has been implemented. Each value filter is composed of a linear feedforward layer of size 20 and an LSTM (Hochreiter and Schmidhuber, 1997) layer of size 30. Dropout (Srivastava et al., 2014) regularisation is used to reduce overfitting, with a dropout rate of 0.2 in the input connections and 0.5 in the remaining non-recurrent connections. The models are trained for 60 iterations with stochastic gradient descent. A validation set consisting of 20% of the training data is used to choose the parameter set corresponding to the best iteration. Model combination is also used to avoid overfitting. Every model is trained with 3 different seeds, and 5 different parameter sets are saved for each seed: one for the best iteration in the first 20, and then another for the best iteration in each subsequent interval of 10 iterations.
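The checkpoint selection scheme can be made concrete with a small helper. It assumes that "best" means highest validation accuracy and uses 0-indexed iterations; the function name and the synthetic validation curve are made up for the example.

```python
def best_iteration_per_interval(val_acc, first=20, step=10):
    """Return the best iteration in the first `first` iterations and in
    each following interval of `step` iterations (0-indexed)."""
    n = len(val_acc)
    bounds = [(0, min(first, n))]
    bounds += [(start, min(start + step, n)) for start in range(first, n, step)]
    return [max(range(lo, hi), key=lambda i: val_acc[i]) for lo, hi in bounds]

# Synthetic, monotonically improving validation accuracy over 60 iterations.
val_acc = [i / 100 for i in range(60)]
selected = best_iteration_per_interval(val_acc)

n_seeds = 3
n_models_combined = n_seeds * len(selected)  # parameter sets available to combine
```

With 60 iterations this yields the intervals 0-19, 20-29, 30-39, 40-49 and 50-59, i.e. 5 parameter sets per seed and 15 in total across the 3 seeds.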

ASR and phone related general features
In each turn $t$, each value-specific state tracker (filter) takes as input the value-specific input features $x_v^t$. In this model, these correspond to the confidence score of the command related to the specific value, the confidence scores of the meta-commands such as "yes" or "no", and a one-hot encoding of the last system action. In addition, the models are evaluated concatenating the value-specific features $x_v^t$ with the following ASR and phone-related general features $z_v^t$:
• ValAcc: The ASR performance of the command corresponding to the value of the tracker can be used as a general feature. In this paper, the accuracy per command is used, defining $z_v^t$ as the estimated ASR accuracy of the value $v$.
• PhSeq: A weighted sequence of phones is generated from the ASR output (N-best list of commands) as follows. A pronunciation dictionary is used to translate each word into a sequence of one-hot encodings of phones (the size of the one-hot encoding is 45, the number of phones in US English). Each of these encodings is weighted by the confidence score of that command in the N-best list. This sequence is fed into an RNN as explained in section 3.2.1, and $z_v^t$ is defined as the resulting encoding, computed by a recurrent layer (Chung et al., 2014) of size 15.
• PostSeq: A sequence of vectors (one vector per speech frame) with monophone-state level posterior probabilities is extracted from the output layer of a Deep Neural Network trained on the UASpeech corpus. The extracted vectors contain the posteriors of each of the 3 states (initial, central and final) for the 45 phones of US English. To reduce the dimensionality of the vectors, the posteriors of the states of each phone are merged by summing them. To reduce the length of the sequence, the mean of each group of 10 speech frames is taken. This produces a sequence of vectors of size 45 and maximum length of 20, which is fed into an RNN in the same way as the PhSeq features to obtain $z_v^t$.
• ValPhEnc: For each value filter, $z_v^t$ is defined as the 20-dimensional encoding of the sequence of phones of the command associated with the value $v$, extracted from the seq2seq model defined in section 3.2.2. The encoder and decoder RNNs of the seq2seq model are composed of two layers of 20 LSTM units, and the model is trained on the Combilex dictionary (Richmond et al., 2010).
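The preprocessing behind the PhSeq and PostSeq features can be sketched with NumPy. Several details are assumptions made for the example: the N-best entries are concatenated in list order, the three states of each phone are stored contiguously in the posterior vector, a trailing group shorter than 10 frames is still averaged, and the toy lexicon entry for "video" is hypothetical.

```python
import numpy as np

def weighted_phone_sequence(nbest, lexicon, phone_index):
    """Build confidence-weighted one-hot phone vectors from an N-best list.

    nbest is a list of (word, confidence); entries are concatenated in
    N-best order (an assumption about how the sequences are combined).
    """
    rows = []
    for word, conf in nbest:
        for phone in lexicon[word]:
            v = np.zeros(len(phone_index))
            v[phone_index[phone]] = conf
            rows.append(v)
    return np.array(rows)

def reduce_posteriors(post, n_states=3, group=10):
    """Sum the state posteriors of each phone, then average groups of frames.

    post has shape (frames, n_phones * n_states), with the states of each
    phone stored contiguously (assumed layout).
    """
    frames, dim = post.shape
    per_phone = post.reshape(frames, dim // n_states, n_states).sum(axis=2)
    chunks = [per_phone[i:i + group].mean(axis=0) for i in range(0, frames, group)]
    return np.array(chunks)

# PhSeq on a toy lexicon.
lexicon = {"radio": ["r", "ey", "d", "iy", "ow"],
           "video": ["v", "ih", "d", "iy", "ow"]}
phones = sorted({p for pron in lexicon.values() for p in pron})
phone_index = {p: i for i, p in enumerate(phones)}
ph_seq = weighted_phone_sequence([("radio", 0.6), ("video", 0.3)],
                                 lexicon, phone_index)

# PostSeq on random posteriors: 95 frames, 45 phones x 3 states.
rng = np.random.default_rng(0)
post = rng.random((95, 135))
post /= post.sum(axis=1, keepdims=True)  # each frame sums to 1
post_seq = reduce_posteriors(post)
```

Either sequence can then be passed to a recurrent encoder to obtain a fixed length vector per turn.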
Note that two different kinds of features can be distinguished; value identity features and ASR output features. Value identity features (ValAcc and ValPhEnc) give information about the value tracked. These features are different for each filter (as each filter has a different associated value), but they do not change over turns (time invariant). ASR output features (PhSeq and PostSeq), on the other hand, give information about the ASR output observed. They are the same for each filter but change in each dialogue turn.

Results
The results presented are the joint state tracking accuracy, the accuracy of each individual slot and the mean accuracy of the 3 slots. All three are reported because the relation between the mean slot accuracy and the joint accuracy was found to be highly nonlinear, due to the strong dependency on the ontology of the joint goals, while the costs optimised are related to the mean accuracy of the slots. All the following numbers represent the average results for the models tested with the 6 simulated users described in sec. 4.1.1. Table 1 presents the accuracy results for the model described in section 4.2, using only value-specific general features (General) and using the different features described in section 4.2.1. The models are trained on data from domain A and evaluated on data from domain B. Baseline presents the state tracking accuracy for the rule-based state tracker presented in section 4.1.2. It can be seen that the General tracker outperforms the baseline by more than 10%, suggesting that the baseline tracker does not perform well in ASR-challenging environments. As expected, including the accuracy estimates (ValAcc) outperforms all the other approaches, especially on the joint goal. Including PhSeq features gives a slightly worse performance on the joint goal but outperforms the General features in the mean slot accuracy. Comparing the slot-by-slot results, it can be seen that PhSeq features outperform General features in slot 1 accuracy by almost 2% while behaving similarly in the other 2 slots. PostSeq features have a performance very similar to PhSeq, suggesting that both features carry very similar information. Surprisingly, ValPhEnc and PhSeq-
To partially deal with this problem, Table 2 shows the accuracy results when 200 dialogues from domain B are included in the training data. Including these dialogues in the training data has only a very slight effect with the General and PhSeq features. ValPhEnc features, however, show a large improvement, outperforming General features by 4% in the joint goal and by more than 5% in the mean slot accuracy. This improvement is seen in all the slots individually. To ensure that the model is not just learning the identities of the words, ValId features extend the General features with a one-hot encoding of the word identity. As can be seen, even though the performance of ValId in the joint goal is very low, its mean slot accuracy improves on that of the General features by 2%. However, it is still more than 3% below the ValPhEnc features, showing that ValPhEnc features are not just learning the value identity: they are effectively correlating the performance of values that are similar in the phone encoding space. Finally, concatenating the PhSeq and ValPhEnc features outperforms all the other approaches, even the ValAcc features, for the mean slot accuracy by more than 4%.

Conclusions
This paper has shown how the generalisation to unseen states of a dialogue state tracker can be improved by extending the value-specific features with ASR accuracy estimates. Using an RNN encoder jointly trained with the general state tracker to encode phone-related sequential features slightly improved state tracking generalisation. However, when the model was trained using dense representations of phone sequences encoded with a seq2seq model, the tracker strongly overfitted to the training data, even though dropout regularisation and model combination were used. This might be caused by the small variability of the command vocabulary (36 commands in each domain), which was not large enough for the model to find useful correlations between phone encodings. When a small amount of data from the unseen domain was included in the training data, phone encodings greatly boosted performance. This showed that phone encodings are useful as dense representations of the phonetic structure of the command, helping the model correlate the state tracking performance of values that are close in the phonetic encoding space. This method was tested on a single-word command-based environmental control interface, where slot-value accuracies can easily be estimated. In addition, in this domain, the sequences of phonetic features are usually short. However, this method could be adapted to larger spoken dialogue systems by estimating the concept error rate of the SLU output of concepts related to slot-value pairs. Longer phonetic feature sequences could also be used to detect "problematic phones", or to correlate sentences with similar phonetic composition, given enough variability in the training dataset to avoid overfitting.