Encoding Word Confusion Networks with Recurrent Neural Networks for Dialog State Tracking

This paper presents our novel method to encode word confusion networks, which can represent a rich hypothesis space of automatic speech recognition systems, via recurrent neural networks. We demonstrate the utility of our approach for the task of dialog state tracking, which relies on automatic speech recognition output, in spoken dialog systems. Encoding confusion networks outperforms encoding the best automatic speech recognition hypothesis in a neural system for dialog state tracking on the well-known second Dialog State Tracking Challenge dataset.


Introduction
Spoken dialog systems (SDSs) allow users to naturally interact with machines through speech and are nowadays an important research direction, especially with the great success of automatic speech recognition (ASR) systems (Mohamed et al., 2012; Xiong et al., 2016). SDSs can be designed for generic purposes, e.g. smalltalk (Weizenbaum, 1966; Vinyals and Le, 2015), or for a specific task such as finding restaurants or booking flights (Bobrow et al., 1977; Wen et al., 2016). Here, we focus on task-oriented dialog systems, which assist users in reaching a certain goal.
Task-oriented dialog systems are often implemented in a modular architecture to break up the complex task of conducting dialogs into more manageable subtasks. Williams et al. (2016) describe the following prototypical set-up of such a modular architecture: First, an ASR system converts the spoken user utterance into text. Then, a spoken language understanding (SLU) module extracts the user's intent and coarse-grained semantic information. Next, a dialog state tracking (DST) component maintains a distribution over the state of the dialog, updating it in every turn. Given this information, the dialog policy manager decides on the next action of the system. Finally, a natural language generation (NLG) module forms the system reply that is converted into an audio signal via a text-to-speech synthesizer.
Error propagation poses a major problem in modular architectures, as later components depend on the output of the previous steps. We show in this paper that DST suffers from ASR errors, which was also noted by Mrksic et al. (2017). One solution is to avoid modularity and instead perform joint reasoning over several subtasks; e.g., many DST systems directly operate on ASR output and do not rely on a separate SLU module (Henderson et al., 2014c; Mrksic et al., 2017; Perez, 2017). End-to-end systems that can be directly trained on dialogs without intermediate annotations have been proposed for open-domain dialog systems (Vinyals and Le, 2015). This is more difficult to realize for task-oriented systems, as they often require domain knowledge and external databases. First steps in this direction were taken by Wen et al. (2016) and Zhao and Eskénazi (2016), yet these approaches do not integrate ASR into the joint reasoning process.
We take a first step towards integrating ASR in an end-to-end SDS by passing on a richer hypothesis space to subsequent components. Specifically, we investigate how the richer ASR hypothesis space can improve DST. We focus on these two components because they are at the beginning of the processing pipeline and provide vital information for the subsequent SDS components. Typically, ASR systems output the best hypothesis or an n-best list, which the majority of DST approaches to date use (Williams, 2014; Henderson et al., 2014c; Mrksic et al., 2017; Zilka and Jurcícek, 2015). However, n-best lists can only represent a very limited number of hypotheses. Internally, the ASR system maintains a rich hypothesis space in the form of speech lattices or confusion networks (cnets)¹.
We adapt recently proposed algorithms for encoding lattices with recurrent neural networks (RNNs) (Ladhak et al., 2016; Su et al., 2017) to encode cnets via an RNN based on Gated Recurrent Units (GRUs), perform DST in a neural encoder-classifier system, and show that this outperforms encoding only the best ASR hypothesis. We are aware of two DST approaches that incorporate posterior word probabilities from cnets in addition to features derived from the n-best lists (Williams, 2014; Vodolán et al., 2017), but to the best of our knowledge, we develop the first DST system directly operating on cnets.

Proposed Model
Our model, depicted in Figure 1, is based on an incremental DST system (Zilka and Jurcícek, 2015). It consists of an embedding layer for the words in the system and user utterances, followed by a fully connected layer of Rectified Linear Units (ReLUs) (Glorot et al., 2011), which yields the input to a recurrent layer that encodes the system and user outputs in each turn, with a softmax classifier on top. ⊕ denotes a weighted sum c_j of the system dialog act s_j and the user utterance u_j, where W_s, W_u, and b are learned parameters:

c_j = W_s s_j + W_u u_j + b    (1)

Independent experiments with the 1-best ASR output showed that this weighted sum of the system and user vector outperforms taking only the user vector u_j as in the original model of Zilka and Jurcícek (2015). We chose this architecture over other successful DST approaches that operate on the turn level of the dialogs (Henderson et al., 2014c; Mrksic et al., 2017) because it processes the system and user utterances word-by-word, which makes it easy to replace the recurrent layer of the original version with the cnet encoder.
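The weighted combination of the system and user encodings can be sketched in a few lines of numpy. The dimensions, initialization, and variable names here are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def combine(s_j, u_j, W_s, W_u, b):
    """Weighted combination c_j = W_s s_j + W_u u_j + b of the system
    dialog act encoding s_j and the user utterance encoding u_j."""
    return W_s @ s_j + W_u @ u_j + b

# Hypothetical dimensions and parameter values for illustration.
hidden = 4
rng = np.random.default_rng(0)
s_j = rng.normal(size=hidden)
u_j = rng.normal(size=hidden)
W_s, W_u = np.eye(hidden), np.eye(hidden)  # identity weights for the demo
b = np.zeros(hidden)
c_j = combine(s_j, u_j, W_s, W_u, b)
```

With identity weight matrices and a zero bias, the combination reduces to a plain sum of the two vectors, which makes the sketch easy to sanity-check.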
[Figure 1: The proposed model with GRU-based cnet encoder for a dialog with three turns. d_t are one-hot word vectors of the system dialog acts; w_t^i correspond to the word hypotheses in the timesteps of the cnets of the user utterances; s_j and u_j are the cnet GRU outputs at the end of each system or user utterance.]

Our cnet encoder is inspired by two recently proposed algorithms to encode lattices, one with a standard-memory RNN (Ladhak et al., 2016) and one with a GRU-based RNN (Su et al., 2017). In contrast to lattices, every cnet state has only one predecessor and groups together the alternative word hypotheses of a fixed time interval (timestep). Therefore, our cnet encoder is conceptually simpler and easier to implement than the lattice encoders: the recurrent memory only needs to retain the hidden state of the previous timestep, while in the lattice encoders the hidden states of all previously processed lattice states must be kept in memory throughout the encoding process. Following Su et al. (2017), we use GRUs as they provide an extended memory compared to plain RNNs².

[Figure 2: Encoding k alternative hypotheses at timestep t of a cnet.]

The cnet encoder reads in one timestep at a time, as depicted in Figure 2. The key idea is to separately process each of the k word hypothesis representations x_t^i in a timestep with the standard GRU to obtain k hidden states h_t^i as defined in Equation (2):

h_t^i = GRU(x_t^i, h_{t-1})    (2)

We experiment with the two pooling functions f_pool for the k hidden GRU states h_t^i of the alternative word hypotheses that were used by Ladhak et al. (2016): average pooling, h_t = (1/k) Σ_i h_t^i, and weighted pooling, h_t = Σ_i ω_t^i h_t^i, where the weights ω_t^i are derived from the ASR posterior scores of the hypotheses.

Instead of the system output in sentence form, we use the dialog act representations in the form of (dialog-act, slot, value) triples, e.g. 'inform food Thai', which contain the same information in a more compact way. Following Mrksic et al. (2017), we initialize the word embeddings with 300-dimensional semantically specialized PARAGRAM-SL999 embeddings (Wieting et al., 2015). The hyper-parameters for our model are listed in the appendix.
³ Throughout the paper, · denotes an element-wise product.
The cnet GRU subsumes a standard GRU-based RNN if each token in the input is represented as a timestep with a single hypothesis. We adopt this method for the system dialog acts and for the baseline model that encodes only the best ASR hypothesis.
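The cnet encoder described above can be sketched as follows. This is a minimal numpy illustration with untrained random weights and hypothetical dimensions, not the paper's implementation; it also demonstrates the subsumption property: with a single hypothesis per timestep, the cnet GRU reduces to a plain GRU-based RNN.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell with random, untrained weights (illustration only)."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        shape = (hidden_dim, input_dim + hidden_dim)
        self.Wz = rng.normal(0, 0.1, shape)  # update gate
        self.Wr = rng.normal(0, 0.1, shape)  # reset gate
        self.Wh = rng.normal(0, 0.1, shape)  # candidate state

    def step(self, x, h_prev):
        xh = np.concatenate([x, h_prev])
        z = sigmoid(self.Wz @ xh)
        r = sigmoid(self.Wr @ xh)
        h_cand = np.tanh(self.Wh @ np.concatenate([x, r * h_prev]))
        return (1 - z) * h_prev + z * h_cand

def encode_cnet(cnet, gru, hidden_dim, pooling="weighted"):
    """Encode a cnet given as a list of timesteps, each a list of
    (word_embedding, posterior_score) pairs for the k hypotheses."""
    h = np.zeros(hidden_dim)
    for timestep in cnet:
        # Process every hypothesis with the same GRU and previous state.
        hyp_states = [gru.step(x, h) for x, _ in timestep]
        if pooling == "average":
            h = np.mean(hyp_states, axis=0)
        else:  # pool weighted by normalized ASR posterior scores
            scores = np.array([s for _, s in timestep])
            weights = scores / scores.sum()
            h = sum(w * hs for w, hs in zip(weights, hyp_states))
    return h

# Demo: a cnet with one hypothesis per timestep behaves exactly like
# running a plain GRU over the token sequence.
emb_dim, hid_dim = 5, 8
gru = GRUCell(emb_dim, hid_dim, seed=1)
rng = np.random.default_rng(2)
words = [rng.normal(size=emb_dim) for _ in range(3)]
h_cnet = encode_cnet([[(x, 1.0)] for x in words], gru, hid_dim)
h_plain = np.zeros(hid_dim)
for x in words:
    h_plain = gru.step(x, h_plain)
```

Note that with k = 1 both pooling functions return the single hypothesis state unchanged, which is why the same encoder can be reused for the 1-best baseline.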

Data
In our experiments, we use the dataset provided for the second Dialog State Tracking Challenge (DSTC2) (Henderson et al., 2014a), which consists of user interactions with an SDS in the restaurant domain. It encompasses 1612, 506, and 1117 dialogs for training, development, and testing, respectively. Every dialog turn is annotated with its dialog state, encompassing the three goals for area (7 values), food (93 values), and price range (5 values) as well as 8 requestable slots, e.g. phone and address. We train on the manual transcripts and the cnets provided with the dataset and evaluate on the cnets.
Some system dialog acts in the DSTC2 dataset do not correspond to words and thus were not included in the pretrained word embeddings. Therefore, we manually constructed a mapping of dialog acts to words contained in the embeddings, where necessary, e.g. we mapped expl-conf to explicit confirm.
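Such a mapping can be implemented as a simple lookup. Only the expl-conf entry below is taken from the paper; the other entries are hypothetical examples of the same kind:

```python
# Only "expl-conf" -> "explicit confirm" is from the paper;
# the remaining entries are hypothetical illustrations.
DIALOG_ACT_TO_WORDS = {
    "expl-conf": "explicit confirm",
    "impl-conf": "implicit confirm",
    "reqalts": "request alternatives",
}

def normalize_dialog_act(act_token):
    """Replace dialog-act tokens that lack pretrained embeddings
    with word sequences covered by the embedding vocabulary."""
    return DIALOG_ACT_TO_WORDS.get(act_token, act_token)
```

Tokens without an entry, such as ordinary words, pass through unchanged.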
In order to estimate the potential of improving DST by cnets, we investigated the coverage of words from the manual transcripts for different ASR output types. As shown in Table 1, cnets improve the coverage of words from the transcripts by more than 15 percentage points over the best hypothesis and more than five percentage points over the 10-best hypotheses.
However, the cnets provided with the DSTC2 dataset are quite large. The average cnet consists of 23 timesteps with 5.5 hypotheses each, amounting to about 125 tokens, while the average best hypothesis contains four tokens. Manual inspection of the cnets revealed that they contain a lot of noise, such as interjections (uh, oh, ...) that never appear in the 10-best lists. The appendix provides an example cnet for illustration. To reduce the processing time and the number of noisy hypotheses, we remove all interjections and additionally experiment with pruning hypotheses whose score falls below a certain threshold. As shown in Table 1, this does not discard too many correct hypotheses but markedly reduces the size of the cnets to an average of seven timesteps with two hypotheses each.
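The pruning step can be sketched as follows. The interjection list here is hypothetical (the paper does not enumerate the interjections it removes), and the cnet representation as (word, score) pairs per timestep is an assumption:

```python
# Hypothetical interjection list; the paper only gives "uh, oh, ..." as examples.
INTERJECTIONS = {"uh", "oh", "um", "hmm", "ah"}

def prune_cnet(cnet, threshold=0.001):
    """Drop interjections and hypotheses scored below the threshold;
    timesteps left empty by pruning are removed entirely."""
    pruned = []
    for timestep in cnet:
        kept = [(word, score) for word, score in timestep
                if word not in INTERJECTIONS and score >= threshold]
        if kept:
            pruned.append(kept)
    return pruned

cnet = [
    [("uh", 0.6), ("want", 0.4)],
    [("thai", 0.9), ("tie", 0.0005)],
]
pruned = prune_cnet(cnet)
```

With the default threshold of 0.001, the interjection "uh" and the low-probability hypothesis "tie" are discarded in this toy example.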

Results and Discussion
We report the joint goals and requests accuracy (all goals or requests are correct in a turn) according to the DSTC2 featured metric (Henderson et al., 2014a). We train each configuration 10 times with different random seeds and report the average, minimum and maximum accuracy.
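The joint accuracy metric can be sketched as below. This is a hypothetical re-implementation for illustration; the official DSTC2 scoring scripts are authoritative:

```python
def joint_accuracy(predictions, gold):
    """Fraction of turns in which the full predicted state
    (all goals, or all requests) matches the gold annotation."""
    correct = sum(pred == ref for pred, ref in zip(predictions, gold))
    return correct / len(gold)

# Toy example: the second turn has one wrong slot value,
# so the whole turn counts as incorrect.
gold = [{"area": "north", "food": "thai"},
        {"area": "south", "food": "indian"}]
pred = [{"area": "north", "food": "thai"},
        {"area": "south", "food": "italian"}]
acc = joint_accuracy(pred, gold)
```

Because a single wrong slot makes the entire turn count as incorrect, joint accuracy is stricter than per-slot accuracy.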
To study the impact of ASR errors on DST, we trained and evaluated our model on the different user utterance representations provided in the DSTC2 dataset. Our baseline model uses the best hypothesis of the batch ASR system, which has a word error rate (WER) of 34% on the DSTC2 test set. Most DST approaches use the hypotheses of the live ASR system, which has a lower WER of 29%. We train our baseline on the batch ASR outputs, as the cnets were also produced by this system; the results are shown in Table 2.

[Table 2: DSTC2 test set accuracy for 1-best ASR outputs of ten runs with different random seeds in the format average/maximum/minimum.]

Table 3 displays the results for our model evaluated on cnets for increasingly aggressive pruning levels (discarding only interjections, and additionally discarding hypotheses with scores below 0.001 or 0.01, respectively). As can be seen, using the full cnet except for interjections does not improve over the baseline. We believe that the share of noisy hypotheses in the DSTC2 cnets is too high for our model to be able to concentrate on the correct hypotheses. However, when pruning low-probability hypotheses, both pooling strategies improve over the baseline. Yet, average pooling performs worse for the lower pruning threshold, which shows that the model is still affected by noise among the hypotheses. Conversely, the model can exploit a rich but noisy hypothesis space by weighting the information retained from each hypothesis: weighted pooling performs better for the lower pruning threshold of 0.001, with which we obtain the highest result overall, improving the joint goals accuracy by 1.6 percentage points compared to the baseline. Therefore, we conclude that it is beneficial to use information from all alternatives and not just the highest scoring one, but that it is necessary to incorporate the scores of the hypotheses and to prune low-probability hypotheses.
Moreover, we see that an ensemble model that averages the predictions of ten cnet models trained with different random seeds also outperforms an ensemble of ten baseline models.
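Averaging the predictions of several models trained with different seeds can be sketched as follows. The distributions below are invented for illustration, and representing predictions as per-slot probability vectors is an assumption about the paper's setup:

```python
import numpy as np

def ensemble_predict(distributions):
    """Average the per-model probability distributions over the values
    of a slot and return the index of the highest-probability value."""
    avg = np.mean(distributions, axis=0)
    return int(np.argmax(avg)), avg

# Three hypothetical models' output distributions over four slot values.
dists = np.array([
    [0.6, 0.2, 0.1, 0.1],
    [0.3, 0.4, 0.2, 0.1],
    [0.5, 0.1, 0.3, 0.1],
])
best, avg = ensemble_predict(dists)
```

Averaging valid probability distributions yields another valid distribution, so the ensemble's argmax decision is well-defined.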

Results of the Model with Cnet Encoder
Although it would be interesting to compare the performance of cnets to full lattices, this is not possible with the original DSTC2 data, as no lattices were provided. This could be investigated in further experiments by running a new ASR system on the DSTC2 dataset to obtain both lattices and cnets. However, such results would not be comparable to previous results on this dataset due to the different ASR output.

Comparison to the State of the Art
The current state of the art on the DSTC2 dataset in terms of joint goals accuracy is an ensemble of neural models based on hand-crafted update rules and RNNs (Vodolán et al., 2017). Besides, this model uses a delexicalization mechanism that replaces mentions of slots or values from the DSTC2 ontology with a placeholder to learn value-independent patterns (Henderson et al., 2014c,b). While this approach is suitable for small domains and for languages with a simple morphology such as English, it becomes increasingly difficult to locate words or phrases corresponding to slots or values in wider domains or in languages with a rich morphology (Mrksic et al., 2017), and we therefore abstained from delexicalization.

[Table 3: DSTC2 test set accuracy of ten runs with different random seeds in the format average/maximum/minimum. Statistically significant improvements over the baseline are marked (p < 0.05, Wilcoxon signed-rank test with Bonferroni correction for ten repeated comparisons). The cnet ensemble corresponds to the best cnet model with pruning threshold 0.001 and weighted pooling.]
The best result for the joint requests was obtained by a ranking model based on hand-crafted features, which relies on separate SLU systems besides ASR (Williams, 2014). SLU is often cast as a sequence labeling problem, where each word in the utterance is annotated with its role with respect to the user's intent (Raymond, 2007; Vu et al., 2016), requiring training data with fine-grained word-level annotations in contrast to the turn-level dialog state annotations. Furthermore, a separate SLU component introduces an additional set of parameters to the SDS that has to be learned. Therefore, it has been argued that SLU and DST should be performed jointly in a single system (Henderson et al., 2014c), which is the approach we follow in this work.
As a more comparable reference for our setup, we provide the result of the neural DST system of Mrksic et al. (2017), which, like our approach, uses neither the outputs of a separate SLU system nor delexicalized features. Our ensemble models outperform Mrksic et al. (2017) for the joint requests but are slightly worse for the joint goals. We stress that our goal was not to reach state-of-the-art performance but to show that DST can benefit from encoding cnets.

Conclusion
As we show in this paper, ASR errors pose a major obstacle to accurate DST in SDSs. To reduce this error propagation, we propose exploiting the rich ASR hypothesis space encoded in cnets, which contain more correct hypotheses than the conventionally used n-best lists. We develop a novel method to encode cnets via a GRU-based RNN and demonstrate that it improves DST performance compared to encoding the best ASR hypothesis on the DSTC2 dataset.
In future experiments, we would like to explore further ways to leverage the scores of the hypotheses, for example by incorporating them as an independent feature rather than a direct weight in the model.