Neural Belief Tracker: Data-Driven Dialogue State Tracking

One of the core components of modern spoken dialogue systems is the belief tracker, which estimates the user’s goal at every step of the dialogue. However, most current approaches have difficulty scaling to larger, more complex dialogue domains. This is due to their dependency on either: a) Spoken Language Understanding models that require large amounts of annotated training data; or b) hand-crafted lexicons for capturing some of the linguistic variation in users’ language. We propose a novel Neural Belief Tracking (NBT) framework which overcomes these problems by building on recent advances in representation learning. NBT models reason over pre-trained word vectors, learning to compose them into distributed representations of user utterances and dialogue context. Our evaluation on two datasets shows that this approach surpasses past limitations, matching the performance of state-of-the-art models which rely on hand-crafted semantic lexicons and outperforming them when such lexicons are not provided.


Introduction
Spoken dialogue systems (SDS) allow users to interact with computer applications through conversation. Task-oriented dialogue systems help users achieve goals such as finding restaurants or booking flights. The dialogue state tracking (DST) component of the SDS serves to interpret user input and update the belief state, which is the system's internal representation of the state of the conversation. This is a probability distribution over dialogue states used by the dialogue manager component to decide which action the system should perform next.
The Dialogue State Tracking Challenge (DSTC) series of shared tasks has provided a common evaluation framework accompanied by labelled datasets (Williams et al., 2016). In this framework, systems must track the search constraints expressed by users (goals) and questions the users ask about search results (requests), taking into account the user's utterance (passed through a speech recogniser) and the dialogue context (e.g., what the system just said). The following example shows the true state after each user utterance in a three-turn conversation:

User: I'm looking for a cheap restaurant
    inform(price=cheap)
System: How about Thai food?
User: Yes please
    inform(price=cheap, food=Thai)
System: The House serves cheap Thai food
User: Where is it?
    inform(price=cheap, food=Thai); request(address)
System: The House is at 106 Regent Street

A task-based dialogue system is backed by an ontology which describes the range of user intents that the system can process and provides the building blocks of the belief state (e.g. price=cheap, address). DST systems depend on identifying mentions of ontology items in user utterances, which becomes a non-trivial task when confronted with lexical variation, noisy speech recognition and the effect of context. Some approaches assume that a separate Spoken Language Understanding (SLU) module will solve this problem for them. However, coupling SLU and DST in a single model has been shown to improve tracking performance (Henderson et al., 2014d). On the other hand, these coupled models often rely on manually constructed dictionaries to identify inexact mentions of ontology items; this is clearly not scalable to larger, more complex domains.
In this paper, we present two new models, collectively called the Neural Belief Tracker (NBT) family. The NBT models couple SLU and DST, efficiently learning to handle lexical variation and input noise without requiring any hand-crafted resources. In evaluations on two datasets, we show that NBT models match the performance of state-of-the-art models that have greater resource requirements and outperform models with comparable task-specific requirements. Consequently, we believe this work proposes a framework better suited to scaling belief tracking models for deployment in real-world dialogue systems operating over sophisticated application domains.

Background
Models for probabilistic dialogue state tracking, or belief tracking, were introduced as components of spoken dialogue systems in order to better handle noisy speech recognition output and other sources of uncertainty in understanding a user's goals (Bohus and Rudnicky, 2006; Williams and Young, 2007; Young et al., 2010). Modern dialogue management policies can learn to use a tracker's distribution over intentions to decide whether to execute an action or request clarification from the user. As mentioned above, the DSTC shared tasks have spurred research on this problem and institutionalised a standard evaluation paradigm (Williams et al., 2013; Henderson et al., 2014b; Henderson et al., 2014a). In this setting, the task is defined by an ontology that enumerates the goals a user can inform and the attributes of entities that the user can request information about. Many different belief tracking models have been proposed in the literature, from generative (Thomson and Young, 2010) and discriminative (Henderson et al., 2014d) statistical models to rule-based systems (Wang and Lemon, 2013). In order to motivate the work presented here, we can categorise prior research according to its reliance (or otherwise) on a separate SLU module for interpreting the user's utterances:
Separate SLU: Traditional SDS pipelines use Spoken Language Understanding (SLU) decoders to detect slot-value pairs expressed in the Automated Speech Recognition (ASR) output. The downstream DST model then combines this information with the past dialogue context to update the belief state (Thomson and Young, 2010; Wang and Lemon, 2013; Lee and Kim, 2016; Perez, 2016; Sun et al., 2016). In the DSTC challenges, some systems used the output of template-based matching systems such as Phoenix (Wang, 1994). However, more robust and accurate statistical SLU systems are available. Many discriminative approaches to spoken dialogue SLU train independent binary models that decide whether each slot-value pair was expressed in the user utterance. Given enough data, these models can learn which lexical features are good indicators for a given value and can capture elements of paraphrasing (Mairesse et al., 2009). This line of work later shifted focus to robust handling of rich ASR output (Henderson et al., 2012; Tur et al., 2013). SLU has also been treated as a sequence labelling problem, where each word in an utterance is labelled according to its role in the user's intent; standard labelling models such as CRFs or Recurrent Neural Networks can then be used (Raymond and Riccardi, 2007; Yao et al., 2014; Celikyilmaz and Hakkani-Tur, 2015; Mesnil et al., 2015). Other approaches adopt a more complex modelling structure inspired by semantic parsing (Saleh et al., 2014; Vlachos and Clark, 2014). One drawback shared by these methods is their resource requirements, either because they need to learn independent parameters for each slot and value or because they need fine-grained manual annotation at the word level. This hinders scaling to larger, more realistic domains.
Joint SLU/DST: Research on belief tracking has found it advantageous to reason about SLU and DST jointly, taking ASR predictions as input and generating belief states as output (Henderson et al., 2014d; Sun et al., 2014; Zilka and Jurcicek, 2015). In DSTC2, systems with no external SLU component outperformed all systems that only used external SLU features. Joint models often rely on a strategy known as delexicalisation whereby slots and values mentioned in the text are replaced with generic labels. Once the dataset is transformed in this manner, one can extract a collection of template-like n-gram features such as [want tagged-value food]. To perform belief tracking, the shared model iterates over all slot-value pairs, extracting delexicalised feature vectors and making a separate binary decision regarding each pair. Delexicalisation introduces a hidden dependency that is rarely discussed: how do we identify slot/value mentions in text? For toy domains, one can manually construct semantic dictionaries which list the potential rephrasings for all slot values. As shown by Mrkšić et al. (2016), the use of such dictionaries is essential for the performance of current delexicalisation-based models. Again though, this will not scale to the rich variety of user language or to general domains.
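The dependency of delexicalisation on a semantic dictionary can be illustrated with a minimal Python sketch. The dictionary entries, the tag names and the functions `delexicalise` and `ngrams` are hypothetical and chosen only for illustration; they are not the resources or code used by the cited systems.

```python
import re

# Hypothetical semantic dictionary: each (slot, value) lists its known rephrasings.
SEMANTIC_DICT = {
    ("price", "cheap"): ["cheap", "inexpensive", "budget"],
    ("food", "thai"): ["thai"],
}

def delexicalise(utterance, slot, value, dictionary=SEMANTIC_DICT):
    """Replace mentions of the candidate value and slot with generic tags."""
    text = utterance.lower()
    # Without a dictionary entry, only an exact match of the value is found.
    for phrase in dictionary.get((slot, value), [value]):
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", "tagged-value", text)
    return re.sub(r"\b" + re.escape(slot) + r"\b", "tagged-slot", text)

def ngrams(text, n):
    """Template-like n-gram features of a delexicalised utterance."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

In this sketch, 'something inexpensive please' is matched for price=cheap only because 'inexpensive' was listed by hand, whereas 'a pricey place' is left untouched for price=expensive because no rephrasing was provided.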
The primary motivation for the work presented in this paper is to address the limitations that affect previous belief tracking models: we learn efficiently from the available data by maximising the number of parameters that are shared across the ontology, while still giving the system the flexibility to learn paraphrasings and other kinds of variation that make it infeasible to rely on exact matching and delexicalisation as a robust strategy.

Neural Belief Tracker
The Neural Belief Tracker (NBT) is a model designed to detect the slot-value pairs expressed in any given user utterance during the flow of dialogue. Its input consists of the system dialogue acts preceding the user input, the user utterance itself, and a single slot-value pair it needs to make a decision about. To do belief tracking, we iterate over all candidate pairs (defined by the domain ontology) and use the NBT to decide which ones have been expressed by the user.
Figure 1 presents the flow of information in the model. The first layer in the NBT hierarchy performs representation learning for the three model inputs, producing vector representations for the user utterance ($r$), the current candidate slot-value pair ($c$) and the system dialogue acts ($t_q$, $t_s$, $t_v$). Subsequently, the learned vector representations interact through the context modelling and semantic decoding submodules to obtain the intermediate interaction summary vectors $d_r$, $d_c$ and $d$. These are used as input to the final decision-making module which decides whether the user expressed the intent represented by the current candidate slot-value pair.

Representation Learning
For any given user utterance, system act(s) and candidate slot-value pair, the representation learning submodules produce vector representations which act as input for the downstream components of the model. All representation learning subcomponents make use of pre-trained collections of word vectors. As shown by Mrkšić et al. (2016), specialising word vectors to express semantic similarity rather than relatedness-by-association is essential for improving belief tracking performance. For this reason, we use the semantically-specialised Paragram-SL999 word vectors (Wieting et al., 2015) throughout this work. The NBT training procedure keeps these vectors fixed: that way, at test time, unseen words semantically related to familiar slot values (e.g. inexpensive to cheap) will be recognised purely by their position in the original vector space. This also means that candidate slot-value pairs implicitly parametrise the NBT model, allowing its parameters to be shared across all values of a slot, or even across all slots.
We consider a user utterance $u$ consisting of $k_u$ words $u_1, u_2, \ldots, u_{k_u}$. Each word has an associated word vector $\mathbf{u}_1, \ldots, \mathbf{u}_{k_u}$. We propose two different models for producing a representation of $u$: NBT-DNN and NBT-CNN. Both act over the constituent n-grams of the utterance. Let $v_i^n$ be the concatenation of the $n$ word vectors starting at index $i$, so that:
$$v_i^n = \mathbf{u}_i \oplus \ldots \oplus \mathbf{u}_{i+n-1} \tag{1}$$
where $\oplus$ denotes vector concatenation. The simpler version of the two models, which we term NBT-DNN, is shown in Figure 2. This model computes cumulative n-gram representation vectors $r_1$, $r_2$ and $r_3$, which are the n-gram 'summaries' of the unigrams, bigrams and trigrams in the user utterance:
$$r_n = \sum_{i=1}^{k_u - n + 1} v_i^n \tag{2}$$
Each of these vectors is then non-linearly mapped to intermediate representations of the same size:
$$r'_n = \sigma(W_n^s r_n + b_n^s) \tag{3}$$
where the weight matrices $W_n^s$ and bias terms $b_n^s$ map the cumulative n-grams to vectors of the same size and $\sigma$ denotes the sigmoid activation function. Here we maintain a separate set of parameters for each slot (indicated by superscript $s$); we later investigate tying parameters across all slots (Section 6.2, General Models: Slot-Parameter Tying). The three vectors are summed to obtain a single representation for the user utterance:
$$r = r'_1 + r'_2 + r'_3 \tag{4}$$
The cumulative n-gram representations used by this model are just an unweighted sum of all word vectors in the utterance. Ideally, the model should learn to recognise which parts of the utterance are more relevant for the subsequent classification task. For instance, it could learn to ignore verbs or stop words and pay more attention to adjectives and nouns which are more likely to express slot values.
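The NBT-DNN computation described above (concatenate n-grams, sum them, apply a sigmoid layer per n, then sum the three results) can be sketched in plain Python. This is a minimal illustration under stated assumptions rather than the authors' TensorFlow implementation: the function names and the `params` layout (a dictionary mapping each n to a weight matrix and bias vector) are hypothetical, and an utterance shorter than n words is assumed to yield a zero cumulative vector.

```python
import math

def concat_ngrams(word_vectors, n):
    """Return v_i^n for every valid i: n consecutive word vectors concatenated."""
    return [
        [x for vec in word_vectors[i:i + n] for x in vec]
        for i in range(len(word_vectors) - n + 1)
    ]

def sigmoid_layer(weights, bias, x):
    """Apply sigma(W x + b), with W given as a list of rows."""
    return [
        1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
        for row, b in zip(weights, bias)
    ]

def nbt_dnn_representation(word_vectors, params):
    """params[n] = (W_n, b_n) for n in 1, 2, 3; returns the utterance vector r."""
    dim = len(word_vectors[0])
    r = [0.0] * len(params[1][1])
    for n in (1, 2, 3):
        # Cumulative n-gram vector r_n: unweighted sum of all concatenated n-grams.
        r_n = [0.0] * (n * dim)
        for gram in concat_ngrams(word_vectors, n):
            r_n = [a + b for a, b in zip(r_n, gram)]
        weights, bias = params[n]
        # Map r_n to r'_n and accumulate it into the utterance representation.
        r = [a + b for a, b in zip(r, sigmoid_layer(weights, bias, r_n))]
    return r
```

Because every n-gram contributes equally to `r_n`, word order beyond the trigram window and the relative importance of individual words are lost, which is the limitation the convolutional variant is meant to address.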
Our second model, which we term NBT-CNN, draws inspiration from successful applications of Convolutional Neural Networks (CNNs) for language understanding (Collobert et al., 2011; Kalchbrenner et al., 2014; Kim, 2014). These models typically apply a number of convolutional filters to n-grams in the input sentence, followed by non-linear activation functions and max-pooling. Following this approach, the NBT-CNN model applies $L = 300$ different filters for n-gram lengths of 1, 2 and 3 (Figure 3). Let $F_n^s \in \mathbb{R}^{L \times nD}$ denote the collection of filters for each value of $n$, where $D = 300$ is the word vector dimensionality. If $v_i^n$ denotes the concatenation of $n$ word vectors starting at index $i$, let $m_n = [v_1^n; v_2^n; \ldots; v_{k_u-n+1}^n]$ be the list of n-grams that convolutional filters of length $n$ run over. The three intermediate representations are then given by:
$$R_n = F_n^s \, m_n \tag{5}$$
Each row of the intermediate matrices $R_n$ is produced by a single convolutional filter of length $n$, and each column corresponds to one n-gram position in the utterance.
We obtain summary n-gram representations by pushing these representations through a rectified linear unit (ReLU) activation function and max-pooling over time (i.e. columns) to get a single feature for each of the $L$ filters applied to the utterance:
$$r'_n = \mathrm{maxpool}\left(\mathrm{ReLU}\left(R_n + b_n^s\right)\right) \tag{6}$$
where $b_n^s$ is a bias term broadcast across all filters. Finally, the three summary n-gram representations are summed to obtain the final utterance representation vector $r$ (as in Equation 4). The NBT-CNN model is (by design) better suited to longer utterances, as its convolutional filters interact with subsequences of the utterance, and not just the noisy summaries given by the NBT-DNN's cumulative n-grams.
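The convolution, ReLU and max-pooling steps of NBT-CNN can be sketched as follows. This is a plain-Python illustration under assumptions, not the authors' implementation: the names `nbt_cnn_representation`, `filters` and `biases` are hypothetical, each bias is taken to be a single scalar per n-gram length (one reading of 'broadcast across all filters'), and an utterance with no n-grams of a given length is assumed to contribute zero.

```python
def concat_ngrams(word_vectors, n):
    """Return the n-grams of the utterance as concatenated word vectors."""
    return [
        [x for vec in word_vectors[i:i + n] for x in vec]
        for i in range(len(word_vectors) - n + 1)
    ]

def nbt_cnn_representation(word_vectors, filters, biases):
    """filters[n]: L filters, each a list of n*D weights; biases[n]: one scalar."""
    num_filters = len(filters[1])
    r = [0.0] * num_filters
    for n in (1, 2, 3):
        grams = concat_ngrams(word_vectors, n)
        for j, filt in enumerate(filters[n]):
            # One filter response per n-gram position, followed by ReLU.
            activations = [
                max(0.0, sum(w * x for w, x in zip(filt, gram)) + biases[n])
                for gram in grams
            ]
            # Max-pooling over time keeps a single feature per filter.
            r[j] += max(activations, default=0.0)
    return r
```

Unlike the cumulative sums of NBT-DNN, each pooled feature here reflects the single best-matching n-gram for a filter, so an informative phrase is not averaged away by the rest of a long utterance.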

Semantic Decoding
The NBT diagram in Figure 1 shows that the utterance representation $r$ and the candidate slot-value pair representation $c$ directly interact through the semantic decoding module. This component decides whether the user explicitly expressed an intent matching the current candidate pair (i.e. without taking the dialogue context into account). Examples of such matches would be 'I want Thai food' with food=Thai or more demanding ones such as 'a pricey restaurant' with price=expensive. This is where the use of high-quality pre-trained word vectors comes into play: a delexicalisation-based model could deal with the former example but would be helpless in the latter case, unless a human expert had provided a semantic dictionary listing all potential rephrasings for each value in the domain ontology.
Let the vector space representations of a candidate pair's slot name and value be given by $c_s$ and $c_v$ (with word vectors of multi-word slot names or values summed together). The NBT model learns to map this vector tuple into a single vector $c$ of the same dimensionality as the utterance representation $r$. These two representations are then forced to interact in order to learn a similarity metric which discriminates between interactions of utterances with slot-value pairs that they either do or do not express:
$$c = \sigma\left(W_c^s (c_s + c_v) + b_c^s\right) \tag{7}$$
$$d = r \otimes c \tag{8}$$
where $\otimes$ denotes element-wise vector multiplication.
The dot product, which may seem like the more intuitive similarity metric, would reduce the rich set of features in $d$ to a single scalar. The element-wise multiplication allows the downstream network to make better use of its parameters by learning non-linear interactions between sets of features in $r$ and $c$. This network (Binary Decision Maker in Figure 1) uses one intermediate hidden layer of size 100 to make the final decision about the current candidate pair.
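The contrast between the element-wise product and the dot product can be made concrete with a short Python sketch. It is illustrative only: the function names are hypothetical, and the way the slot and value vectors are combined (summed, then passed through one sigmoid layer) is an assumption about the mapping to $c$, which the text describes only as a learned map from the vector tuple to a single vector.

```python
import math

def candidate_vector(c_s, c_v, weights, bias):
    """Assumed mapping: c = sigma(W (c_s + c_v) + b), with W as a list of rows."""
    summed = [a + b for a, b in zip(c_s, c_v)]
    return [
        1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, summed)) + b)))
        for row, b in zip(weights, bias)
    ]

def semantic_decoding(r, c):
    """d = r (x) c: element-wise product, one interaction feature per dimension."""
    return [a * b for a, b in zip(r, c)]
```

Summing the entries of `semantic_decoding(r, c)` gives exactly the dot product of `r` and `c`, which shows what is discarded by the scalar similarity: the decision network would no longer see which dimensions contributed to the match.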

Context Modelling
This 'decoder' does not yet suffice to extract intents from utterances in human-machine dialogue. To understand some queries, the belief tracker must be aware of context, i.e. the flow of dialogue leading up to the latest user utterance. While all previous system and user utterances are important, the most relevant one is the last system utterance, in which the dialogue system could have performed (among others) one of the following two system acts:
1. System Request: The system asks the user about the value of a specific slot $T_q$. If the system utterance is 'what price range would you like?' and the user answers with 'any', the model must infer the reference to price range, and not to other slots such as area or food type.
2. System Confirm: The system asks the user to confirm whether a specific slot-value pair $(T_s, T_v)$ is part of their desired constraints. For example, if the user responds to 'how about Turkish food?' with 'yes', the model must be aware of the system act in order to correctly update the belief state.
If we make the Markovian decision to only consider the last set of system acts, we can incorporate context modelling into the NBT model. Let $t_q$ and $(t_s, t_v)$ be the word vector representations of the arguments for the system request and confirm acts (zero vectors if none). The model computes the following measures of similarity between the system acts, candidate pair $(c_s, c_v)$ and utterance representation $r$:
$$d_r = (c_s \cdot t_q) \, r \tag{9}$$
$$d_c = (c_s \cdot t_s)(c_v \cdot t_v) \, r \tag{10}$$
where $\cdot$ denotes the dot product. The computed similarity terms act as gating mechanisms which only pass the utterance representation through if the system asked about the current candidate slot or slot-value pair. This type of interaction is particularly useful for the confirm system act: if the system asks the user to confirm, the user is likely not to mention any slot values, but to just respond affirmatively or negatively. This means that the model must consider the three-way interaction between the utterance, candidate slot-value pair and the slot-value pair offered by the system. If (and only if) the latter two are the same should the model consider the affirmative or negative polarity of the user utterance when making the subsequent binary decision. Finally, these two context modelling summary representations are passed to the decision making module, which combines them with the semantic decoding output to make the final decision.
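The gating behaviour of the context modelling step can be sketched in Python as below. This is a minimal illustration, assuming the two summary vectors are the utterance representation scaled by dot-product similarities between the candidate pair and the system act arguments; the function names are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def context_summaries(r, c_s, c_v, t_q, t_s, t_v):
    """Return (d_r, d_c): r gated by request similarity and by confirm similarity."""
    request_gate = dot(c_s, t_q)                  # did the system request this slot?
    confirm_gate = dot(c_s, t_s) * dot(c_v, t_v)  # did it offer this exact pair?
    d_r = [request_gate * x for x in r]
    d_c = [confirm_gate * x for x in r]
    return d_r, d_c
```

With toy orthogonal vectors, a confirm act for food=Turkish passes `r` through unchanged for the candidate food=Turkish and zeroes it for food=Thai. Real word vectors are not orthogonal, so in practice the gates are soft rather than binary.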

Belief Tracking
In spoken dialogue systems, belief tracking models operate over the output of automated speech recognition (ASR). Despite ever-improving speech recognition, the need to make the most out of imperfect ASR will persist as dialogue systems are used in increasingly noisy environments. In this work, we use a simple rule-based belief state update mechanism which can be applied to ASR N-best lists.
Let $u$ denote a list of $N$ ASR hypotheses $h_i$ with posterior probabilities $p_i$ which follow system output $sys$. For any hypothesis $h$, slot $s$ and slot value $v \in V_s$, the NBT model estimates $P(s, v \mid h, sys)$, i.e. the probability that $(s, v)$ was expressed in $h$. The predictions for $N$ such hypotheses are combined as:
$$P(s, v \mid u, sys) = \sum_{i=1}^{N} p_i \, P(s, v \mid h_i, sys)$$
For slot $s$, the set of its detected values is given by:
$$V_s^* = \{ v \in V_s \mid P(s, v \mid u, sys) \geq 0.5 \}$$
For informable slots, the value in $V_s^*$ with the highest probability is chosen (if $V_s^* \neq \emptyset$). This goal value persists in subsequent turns until a new value is detected for this slot. For requests, all slots in $V_{req}^*$ are deemed to have been requested. As requestable slots serve to model users' single-turn queries, they require no belief tracking across turns.
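The rule-based update over an N-best list can be sketched in a few lines of Python. The data layout (a list of pairs holding an ASR posterior and the per-value NBT probabilities for one slot), the function names and the default detection threshold of 0.5 are assumptions made for this illustration.

```python
def combine_nbest(nbest):
    """nbest: list of (p_i, {value: P(s, v | h_i, sys)}) pairs for a single slot."""
    combined = {}
    for posterior, value_probs in nbest:
        for value, prob in value_probs.items():
            # Weight each hypothesis's prediction by its ASR posterior.
            combined[value] = combined.get(value, 0.0) + posterior * prob
    return combined

def detected_values(combined, threshold=0.5):
    """Values whose combined probability reaches the (assumed) threshold."""
    return {v for v, p in combined.items() if p >= threshold}

def update_goal(previous_goal, combined, threshold=0.5):
    """Informable slots: take the most probable detected value, else keep the old goal."""
    detected = detected_values(combined, threshold)
    if not detected:
        return previous_goal
    return max(detected, key=lambda v: combined[v])
```

For a requestable slot, `detected_values` alone would be applied at each turn, since requests are not carried over between turns.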

Experiments
Two datasets were used for training and evaluation. Both consist of user conversations with task-oriented dialogue systems designed to help users find suitable restaurants around Cambridge, UK. The two corpora share the same domain ontology, which contains three informable (i.e. goal) slots: food type, area and price. The users can specify values for these slots in order to find restaurants which best meet their criteria. Once the system suggests a restaurant, the users can ask about the values of up to eight requestable slots (phone number, address, etc.). The two datasets are:
1. DSTC2: we use the transcriptions, ASR hypotheses and turn-level semantic labels provided for the Dialogue State Tracking Challenge 2 (Henderson et al., 2014a). The official transcriptions contain various spelling errors which we corrected manually; the cleaned version of the dataset will be made available on the first author's website. The training data contains 2207 dialogues (15611 turns) and the test set consists of 1117 dialogues. We train the NBT models on transcriptions and report their belief tracking performance on test set ASR hypotheses.
2. WOZ: Wen et al. (2016) performed a Wizard of Oz style experiment in which Amazon Mechanical Turk users assumed the role of the system or the user of a task-oriented dialogue system based on the DSTC2 ontology. Users typed instead of using speech, which means performance in the WOZ experiments is more indicative of the model's capacity for semantic understanding than its robustness to ASR errors. The experimental design gave users more freedom to use sophisticated language, making understanding much more difficult. The WOZ dataset contains 536 training dialogues (2209 turns) and 140 test dialogues.
The two corpora are used to synthesise training data for the NBT models. We iterate over all utterances, generating one example for each of the slot-value pairs in the DSTC2 ontology. Examples consist of a transcription, its context (i.e. list of system acts) and a candidate slot-value pair. The binary label for each example indicates whether or not its utterance and context express the example's candidate pair. To train the model, we used the Adam optimizer (Kingma and Ba, 2015) with a cross-entropy loss function, backpropagating through all the NBT subcomponents (but keeping the word vectors fixed). Both models were implemented using the TensorFlow framework (Abadi et al., 2015).

Results

Belief Tracking Performance
Table 1 shows the belief tracking performance of models trained and evaluated on DSTC2 data. We compare to a delexicalisation-based RNN model introduced by Henderson et al. (2014d; 2014c). The baseline model uses no semantic dictionary, while the improved baseline uses a hand-crafted semantic dictionary designed for the DSTC2 ontology by Henderson et al. The NBT models clearly outperform the first baseline in terms of average goal accuracy but fare marginally worse on requests. The NBT models' goal tracking performance is comparable to the improved baseline. This shows that these models can handle semantic relations which otherwise had to be explicitly encoded in semantic dictionaries. As NBT's rule-based belief state update is less sophisticated than Henderson et al.'s RNN, we believe that further gains are possible.
The WOZ dataset provides a more focused benchmark for assessing the semantic understanding capacity of belief tracking models. The baseline performance is a delexicalisation-based model incorpo- Again, the improved baseline uses a hand-crafted semantic dictionary. Table 2 shows the performance of NBT models trained on either DSTC2 or WOZ transcriptions. The NBT-CNN model outperforms the baselines across the board, even when it is trained using the (simpler) DSTC2 transcriptions. This shows that the CNN-based model is capable of handling long utterances and previously unseen vocabulary.
Conversely, the NBT-DNN model falls short of both goal tracking baselines, as its cumulative n-gram representations cannot cope with some of the very long and complex WOZ utterances. Contrary to the results in Table 1, both NBT models trained on in-domain WOZ data match or exceed the baselines' performance over requestable slots.

General Models: Slot-Parameter Tying
To train a single model which can be applied to all slots in the ontology (as opposed to using a separate model for each slot), we coalesce all training data and use it to train an NBT model. As shown by Mrkšić et al. (2015), such parameter tying can facilitate transfer learning and lead to improved belief tracking performance. Table 3 shows the impact of tying the parameters across all slots for NBT models trained on DSTC2 and WOZ together: it boosts NBT-DNN performance and leads to an improved trade-off between NBT-CNN's performance on informable and requestable slots. Figure 4 shows the t-SNE visualisation (van der Maaten and Hinton, 2008) of the user utterance representations produced by the NBT-DNN model with tied parameters. The figure shows that the model learns meaningful representations which cluster utterances with related intents together. Moreover, it reveals that utterances such as 'yes' and 'right' or 'I don't care' and 'It doesn't matter' are mapped close together, indicating that the NBT utterance representations capture a lot of semantic information from the underlying word vectors and the (relatively few) annotated transcriptions that they are exposed to.

The Importance of Word Vector Spaces
The NBT models use the semantic relations embedded in the pre-trained word vectors to handle semantic variation and produce high-quality intermediate representations. Table 4 shows the effect of using different word vectors to train the two NBT models. Using GloVe vectors (Pennington et al., 2014) in place of Paragram-SL999 (Wieting et al., 2015) drastically reduced the models' goal tracking capabilities. Interestingly, it had much less of an effect on requests, suggesting that the training set is large enough for the NBT models to learn specific phrases which trigger these ('How much?', 'Where is it?'), but not to ensure coverage for the numerous rephrasings of categorical slot values.

Conclusion
In this paper, we have proposed a novel belief tracking framework designed to overcome current obstacles to deploying dialogue systems in larger dialogue domains. The Neural Belief Tracker can use a single set of parameters to decide whether the given user utterance (and its context) expresses each of the intents defined by the domain ontology. This is achieved by learning similarity metrics which operate over intermediate representations composed from semantically specialised word vector representations. The NBT framework offers the known advantages of coupling Spoken Language Understanding and Dialogue State Tracking, without requiring the use of (hand-crafted) semantic lexicons to boost SLU recall. Our evaluation demonstrates these benefits, with NBT matching the performance of models that rely on such lexicons. Finally, we showed that sharing NBT models' parameters across slots can boost overall belief tracking performance, supporting our argument that these models promise to scale to complex domains with diverse and unequally distributed slots.

Figure 1: Architecture of the NBT Model. The implementation of the three representation learning subcomponents can be modified, as long as these produce adequate vector representations which the downstream model components can use to decide whether the current candidate slot-value pair was expressed in the user utterance (taking into account the preceding system acts).

Figure 2: NBT-DNN Model. Word vectors of all unigrams, bigrams and trigrams are summed to obtain cumulative n-gram representations. These are passed through another hidden layer and then summed to obtain the final utterance representation r.

Figure 3: NBT-CNN Model. L convolutional filters of window sizes 1, 2, 3 are applied to word vectors of the given utterance (L = 3 in the diagram, but L = 300 in the system). The convolution is followed by an application of the ReLU activation function and max-pooling to produce summary n-gram representations. These are summed to obtain the final utterance representation r.

Figure 4: Two-dimensional t-SNE embeddings of intermediate utterance representations (r) learned by the NBT-DNN model. The model was trained on combined DSTC2 and WOZ datasets, and its parameters were tied across all slots. The sentences shown in the figure were not cherry-picked: we chose which ones to show randomly, limiting the density of annotations to maximise visibility.

Table 1: Belief tracking performance (goal accuracy) over the informable and requestable slots on the DSTC2 test set.

Table 2: Performance (precision, recall, f-score) over joint goal and requestable slots for the WOZ test set. Datasets used to train each of the models are shown above the respective subtables.

Table 3: Performance comparison (WOZ test set) between a) slot-specific models; and b) models which share parameters across all slots. Trained on combined DSTC2 and WOZ data.

Table 4: The WOZ test set performance (f-scores) of NBT models depending on the word vector collection used. Models trained on combined DSTC2 and WOZ training transcriptions.