Towards End-to-End Learning for Dialog State Tracking and Management using Deep Reinforcement Learning

This paper presents an end-to-end framework for task-oriented dialog systems using a variant of Deep Recurrent Q-Networks (DRQN). The model is able to interface with a relational database and jointly learn policies for both language understanding and dialog strategy. Moreover, we propose a hybrid algorithm that combines the strength of reinforcement learning and supervised learning to achieve faster learning speed. We evaluated the proposed model on a 20 Question Game conversational game simulator. Results show that the proposed method outperforms the modular-based baseline and learns a distributed representation of the latent dialog state.


Introduction
Task-oriented dialog systems have been an important branch of spoken dialog system (SDS) research (Raux et al., 2005;Young, 2006;Bohus and Rudnicky, 2003).The SDS agent has to achieve some predefined targets (e.g.booking a flight) through natural language interaction with the users.The typical structure of a task-oriented dialog system is outlined in Figure 1 (Young, 2006).This pipeline consists of several independently-developed modules: natural language understanding (the NLU) maps the user utterances to some semantic representation.This information is further processed by the dialog state tracker (DST), which accumulates the input of the turn along with the dialog history.The DST outputs the current dialog state and the dialog policy selects the next system action based on the dialog state.Then natural language generation (NLG) maps the selected action to its surface form which is sent to the TTS (Text-to-Speech).This process repeats until the agent's goal is satisfied.The conventional SDS pipeline has limitations.The first issue is the credit assignment problem.Developers usually only get feedback from the end users, who inform them about system performance quality.Determining the source of the error requires tedious error analysis in each module because errors from upstream modules can propagate to the rest of the pipeline.The second limitation is process interdependence, which makes online adaptation challenging.For example, when one module (e.g.NLU) is retrained with new data, all the others (e.g DM) that depend on it become sub-optimal due to the fact that they were trained on the output distributions of the older version of the module.Although the ideal solution is to retrain the entire pipeline to ensure global optimality, this requires significant human effort.
Due to these limitations, the goal of this study is to develop a neural network-based end-to-end framework for task-oriented SDS that replaces 3 important modules: the NLU, the DST and the dialog policy with a single module that can be jointly optimized.Developing such a model for task-arXiv:1606.02560v1[cs.AI] 8 Jun 2016 oriented dialog systems faces several challenges.The foremost challenge is that a task-oriented system must learn a strategic dialog policy that can achieve the goal of a given task which is beyond the ability of standard supervised learning (Li et al., 2014).The second challenge is that often a task-oriented agent needs to interface with structured external databases, which have symbolic query formats (e.g.SQL query).In order to find answers to the users' requests from the databases, the agent must formulate a valid database query.This is difficult for conventional neural network models which do not provide intermediate symbolic representations.
This paper describes a deep reinforcement learning based end-to-end framework for both dialog state tracking and dialog policy that addresses the above-mentioned issues.We evaluated the proposed approach on a conversational game simulator that requires both language understanding and strategic planning.Our studies yield promising results 1) in jointly learning policies for state tracking and dialog strategies that are superior to a modular-based baseline, 2) in efficiently incorporating various types of labelled data and 3) in learning dialog state representations.
Section 2 of the paper discusses related work; Section 3 reviews the basics of deep reinforcement learning; Section 4 describes the proposed framework; Section 5 gives experimental results and model analysis; and Section 6 concludes.

Related Work
Dialog State Tracking: The process of constantly representing the state of the dialog is called dialog state tracking (DST).Most industrial systems use rule-based heuristics to update the dialog state by selecting a high-confidence output from the NLU (Williams et al., 2013).Numerous advanced statistical methods have been proposed to exploit the correlation between turns to make the system more robust given the uncertainty of the automatic speech recognition (ASR) and the NLU (Bohus and Rudnicky, 2006;Thomson and Young, 2010).The Dialog State Tracking Challenge (DSTC) (Williams et al., 2013) formalizes the problem as a supervised sequential labelling task where the state tracker estimates the true slot values based on a sequence of NLU outputs.In practice the output of the state tracker is used by a different dialog policy, so that the distribution in the training data and in the live data are mismatched (Williams et al., 2013).Therefore one of the basic assumptions of DSTC is that the state tracker's performance will translate to better dialog policy performance.Lee (2014) showed positive results following this assumption by showing a positive correlation between end-to-end dialog performance and state tracking performance.
Reinforcement Learning (RL): RL has been a popular approach for learning the optimal dialog policy of a task-oriented dialog system (Singh et al., 2002;Williams and Young, 2007;Georgila and Traum, 2011;Lee and Eskenazi, 2012).A dialog policy is formulated as a Partially Observable Markov Decision Process (POMDP) which models the uncertainty existing in both the users' goals and the outputs of the ASR and the NLU.Williams (2007) showed that POMDP-based systems perform significantly better than rule-based systems especially when the ASR word error rate (WER) is high.Other work has explored methods that improve the amount of training data needed for a POMDP-based dialog manager.Gašić (2010) utilized Gaussian Process RL algorithms and greatly reduced the data needed for training.Existing applications of RL to dialog management assume a given dialog state representation.Instead, our approach learns its own dialog state representation from the raw dialogs along with a dialog policy in an end-to-end fashion.
End-to-End SDSs: There have been many attempts to develop end-to-end chat-oriented dialog systems that can directly map from the history of a conversation to the next system response (Vinyals and Le, 2015;Serban et al., 2015).These methods train sequence-to-sequence models (Sutskever et al., 2014) on large human-human conversation corpora.The resulting models are able to do basic chatting with users.The work in this paper differs from them by focusing on building a task-oriented system that can interface with structured databases and provide real information to users.
Recently, Wen el al. (2016) introduced a network-based end-to-end trainable taskedoriented dialog system.Their approach treated a dialog system as a mapping problem between the dialog history and the system response.They learned this mapping via a novel variant of the encoder-decoder model.The main differences between our models and theirs are that ours has the advantage of learning a strategic plan using RL and jointly optimizing state tracking beyond standard supervised learning.

Deep Reinforcement Learning
Before describing the proposed algorithms, we briefly review deep reinforcement learning (RL).RL models are based on the Markov Decision Process (MDP).An MDP is a tuple (S, A, P, γ, R), where S is a set of states; A is a set of actions; P defines the transition probability P (s |s, a); R defines the expected immediate reward R(s, a); and γ ∈ [0, 1) is the discounting factor.The goal of reinforcement learning is to find the optimal policy π * , such that the expected cumulative return is maximized (Sutton and Barto, 1998).MDPs assume full observability of the internal states of the world, which is rarely true for real-world applications.The Partially Observable Markov Decision Process (POMDP) takes the uncertainty in the state variable into account.A POMDP is defined by a tuple (S, A, P, γ, R, O, Z).O is a set of observations and Z defines an observation probability P (o|s, a).The other variables are the same as the ones in MDPs.Solving a POMDP usually requires computing the belief state b(s), which is the probability distribution of all possible states, such that s b(s) = 1.It has been shown that the belief state is sufficient for optimal control (Monahan, 1982), so that the objective is to find π * : b → a that maximizes the expected future return.

Deep Q-Network
The deep Q-Network (DQN) introduced by Mnih (2015) uses a deep neural network (DNN) to parametrize the Q-value function Q(s, a; θ) and achieves human-level performance in playing many Atari games.DQN keeps two separate models: a target network θ − i and a behavior network θ i .For every K new samples, DQN uses θ − i to compute the target values y DQN and updates the parameters in θ i .Only after every C updates, the new weights of θ i are copied over to θ − i .Furthermore, DQN utilizes experience replay to store all previous experience tuples (s, a, r, s ).Before a new model update, the algorithm samples a minibatch of experiences of size M from the memory and computes the gradient of the following loss function: Recently, Hasselt et al. (2015) leveraged the overestimation problem of standard Q-Learning by introducing double DQN and Schaul et al. (2015) improves the convergence speed of DQN via prioritized experience replay.We found both modifications useful and included them in our studies.

Deep Recurrent Q-Network
An extension to DQN is a Deep Recurrent Q-Network (DRQN) which introduces a Long Short-Term Memory (LSTM) layer (Hochreiter and Schmidhuber, 1997) on top of the convolutional layer of the original DQN model (Hausknecht and Stone, 2015) which allows DRQN to solve POMDPs.The recurrent neural network can thus be viewed as an approximation of the belief state that can aggregate information from a sequence of observations.Hausknecht (2015) shows that DRQN performs significantly better than DQN when an agent only observes partial states.A similar model was proposed by Narasimhan and Kulkarni (2015) and learns to play Multi-User Dungeon (MUD) games (Curtis, 1992) with game states hidden in natural language paragraphs.
4 Proposed Model

Overview
End-to-end learning refers to models that can back-propagate error signals from the end output to the raw inputs.Prior work in end-to-end state tracking (Henderson et al., 2014) learns a sequential classifier that estimates the dialog state based on ASR output without the need of an NLU.Instead of treating state tracking as a standard supervised learning task, we propose to unify dialog state tracking with the dialog policy so that both are treated as actions available to a reinforcement learning agent.Specifically, we learn an optimal policy that either generates a verbal response or modifies the current estimated dialog state based on the new observations.This formulation makes it possible to obtain a state tracker even without the labelled data required for DSTC, as long as the rewards from the users and the databases are available.Furthermore, in cases where dialog state tracking labels are available, the proposed model can incorporate them with minimum modification and greatly accelerate its learning speed.Thus, the following sections describe two models: RL and Hybrid-RL, corresponding to two labelling scenarios: 1) only dialog success labels and 2) dialog success labels with state tracking labels.

Learning from the Users and Databases
Figure 2 shows an overview of the framework.We consider a task-oriented dialog task, in which there are S slots, each with cardinality C i , i ∈ [0, S).
The environment consists of a user, E u and a database E db .The agent can send verbal actions, a v ∈ A v to the user and the user will reply with natural language responses o u and rewards r u .In order to interface with the database environment E db , the agent can apply special actions a h ∈ A h that can modify a query hypothesis h.The hypothesis is a slot-filling form that represents the most likely slot values given the observed evidence.Given this hypothesis, h, the database can perform a normal query and give the results as observations, o db and rewards r db .At each turn t, the agent applies its selected action a t ∈ {A v , A h } and receives the observations from either the user or the database.We can then define the observation o t of turn t as, We then use the LSTM network as the dialog state tracker that is capable of aggregating information over turns and generating a dialog state representation, b t = LST M (o t , b t−1 ), where b t is an approximation of the belief state at turn t.Finally, the dialog state representation from the LSTM network is the input to S + 1 policy networks implemented as Multilayer Perceptrons (MLP).The first policy network approximates the Q-value function for all verbal actions Q(b t , a v ) while the rest estimate the Q-value function for each slot, Q(b t , a v ), as shown in Figure 3.

Incorporating State Tracking Labels
The pure RL approach described in the previous section could suffer from slow convergence when the cardinality of slots is large.This is due to the nature of reinforcement learning: that it has to try different actions (possible values of a slot) in order to estimate the expected long-term payoff.On the other hand, a supervised classifier can learn much more efficiently.A typical multi-class classification loss function (e.g.categorical cross entropy) assumes that there is a single correct label such that it encourages the probability of the correct label and suppresses the probabilities of the all the wrong ones.Modeling dialog state tracking as a Q-value function has advantages over a local classifier.For instance, take the situation where a user wants to send an email and the state tracker needs to estimate the user's goal from among three possible values: send, edit and delete.In a classification task, all the incorrect labels (edit, delete) are treated as equally undesirable.However, the cost of mistakenly recognizing the user goal as delete is much larger than edit, which can only be learned from the future rewards.In order to train the slot-filling policy with both short-term and long-term supervision signals, we decompose the reward function for A h into two parts: where P (a h |b) is the conditional probability that the correct label of the slots is a h given the cur-rent belief state.In practice, instead of training a separate model estimating P (a h |b), we can replace P (a h |b) by 1(y = a h ) as the sample reward r, where y is the label.Furthermore, a key observation is that although it is expensive to collect data from the user E u , one can easily sample trajectories of interaction with the database since P (b |b, a h ) is known.Therefore, we can accelerate learning by generating synthetic experiences, i.e. tuple (b, a h , r, b )∀a h ∈ A h and add them to the experience replay buffer.This approach is closely related to the Dyna Q-Learning proposed in (Sutton, 1990).The difference is that Dyna Qlearning uses the estimated environment dynamics to generating experiences, while our method only uses the known transition function (i.e. the dynamics of the database) to generate synthetic samples.

Implementation Details
We can optimize the network architecture in several ways to improve its efficiency: Shared State Tracking Policies: it is more efficient to tie the weights of the policy networks for similar slots and use the index of slot as an input.This can reduce the number of parameters that needs to be learned and encourage shared structures.The studies in Section 5 illustrates one example.
Constrained Action Mask: We can constrain the available actions at each turn to force the agent to alternate between verbal response and slot-filling.We define A mask as a function that takes state s and outputs a set of available actions for: Reward Shaping based on the Database: the reward signals from the users are usually sparse (at the end of a dialog), the database, however, can provide frequent rewards to the agent.Reward shaping is a technique used to speed up learning.Ng et al. (1999) showed that potential-based reward shaping does not alter the optimal solution; it only impacts the learning speed.The pseudo reward function F (s, a, s ) is defined as: R(s, a, s ) = R(s, a, s ) + F (s, a, s ) (8) Let the total number of entities in the database be D and P max be the max potential, the potential φ(s) is: The intuition of this potential function is to encourage the agent to narrow down the possible range of valid entities as quickly as possible.Meanwhile, if no entities are consistent with the current hypothesis, this implies that there are mistakes in previous slot filling, which gives a potential of 0.

20Q Game as Task-oriented Dialog
In order to test the proposed framework, we chose the 20 Question Game (20Q).The game rules are as follows: at the beginning of each game, the user thinks of a famous person.Then the agent asks the user a series of Yes/No questions.The user honestly answers, using one of three answers: yes, no or I don't know.In order to have this resemble a dialog, our user can answer with any natural utterance representing one of the three intents.The agent can make guesses at any turn, but a wrong guess results in a negative reward.The goal is to guess the correct person within a maximum number of turns with the least number of wrong guesses.An example game conversation is as follows: Sys: Is this person male?User: Yes I think so.Sys: Is this person an artist?User: He is not an artist.... Sys: I guess this person is Bill Gates.User: Correct.
We can formulate the game as a slot-filling dialog.Assume the system has |Q| available questions to select from at each turn.The answer to each question becomes a slot and each slot has three possible values: yes/no/unknown.Due to the length limit and wrong guess penalty, the optimal policy does not allow the agent to ask all of the questions regardless of the context or guess every person in the database one by one.

Simulator Construction
We constructed a simulator for 20Q.The simulator has two parts: a database of 100 famous people and a user simulator.
We selected 100 people from Freebase (Bollacker et al., 2008), each of them has 6 attributes: birthplace, degree, gender, profession and birthday.We manually designed several Yes/No questions for each attribute that is available to the agent.Each question covers a different set of possible values for a given attribute and thus carries a different discriminative power to pinpoint the person that the user is thinking of.As a result, the agent needs to judiciously select the question, given the context of the game, in order to narrow down the range of valid people.There are 31 questions.Table 1 shows a summary.

Attribute
Q a Example Question Birthday 3 Was he/she born before 1950?Birthplace 9 Was he/she born in USA?Degree 4 Does he/she have a PhD? Gender 2 Is this person male?Profession 8 Is he/she an artist?Nationality 5 Is he/she a citizen of an Asian country?
Table 1: Summary of the available questions.Q a is the number of questions for attribute a.
At the beginning of each game, the simulator will first uniformly sample a person from the database as the person it is thinking of.Also there is a 5% chance that the simulator will consider unknown as an attribute and thus it will answer with unknown intent for any question related to it.After the game begins, when the agent asks a question, the simulator first determines the answer (yes, no or unknown) and replies using natural language.In order to generate realistic natural language with the yes/no/unknown intent, we collected utterances from the Switchboard Dialog Act (SWDA) Corpus (Jurafsky et al., 1997).Table 2 presents the mapping from the SWDA dialog acts to yes/no/unknown.We further post-processed results and removed irrelevant utterances, which led to 508, 445 and 251 unique utterances with intent respectively yes/no/unknown.We keep the frequency counts for each unique expression.Thus at run time, the simulator can sample a response according to the original distribution in the SWDA Corpus.

SWDA tags Yes
Agree, Yes answers, Affirmative non-yes answers No No answers, Reject, Negative non-no answers Unknown Maybe, Other Answer Table 2: Dialog act mapping from SWDA to yes/no/unknown A game is terminated when one of the four conditions is fulfilled: 1) the agent guesses the correct answer, 2) there are no people in the database consistent with the current hypothesis, 3) the max game length (100 steps) is reached and 4) the max number of guesses is reached (10 guesses).Only if the agent guesses the correct answer (condition 1) treated as a game victory.The win and loss rewards are 30 and −30 and a wrong guess leads to a −5 penalty.

Training Details
The user environment E u is the simulator that only accepts verbal actions, either a Yes/No question or a guess, and replies with a natural language utterance.Therefore A v contains |Q| + 1 actions, in which the first |Q| actions are questions and the last action makes a guess, given the results from the database.
The database environment reads in a query hypothesis h and returns a list of people that satisfy the constraints in the query.h has a size of |Q| and each dimension can be one of the three values: yes/no/unknown.Since the cardinality for all slots is the same, we only need 1 slot-filling policy network with 3 Q-value outputs for yes/no/unknown, to modify the value of the latest asked question, which is the shared policy approach mentioned in Section 4. Thus A h = {yes, no, unknown}.For example, considering Q = 3 and the hypothesis h is: [unknown, unknown, unknown].If the latest asked question is Q 1 (1-based), then applying action a h = yes will result in the new hypothesis: [yes, unknown, unknown].
To represent the observation o t in vectorial form, we use a bag-of-bigrams feature vector to represent a user utterance; a one-hot vector to represent a system action and a single discrete number to represent the number of people satisfying the current hypothesis.
The hyper-parameters of the neural network model are as follows: the size of turn embedding is 30; the size of LSTMs is 256; each policy network has a hidden layer of 128 with tanh nonlinear activation.We also add a dropout rate of 0.3 for both LSTMs and tanh layer outputs.The network has a total of 470,005 parameters.The network was trained through RM SP rop (Tieleman and Hinton, 2012).For hyper-parameters of DRQN, the behavior network was updated every 4 steps and the interval between each target network update C is 1000.The reward shaping constant P max is 2 and the discounting factor γ is 0.99.The resulting network was evaluated every 5000 steps and the model was trained up to 120,000 steps.Each evaluation records the agent's performance with a greedy policy for 200 independent episodes.

Dialog Policy Analysis
We compare the performance of three models: a strong modular baseline, RL and Hybrid-RL.The baseline has an independently trained state tracker and dialog policy.The state tracker is also an LSTM-based classifier that inputs a dialog history and predicts the slot-value of the latest question.The dialog policy is a DRQN that assumes perfect slot-filling during training and simply controls the next verbal action.Thus the essential difference between the baseline and the proposed models is that the state tracker and dialog policy are not trained jointly.Also, since hybrid-RL effectively changes the reward function, the typical average cumulative reward metric is not applicable for performance comparison.Therefore, we directly compare the win rate and average game length in later discussions.Table 3 shows that both proposed models achieve significantly higher win rate than the baseline by asking more questions before making guesses.Figure 4 illustrates the learning process of the three models.The baseline model has the fastest learning speed but its performance saturated quickly because the dialog policy was not trained together with the state tracker.So the dia-log policy is not aware of the uncertainty in slotfilling and the slot-filler does not distinguish between the consequences of different wrong labels (e.g classify yes to no versus to unknown).On the other hand, although RL reaches high performance at the end of the training, it struggles in the early stages and suffers from slow convergence.This is due to that fact that correct slot-filling is a prerequisite for winning 20Q, while the reward signal has a long delayed horizon in the RL approach.Finally, the hybrid-RL approach is able to converge to the optimal solution much faster than RL due to the fact that it efficiently exploits the information in the state tracking label.

State Tracking Analysis
One of the hypotheses is that the RL approach can learn a good state tracker using only dialog success reward signals.We ran the best trained models using a greedy policy and collected 10,000 samples.cate that the RL model learns a completely different strategy compared to the baseline.The RL model aims for high precision so that it predicts unknown when the input is ambiguous, which is a safer option than predicting yes/no, because confusing between yes and no may potentially lead to a contradiction and a game failure.This is very different from the baseline which does not distinguish between incorrect labels.Therefore, although the baseline achieves better classification metrics, it does not take into account the longterm payoff and performs sub-optimally in terms of overall performance.

Dialog State Representation Analysis
Tracking the state over multiple turns is crucial because the agent's optimal action depends on the history, e.g. the question it has already asked, the number of guesses it has spent.Furthermore, one of the assumptions is that the output of the LSTM network is an approximation of the belief state in the POMDP.We conducted two studies to test these hypotheses.For both studies, we ran the Hybrid-RL models saved at 20K, 50K and 100K steps against the simulator with a greedy policy and recorded 10,000 samples for each model.
The first study checks whether we can reconstruct an important state feature: the number of guesses the agent has made from the dialog state embedding.We divide the collected 10,000 samples into 80% for training and 20% for testing.We used the LSTM output as input features to a linear regression model with l2 regularization.Table 5 shows the correlation of determination r 2 increases for the model that was trained with more data.

Model
20K 50K 100K r 2 0.05 0.51 0.77 Table 5: r 2 of the linear regression for predicting the number of guesses in the test dataset.
The second study is a retrieval task.The latent state of the 20Q game is the true intent of the users' answers to all the questions that have been asked so far.Therefore, the true state vector, s has a size of 31 and each slot, s[k], k ∈ [0, 31) is one of the four values: not yet asked, yes, no, unknown.Therefore, if the LSTM output b is in fact implicitly learning the distribution over this latent state s, they must be highly correlated for a well-trained model.Therefore, for each sample, b i , i ∈ [0, 10, 000), we find its nearest 5 neighbors N (b i ) : B → [S] based on cosine distance measuring, and compute the probability that s i 's true states differ from the ones of the retrieved neighbors:  Figure 5 shows that the retrieval error continues to decrease for the model that was trained better, which confirms our assumption that the LSTM output is an approximation of the belief state.

Conclusion
This paper identifies the limitations of the conventional SDS pipeline and describes a novel end-toend framework for a task-oriented dialog system using deep reinforcement learning.We have assessed the model on the 20Q game.The proposed models show superior performance for both natural language understanding and dialog strategy.Furthermore, our analysis confirms our hypotheses that the proposed models implicitly capture essential information in the latent dialog states.
Future studies will include developing fullfledged task-orientated dialog systems using the proposed approach and exploring methods that allow easy integration of domain knowledge so that the system can be more easily debugged and corrected.

Figure 1 :
Figure 1: Conventional pipeline of an SDS.The proposed model replaces the modules in the dotted-line box with one end-to-end model.

Figure 2 :
Figure 2: An overview of the proposed end-to-end task-oriented dialog management framework.

Figure 3 :
Figure 3: The network takes the observation o t at turn t.The recurrent unit updates its hidden state based on both the history and the current turn embedding.Then the model outputs the Q-values for all actions.The policy network in grey is masked by the action mask

Figure 4 :
Figure 4: Graphs showing the evolution of the win rate during training.

4
n=0 1(N (b i )[n][k] = s i [k]) 5 (12)where 1 is the indicator function, k is the slot index and n is the neighbor index.

Figure 5 :
Figure 5: Performance of retrieving similar true dialog states using learned dialog state embedding.

Table 3 :
Performance of the three systems

Table 4 :
Table 4 reports the precision and recall of slot filling in these trajectories.The results indi-State tracking performance of the three systems.The results are in the format of precision/recall