Deep Dyna-Q: Integrating Planning for Task-Completion Dialogue Policy Learning

Training a task-completion dialogue agent via reinforcement learning (RL) is costly because it requires many interactions with real users. One common alternative is to use a user simulator. However, a user simulator usually lacks the language complexity of human interlocutors and the biases in its design may tend to degrade the agent. To address these issues, we present Deep Dyna-Q, which to our knowledge is the first deep RL framework that integrates planning for task-completion dialogue policy learning. We incorporate into the dialogue agent a model of the environment, referred to as the world model, to mimic real user response and generate simulated experience. During dialogue policy learning, the world model is constantly updated with real user experience to approach real user behavior, and in turn, the dialogue agent is optimized using both real experience and simulated experience. The effectiveness of our approach is demonstrated on a movie-ticket booking task in both simulated and human-in-the-loop settings.


Introduction
Learning policies for task-completion dialogue is often formulated as a reinforcement learning (RL) problem (Young et al., 2013;Levin et al., 1997). However, applying RL to real-world dialogue systems can be challenging, due to the constraint that an RL learner needs an environment to operate in. In the dialogue setting, this requires a dialogue agent to interact with real users and adjust its policy in an online fashion, as illustrated in Figure 1(a). Unlike simulation-based games such as Atari games (Mnih et al., 2015) and AlphaGo (Silver et al., 2016a(Silver et al., , 2017 where RL has made its greatest strides, task-completion dialogue systems may incur significant real-world cost in case of failure. Thus, except for very simple tasks (Singh et al., 2002;Gašić et al., 2010Gašić et al., , 2011Pietquin et al., 2011;Li et al., 2016a;Su et al., 2016b), RL is too expensive to be applied to real users to train dialogue agents from scratch.
One strategy is to convert human-interacting dialogue to a simulation problem (similar to Atari games), by building a user simulator using human conversational data (Schatzmann et al., 2007;Li et al., 2016b). In this way, the dialogue agent can learn its policy by interacting with the simulator instead of real users (Figure 1(b)). The simulator, in theory, does not incur any real-world cost and can provide unlimited simulated experience for reinforcement learning. The dialogue agent trained with such a user simulator can then be deployed to real users and further enhanced by only a small number of human interactions. Most of recent studies in this area have adopted this strategy (Su et al., 2016a;Lipton et al., 2016;Zhao and Eskenazi, 2016;Williams et al., 2017;Dhingra et al., 2017;Liu and Lane, 2017;Peng et al., 2017b;Budzianowski et al., 2017;Peng et al., 2017a).
However, user simulators usually lack the conversational complexity of human interlocutors, and the trained agent is inevitably affected by biases in the design of the simulator. Dhingra et al. (2017) demonstrated a significant discrepancy in a simulator-trained dialogue agent when evaluated with simulators and with real users. Even more challenging is the fact that there is no universally accepted metric to evaluate a user simulator (Pietquin and Hastie, 2013). Thus, it remains controversial whether training task-completion dialogue agent via simulated users is a valid approach.
We propose a new strategy of learning dialogue policy by interacting with real users. Compared to previous works (Singh et al., 2002;Li et al., 2016a;Su et al., 2016b;Papangelis, 2012), our dialogue agent learns in a much more efficient way, using only a small number of real user interactions, which amounts to an affordable cost in many nontrivial dialogue tasks.
Our approach is based on the Dyna-Q framework (Sutton, 1990) where planning is integrated into policy learning for task-completion dialogue. Specifically, we incorporate a model of the environment, referred to as the world model, into the dialogue agent, which simulates the environment and generates simulated user experience. During the dialogue policy learning, real user experience plays two pivotal roles: first, it can be used to improve the world model and make it behave more like real users, via supervised learning; second, it can also be used to directly improve the dialogue policy via RL. The former is referred to as world model learning, and the latter direct reinforcement learning. Dialogue policy can be improved either using real experience directly (i.e., direct reinforcement learning) or via the world model indirectly (referred to as planning or indirect reinforcement learning). The interaction between world model learning, direct reinforcement learning and planning is illustrated in Figure 1(c), following the Dyna-Q framework (Sutton, 1990).
The original papers on Dyna-Q and most its early extensions used tabular methods for both planning and learning (Singh, 1992;Peng and Williams, 1993;Moore and Atkeson, 1993;Kuvayev and Sutton, 1996). This table-lookup representation limits its application to small problems only. Sutton et al. (2012) extends the Dyna architecture to linear function approximation, making it applicable to larger problems. In the dialogue setting, we are dealing with a much larger action-state space. Inspired by Mnih et al. (2015), we propose Deep Dyna-Q (DDQ) by combining Dyna-Q with deep learning approaches to representing the state-action space by neural networks (NN).
By employing the world model for planning, the DDQ method can be viewed as a model-based RL approach, which has drawn growing interest in the research community. However, most model-based RL methods (Tamar et al., 2016;Silver et al., 2016b;Gu et al., 2016;Racanière et al., 2017) are developed for simulation-based, synthetic problems (e.g., games), but not for human-in-the-loop, real-world problems. To these ends, our main contributions in this work are two-fold: • We present Deep Dyna-Q, which to the best of our knowledge is the first deep RL framework that incorporates planning for taskcompletion dialogue policy learning. • We demonstrate that a task-completion dialogue agent can efficiently adapt its policy on the fly, by interacting with real users via RL. This results in a significant improvement in success rate on a nontrivial task.  As illustrated in Figure 1(c), starting with an initial dialogue policy and an initial world model (both trained with pre-collected human conversational data), the training of the DDQ agent consists of three processes: (1) direct reinforcement learning, where the agent interacts with a real user, collects real experience and improves the dialogue policy; (2) world model learning, where the world model is learned and refined using real experience; and (3) planning, where the agent improves the dialogue policy using simulated experience.
Although these three processes conceptually can occur simultaneously in the DDQ agent, we implement an iterative training procedure, as shown in Algorithm 1, where we specify the order in which they occur within each iteration. In what follows, we will describe these processes in details.

Direct Reinforcement Learning
In this process (lines 5-18 in Algorithm 1) we use the DQN method (Mnih et al., 2015) to improve the dialogue policy based on real experience. We consider task-completion dialogue as a Markov Decision Process (MDP), where the agent inter-acts with a user in a sequence of actions to accomplish a user goal. In each step, the agent observes the dialogue state s, and chooses the action a to execute, using an -greedy policy that selects a random action with probability or otherwise follows the greedy policy a = argmax a Q(s, a ; θ Q ). Q(s, a; θ Q ) which is the approximated value function, implemented as a Multi-Layer Perceptron (MLP) parameterized by θ Q . The agent then receives reward 3 r, observes next user response a u , and updates the state to s . Finally, we store the experience (s, a, r, a u , s ) in the replay buffer D u . The cycle continues until the dialogue terminates.
We improve the value function Q(s, a; θ Q ) by adjusting θ Q to minimize the mean-squared loss function, defined as follows: where γ ∈ [0, 1] is a discount factor, and Q (.) is the target value function that is only periodically updated (line 42 in Algorithm 1). By differentiating the loss function with respect to θ Q , we arrive at the following gradient: As shown in lines 16-17 in Algorithm 1, in each iteration, we improve Q(.) using minibatch Deep Q-learning.

Planning
In the planning process (lines 23-41 in Algorithm 1), the world model is employed to generate simulated experience that can be used to improve dialogue policy. K in line 24 is the number of planning steps that the agent performs per step of direct reinforcement learning. If the world model is able to accurately simulate the environment, a big K can be used to speed up the policy learning. In DDQ, we use two replay buffers, D u for storing real experience and D s for simulated experience. Learning and planning are accomplished

Algorithm 1 Deep Dyna-Q for Dialogue Policy Learning
Require: N , , K, L, C, Z Ensure: Q(s, a; θQ), M (s, a; θM ) 1: initialize Q(s, a; θQ) and M (s, a; θM ) via pre-training on human conversational data 2: initialize Q (s, a; θ Q ) with θ Q = θQ 3: initialize real experience replay buffer D u using Reply Buffer Spiking (RBS), and simulated experience replay buffer D s as empty 4: for n=1:N do 5: # Direct Reinforcement Learning starts 6: user starts a dialogue with user action a u 7: generate an initial dialogue state s 8: while s is not a terminal state do 9: with probability select a random action a 10: otherwise select a = argmax a Q(s, a ; θQ) 11: execute a, and observe user response a u and reward r 12: update dialogue state to s 13: store (s, a, r, a u , s ) to D u 14: s = s 15: end while 16: sample random minibatches of (s, a, r, s ) from D u 17: update θQ via Z-step minibatch Q-learning according to Equation (2)  18: # Direct Reinforcement Learning ends 19: # World Model Learning starts 20: sample random minibatches of training samples (s, a, r, a u , s ) from D u 21: update θM via Z-step minibatch SGD of multi-task learning 22: # World Model Learning ends 23: # Planning starts 24: for k=1:K do 25: t = FALSE, l = 0 26: sample a user goal G 27: sample user action a u from G 28: generate an initial dialogue state s 29: while t is FALSE ∧ l ≤ L do 30: with probability select a random action a 31: otherwise select a = argmax a Q(s, a ; θQ) 32: execute a 33: world model responds with a u , r and t 34: update dialogue state to s 35: store (s, a, r, s ) to D s 36: l = l + 1, s = s 37: end while 38: sample random minibatches of (s, a, r, s ) from D s 39: update θQ via Z-step minibatch Q-learning according to Equation (2)  40: end for 41: # Planning ends 42: every C steps reset θ Q = θQ 43: end for by the same DQN algorithm, operating on real experience in D u for learning and on simulated experience in D s for planning. Thus, here we only describe the way the simulated experience is generated.
Similar to Schatzmann et al. (2007), at the beginning of each dialogue, we uniformly draw a user goal G = (C, R), where C is a set of con-straints and R is a set of requests (line 26 in Algorithm 1). For movie-ticket booking dialogues, constraints are typically the name and the date of the movie, the number of tickets to buy, etc. Requests can contain these slots as well as the location of the theater, its start time, etc. Table 3 presents some sampled user goals and dialogues generated by simulated and real users, respectively. The first user action a u (line 27) can be either a request or an inform dialogueact. A request, such as request(theater; moviename=batman), consists of a request slot and multiple ( 1) constraint slots, uniformly sampled from R and C, respectively. An inform contains constraint slots only. The user action can also be converted to natural language via NLG, e.g., "which theater will show batman?" In each dialogue turn, the world model takes as input the current dialogue state s and the last agent action a (represented as an one-hot vector), and generates user response a u , reward r, and a binary variable t, which indicates whether the dialogue terminates (line 33). The generation is accomplished using the world model M (s, a; θ M ), a MLP shown in Figure 3, as follows: where (s, a) is the concatenation of s and a, and W and b are parameter matrices and vectors, respectively.
Task-Specific Representation s: state a: agent action a u r t

Shared layers
Figure 3: The world model architecture.

World Model Learning
In this process (lines 19-22 in Algorithm 1), M (s, a; θ M ) is refined via minibatch SGD using real experience in the replay buffer D u . As shown in Figure 3, M (s, a; θ M ) is a multi-task neural network (Liu et al., 2015) that combines two classification tasks of simulating a u and t, respectively, and one regression task of simulating r. The lower layers are shared across all tasks, while the top layers are task-specific.

Experiments and Results
We evaluate the DDQ method on a movie-ticket booking task in both simulation and human-in-theloop settings.

Dataset
Raw conversational data in the movie-ticket booking scenario was collected via Amazon Mechanical Turk. The dataset has been manually labeled based on a schema defined by domain experts, as shown in Table 4, which consists of 11 dialogue acts and 16 slots. In total, the dataset contains 280 annotated dialogues, the average length of which is approximately 11 turns.

Dialogue Agents for Comparison
To benchmark the performance of DDQ, we have developed different versions of task-completion dialogue agents, using variations of Algorithm 1.
• A DQN agent is learned by standard DQN, implemented with direct reinforcement learning only (lines 5-18 in Algorithm 1) in each epoch. • The DDQ(K) agents are learned by DDQ of Algorithm 1, with an initial world model pretrained on human conversational data, as described in Section 3.1. K is the number of planning steps. We trained different versions of DDQ(K) with different K's. • The DDQ(K, rand-init θ M ) agents are learned by the DDQ method with a randomly initialized world model. • The DQN(K) agents are learned by DQN, but with K times more real experiences than the DQN agent. DQN(K) is evaluated in the simulation setting only. Its performance can be viewed as the upper bound of its DDQ(K) counterpart, assuming that the world model in DDQ(K) perfectly matches real users.
Implementation Details All the models in these agents (Q(s, a; θ Q ), M (s, a; θ M )) are MLPs with tanh activations. Each policy network Q(.) has one hidden layer with 80 hidden nodes. As shown in Figure 3, the world model M (.) contains two shared hidden layers and three task-specific hidden layers, with 80 nodes in each. All the agents are trained by Algorithm 1 with the same set of hyper-parameters. -greedy is always applied for exploration. We set the discount factor γ = 0.95. The buffer sizes of both D u and D s are set to 5000. The target value function is updated at the end of each epoch. In each epoch, Q(.) and M (.) are refined using one-step (Z = 1) 16-tupleminibatch update. 4 In planning, the maximum length of a simulated dialogue is 40 (L = 40).
In addition, to make the dialogue training efficient, we also applied a variant of imitation learning, called Reply Buffer Spiking (RBS) (Lipton et al., 2016). We built a naive but occasionally successful rule-based agent based on human conversational dataset (line 1 in Algorithm 1), and prefilled the real experience replay buffer D u with 100 dialogues of experience (line 2) before training for all the variants of agents.

Simulated User Evaluation
In this setting the dialogue agents are optimized by interacting with user simulators, instead of real users. Thus, the world model is learned to mimic user simulators. Although the simulator-trained agents are sub-optimal when applied to real users due to the discrepancy between simulators and real users, the simulation setting allows us to perform a detailed analysis of DDQ without much cost and to reproduce the experimental results easily. 4 We found in our experiments that setting Z > 1 improves the performance of all agents, but does not change the conclusion of this study: DDQ consistently outperforms DQN by a statistically significant margin. Conceptually, the optimal value of Z used in planning is different from that in direct reinforcement learning, and should vary according to the quality of the world model. The better the world model is, the more aggressive update (thus bigger Z) is being used in planning. We leave it to future work to investigate how to optimize Z for planning in DDQ.  (10)  User Simulator We adapted a publicly available user simulator (Li et al., 2016b) to the taskcompletion dialogue setting. During training, the simulator provides the agent with a simulated user response in each dialogue turn and a reward signal at the end of the dialogue. A dialogue is considered successful only when a movie ticket is booked successfully and when the information provided by the agent satisfies all the user's constraints. At the end of each dialogue, the agent receives a positive reward of 2 * L for success, or a negative reward of −L for failure, where L is the maximum number of turns in each dialogue, and is set to 40 in our experiments. Furthermore, in each turn, the agent receives a reward of −1, so that shorter dialogues are encouraged. Readers can refer to Appendix B for details on the user simulator.

Results
The main simulation results are reported in Table 1 and Figures 4 and 5. For each agent, we report its results in terms of success rate, average reward, and average number of turns (averaged over 5 repetitions of the experiments). Results show that the DDQ agents consistently outperform DQN with a statistically significant margin. Figure 4 shows the learning curves of different DDQ agents trained using different planning steps. Since the training of all RL agents started with RBS using the same rule-based agent, their performance in the first few epochs is very close.  and DDQ(10) took only 50 epochs. Intuitively, the optimal value of K needs to be determined by seeking the best trade-off between the quality of the world model and the amount of simulated experience that is useful for improving the dialogue agent. This is a non-trivial optimization problem because both the dialogue agent and the world model are updated constantly during training and the optimal K needs to be adjusted accordingly. For example, we find in our experiments that at the early stages of training, it is fine to perform planning aggressively by using large amounts of simulated experience even though they are of low quality, but in the late stages of training where the dialogue agent has been significantly improved, low-quality simulated experience is likely to hurt the performance. Thus, in our implementation of Algorithm 1, we use a heuristic 5 to reduce the value of K in the late stages of training (e.g., after 150 epochs in Figure 4) to mitigate the negative impact of low-qualify simulated experience. We leave it to future work how to optimize the planning step size during DDQ training in a principled way. Figure 5 shows that the quality of the world model has a significant impact on the agent's performance. The learning curve of DQN(10) indicates the best performance we can expect with a perfect world model. With a pre-trained world model, the performance of the DDQ agent improves more rapidly, although eventually, the DDQ and DDQ(rand-init θ M ) agents reach the same success rate after many epochs. The world model learning process is crucial to both the efficiency of dialogue policy learning and the final performance of the agent. For example, in the early stages (before 60 epochs), the performances of DDQ and DDQ(fixed θ M ) remain very close to each other, but DDQ reaches a success rate almost 5 The heuristic is not presented in Algorithm 1. Readers can refer to the released source code for details. 10% better than DDQ(fixed θ M ) after 400 epochs.

Human-in-the-Loop Evaluation
In this setting, five dialogue agents (i.e., DQN, DDQ(10), DDQ(10, rand-init θ M ), DDQ(5), and DDQ(5, rand-init θ M )) are trained via RL by interacting with real human users. In each dialogue session, one of the agents was randomly picked to converse with a user. The user was presented with a user goal sampled from the corpus, and was instructed to converse with the agent to complete the task. The user had the choice of abandoning the task and ending the dialogue at any time, if she or he believed that the dialogue was unlikely to succeed or simply because the dialogue dragged on for too many turns. In such cases, the dialogue session is considered failed. At the end of each session, the user was asked to give explicit feedback whether the dialogue succeeded (i.e., whether the movie tickets were booked with all the user constraints satisfied). Each learning curve is trained with two runs, with each run generating 150 dialogues (and K * 150 additional simulated dialogues when planning is applied). In total, we collected 1500 dialogue sessions for training all five agents.

Conclusion
We propose a new strategy for a task-completion dialogue agent to learn its policy by interacting with real users. Compared to previous work, our agent learns in a much more efficient way, using only a small number of real user interactions, which amounts to an affordable cost in many nontrivial domains. Our strategy is based on the Deep Dyna-Q (DDQ) framework where planning is integrated into dialogue policy learning. The effectiveness of DDQ is validated by human-in-theloop experiments, demonstrating that a dialogue agent can efficiently adapt its policy on the fly by interacting with real users via deep RL.
One interesting topic for future research is exploration in planning. We need to deal with the challenge of adapting the world model in a changing environment, as exemplified by the domain extension problem (Lipton et al., 2016). As pointed out by Sutton and Barto (1998), the general problem here is a particular manifestation of the conflict between exploration and exploitation. In a planning context, exploration means trying actions that may improve the world model, whereas exploitation means trying to behave in the optimal way given the current model. To this end, we want the agent to explore in the environment, but not so much that the performance would be greatly degraded.

B User Simulator
In the task-completion dialogue setting, the entire conversation is around a user goal implicitly, but the agent knows nothing about the user goal explicitly and its objective is to help the user to accomplish this goal. Generally, the definition of user goal contains two parts: • inform slots contain a number of slot-value pairs which serve as constraints from the user. • request slots contain a set of slots that user has no information about the values, but wants to get the values from the agent during the conversation. ticket is a default slot which always appears in the request slots part of user goal. To make the user goal more realistic, we add some constraints in the user goal: slots are split into two groups. Some of slots must appear in the user goal, we called these elements as Required slots. In the movie-booking scenario, it includes moviename, theater, starttime, date, numberofpeople; the rest slots are Optional slots, for example, theater chain, video format etc.
We generated the user goals from the labeled dataset mentioned in Section 3.1, using two mechanisms. One mechanism is to extract all the slots (known and unknown) from the first user turns (excluding the greeting user turn) in the data, since usually the first turn contains some or all the required information from user. The other mechanism is to extract all the slots (known and unknown) that first appear in all the user turns, and then aggregate them into one user goal. We dump these user goals into a file as the user-goal database. Every time when running a dialogue, we randomly sample one user goal from this user goal database.