Integrating planning for task-completion dialogue policy learning

Training a task-completion dialogue agent with real users via reinforcement learning (RL) can be prohibitively expensive, because it requires many interactions with users. One alternative is to resort to a user simulator, but the discrepancy between simulated and real users makes the learned policy unreliable in practice. This paper addresses these challenges by integrating planning into dialogue policy learning based on the Dyna-Q framework, providing a more sample-efficient approach to learning dialogue policies. The proposed agent consists of a planner, trained on-line with limited real user experience, that can generate large amounts of simulated experience to supplement the limited real user experience, and a policy model trained on these hybrid experiences. The effectiveness of our approach is validated on a movie-booking task in both a simulation setting and a human-in-the-loop setting.


Introduction
Dialogue policy learning for task-completion dialogue can be formulated as a sequential decision problem, and reinforcement learning has been widely explored as a way to leverage user interactions to improve dialogue agents. Typically, reinforcement learners need an environment to operate in, and a dialogue agent should learn from interactions with real users in an on-line fashion to improve and adapt over time. However, to the best of our knowledge, there has been little prior work on learning dialogue policies via RL algorithms with real users in an on-line manner. The biggest challenge is the high sample complexity of the algorithms, which demand a large number of samples from the environment, whereas dialogue policy learning with real humans is a restricted trial-and-error setting. Since it is expensive and laborious to acquire a large number of dialogue examples from real users, learning from scratch with real users via RL is impractical.
To tackle this challenge, one popular approach is to build a user simulator (Schatzmann et al., 2007; Li et al., 2016b) based on a corpus of example dialogues. Reinforcement learning agents can then be trained in an on-line fashion by interacting with simulated users. Recently, there have been several studies on developing dialogue agents for task-completion dialogue with user simulators, using deep neural networks in conjunction with model-free reinforcement learning (Su et al., 2016a; Lipton et al., 2016; Zhao and Eskenazi, 2016; Williams et al., 2017; Dhingra et al., 2017; Li et al., 2017; Liu and Lane, 2017; Peng et al., 2017b; Budzianowski et al., 2017). These studies showed that dialogue agents trained with user simulators can serve as an effective starting point, and can later be deployed to real humans and improved further via reinforcement learning.
Though the user simulator alleviates the high sample complexity of the algorithms, it raises another issue: the discrepancy between simulated users and real users. Ideally, the user simulator can provide unlimited dialogue examples at little cost, which allows the RL agent to explore a much larger policy space. However, the simulation-based approach remains controversial in the dialogue research community, because the evaluation of the user simulator itself is an open problem and there is no universally accepted metric (Pietquin and Hastie, 2013). Therefore, to the best of our knowledge, there is no standard way to build a user simulator. The discrepancy between simulated users and real users makes the learned policy unreliable in practice and hard to deploy, which may further hinder the deployment of dialogue systems in real-world settings.
Given these challenges, in this work we integrate planning into task-completion dialogue policy learning, and propose a new dialogue agent, the RL planner agent, based on the Dyna-Q framework (Sutton, 1990). First, the agent is equipped with a planner, which is trained on-line with limited real user experience to mimic the behavior of users. With this internal simulated user as a planner, the agent can plan multiple steps ahead and reason about the future, and can generate large amounts of simulated experience to supplement the limited real user experience. This avoids long sequences of trial-and-error in the real environment and leads to greater data efficiency. Then, a policy model (policy learner) can be trained on these hybrid experiences (both simulated and real).
Our simulation experiments on a movie-booking task show that the proposed agent significantly improves the sample efficiency of dialogue policy learning. Human-in-the-loop experiments further show that this data-efficient agent makes it viable to train the dialogue policy via reinforcement learning on-line with real humans, and that it can be deployed in real-world dialogue systems. Our main contributions are three-fold:
• To the best of our knowledge, this is the first work that incorporates planning into dialogue policy learning based on the Dyna-Q framework.
• The sample efficiency of the proposed agent makes it viable to learn the dialogue policy with humans in the loop via reinforcement learning in an on-line fashion for task-completion dialogue systems.
• We validate the effectiveness of the proposed approach on a movie-ticket booking task in both simulation and human-in-the-loop experiments.

Related Work
Recently, task-completion dialogue systems have attracted numerous research efforts, and there is growing interest in leveraging reinforcement learning for dialogue policy learning. One line of research works on single-domain task-completion dialogues with flat deep reinforcement learning algorithms such as DQN (Zhao and Eskenazi, 2016; Lipton et al., 2016; Li et al., 2017), Actor-Critic (Peng et al., 2017a; Liu and Lane, 2017) and policy gradients (Williams et al., 2017; Liu et al., 2017). Another line of research addresses a more complex setting, multi-domain dialogues (Gašić et al., 2015b,a; Cuayáhuitl et al., 2016; Budzianowski et al., 2017; Peng et al., 2017b). All these dialogue agents were trained with simulated users via model-free reinforcement learning.
Despite the widespread interest in exploiting reinforcement learning for task-completion dialogue systems, it is still impractical to learn the dialogue policy from scratch with real users, because RL learners need an environment to interact with, and RL algorithms typically need many samples from the environment. Dialogue policy learning with real humans, however, is a restricted trial-and-error setting, and it is quite costly to acquire large amounts of dialogue examples with real users. To overcome this limitation, one popular approach in the dialogue research community is to train RL agents using simulated users (Li et al., 2017; Liu and Lane, 2017). Many models (Eckert et al., 1997; Levin et al., 2000; Scheffler and Young, 2002; Georgila et al., 2005; Cuayáhuitl et al., 2005; Pietquin, 2006; Pietquin and Dutoit, 2006; Schatzmann et al., 2006; Frampton and Lemon, 2006; Schatzmann et al., 2007; Li et al., 2016b; Asri et al., 2016) have been introduced for user modeling in different dialogue systems. Given the reliance of the research community on user simulations, it is important to assess the quality of the simulator. How best to assess a user simulator remains an open issue, and there is no universally accepted metric (Pietquin and Hastie, 2013). One important feature of a good user simulator is coherent behavior throughout the dialogue; ideally, a good metric should measure the correlation between simulated and real human behaviors, but it is hard to find a widely accepted metric. Therefore, to the best of our knowledge, there is no standard way to build a user simulator. As a rule of thumb, most trained agents have to be evaluated on real humans as well. The controversy around user simulators also brings uncertainty and inconsistency to the dialogue agents trained on these simulated users. Dhingra et al. (2017) trained a dialogue agent for information access that achieves the highest success rate in the simulation evaluation but performs poorly against real users, which suggests that the dialogue agent overfits to the specific simulated users.
A dialogue policy can be either pre-trained with supervised learning on annotated human-human or human-machine conversation data, or trained with reinforcement learning on simulated users; the resulting dialogue agent can then serve as an effective starting point to deploy against real humans and improve further via reinforcement learning. However, the discrepancy between simulated users and real users makes the learned policy unreliable in practice and hard to deploy, which may further hinder the deployment of dialogue systems in real-world settings. So far, there is no work that directly learns the dialogue policy via RL algorithms in an on-line manner with real users for task-completion dialogue systems. Li et al. (2016a) first trained an agent to interact with a teacher (a simulator or a Turker) and to improve its question-answering ability from the teacher's feedback via reinforcement learning. Su et al. (2016b) presented an on-line reward learning method to optimize the dialogue policy with a Gaussian process model.
Deep reinforcement learning, especially model-free methods, has witnessed remarkable success in various tasks (Mnih et al., 2015; Silver et al., 2016a, 2017). However, model-free methods tend to be slower and to have high sample complexity. Model-based methods are considered more efficient. AlphaGo (Silver et al., 2016a), the world champion computer Go system, trains a policy to decide how to expand the search tree using a known environment or world dynamics. There have also been various research efforts on tasks where the environment dynamics are not easy to obtain. The classic "Dyna" algorithm (Sutton, 1990) learns a model and uses it as a pseudo environment to train a policy. The value iteration network (Tamar et al., 2016) implicitly incorporates planning into policy learning through iterative rollouts. The predictron (Silver et al., 2016b) learns an abstract model of the reward process and is only applied to evaluate a proposed action from the model. Gu et al. (2016) proposed to use locally linear models with local on-policy imagination rollouts to accelerate model-free continuous Q-learning on robotic tasks. Imagination-augmented agents (Racanière et al., 2017) improve policy learning by aggregating rollout information provided by a model. Nevertheless, their performance relies on a pre-trained observation-space model and might not scale to complex dynamic environments like dialogues.
As illustrated in Figure 1, a task-completion dialogue system mainly comprises three modules: language understanding (LU), which transforms the natural language from users into system-readable semantic frames; natural language generation (NLG), which converts system actions into natural language for the user; and, last but most important, the dialogue manager (DM), which keeps track of dialogue states and controls the dialogue policy. In this paper, we focus on dialogue policy learning, which is formulated as a Markov Decision Process (MDP) with the goal of maximizing a long-term objective associated with a reward function.

Methodology
We extend the Dyna-Q framework (Sutton, 1990) and propose the RL planner agent for dialogue policy learning. As in Dyna-Q, our RL planner agent has a planner, which is a learnable model that mimics the behavior of real users, and a model-free policy model that learns the dialogue policy by interacting with both real users and the planner. In this work, the planner model is represented with neural networks, whereas in the original Dyna framework (Sutton, 1990) the planner is a tabular method, which may restrict its planning capability to previously seen states and actions and prevent it from generalizing to unseen states and actions. The real experience generated from interaction with real users plays two roles: it is used to directly improve the policy model, and it is used to update the planner.
Thus, there are two types of process in the RL planner agent:
• Learning: a process in which the agent interacts with users to generate real experience and improve the policy.
• Planning: a process in which the agent interacts with the planner rather than with users to generate hypothetical or simulated experience and improve the policy.
The learning process is also termed the direct method; it is simpler to use and has achieved good performance on various tasks, but it requires a very large number of interactions with the environment. The planning process, also called the indirect method, is able to make fuller use of real experience and obtain a better policy with fewer environment interactions. In our proposed RL planner agent, we combine these two processes to improve dialogue policy learning with far fewer interactions with users.
As illustrated in Figure 2, for each dialogue session, the agent alternately chooses either learning or planning. When learning, the agent produces an action that is passed to the user for a response; this is termed a direct RL update. When planning, the agent produces an action that is executed by the planner; this is termed a planning update.

Figure 2: The architecture of dialogue policy learning with planning

Learning
The goal of the policy model is to learn a policy π that controls a dialogue system with states s ∈ S and actions a ∈ A so as to maximize the expected sum of rewards R over all possible dialogue trajectories, in accordance with a reward function r. The expected sum of rewards is defined as
$$\mathbb{E}[R] = \mathbb{E}\Big[\sum_{t \ge 0} \gamma^{t} r_t\Big],$$
where $r_t$ denotes the reward provided by the user or the planner at time t and γ is the discount factor. The policy π can be optimized by a variety of model-free RL algorithms, such as Deep Q-Networks (DQN) and Actor-Critic algorithms. Here we choose DQN as our policy model, as it has shown remarkable performance on a task-completion dialogue system (Li et al., 2017).
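To be more specific, the policy model estimates an action-value function $Q^*(s, a) = \mathbb{E}[R \mid s_t = s, a_t = a]$, which can be written recursively as
$$Q^*(s, a) = \mathbb{E}_{s'}\big[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\big].$$
The function $Q^*(s, a)$ is parameterized by a neural network $Q(s, a; \theta)$, whose parameters θ are optimized towards the $Q^*(s, a)$ that corresponds to the optimal policy by minimizing the following loss function:
$$L(\theta) = \mathbb{E}_{(s, a, r, s')}\big[(y - Q(s, a; \theta))^2\big], \qquad y = r + \gamma \max_{a'} Q(s', a'; \theta^-),$$
where $\theta^-$ denotes the parameters of the target network. We use RMSprop to minimize the above loss function, with gradient
$$\nabla_\theta L(\theta) = -2\, \mathbb{E}_{(s, a, r, s')}\big[(y - Q(s, a; \theta))\, \nabla_\theta Q(s, a; \theta)\big].$$
Two commonly used tricks, experience replay and target networks, are employed to boost performance and stabilize training.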
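As an illustration of the DQN update described above, the following is a minimal NumPy sketch of the temporal-difference target and the squared-error loss for one batch of transitions. The function name, the argument layout and the use of a `dones` mask to stop bootstrapping at the end of a dialogue are assumptions made for this sketch, not details taken from our implementation.

```python
import numpy as np

def dqn_loss_and_targets(q_online, q_target, actions, rewards, dones, gamma=0.95):
    """Mean squared TD error for a batch (hypothetical helper).

    q_online: (B, A) array of Q(s, .; theta) for the current states
    q_target: (B, A) array of Q(s', .; theta^-) from the target network
    actions:  (B,) integer indices of the actions that were taken
    rewards:  (B,) rewards observed after those actions
    dones:    (B,) 1.0 where the dialogue terminated, else 0.0
    """
    # Bootstrapped target; assumption: no bootstrapping past a terminal turn.
    targets = rewards + gamma * (1.0 - dones) * q_target.max(axis=1)
    # Q-value of the action actually taken in each transition.
    q_taken = q_online[np.arange(len(actions)), actions]
    td_error = targets - q_taken
    loss = float(np.mean(td_error ** 2))
    return loss, targets

# Toy batch of two transitions; the second one ends the dialogue.
q_online = np.array([[1.0, 2.0], [0.5, 0.0]])
q_target = np.array([[0.0, 4.0], [3.0, 1.0]])
loss, targets = dqn_loss_and_targets(
    q_online, q_target,
    actions=np.array([1, 0]),
    rewards=np.array([-1.0, 40.0]),
    dones=np.array([0.0, 1.0]),
)
```

In a full agent the gradient of this loss with respect to θ would be obtained by automatic differentiation and passed to an optimizer such as RMSprop; the sketch only covers the target and loss computation.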

Planning
As mentioned above, planning is a process in which the agent uses a learned world model to improve the policy. In this paper, we refer to the learned world model as the planner, which aims to mimic the behavior of users. It takes the dialogue state $s \in \mathcal{S}$ and the agent action $a^A \in \mathcal{A}_{\text{agent}}$ as input and maps $(s, a^A)$ to the consequent user action $a^U \in \mathcal{A}_{\text{user}}$, a scalar reward $r \in \mathbb{R}$ and a binary termination signal $t \in \{0, 1\}$.
We formulate it as a multi-task learning problem with two classification tasks and one regression task. Several different modeling choices can be applied for this purpose, e.g. a convolutional neural network or a recurrent neural network; in this paper we simply use a multi-layer perceptron for the planner model.
Given a sample $(s_t, a^A_t, r_t, a^U_t, t_t)$ from the user experience buffer, the planner predicts the user action, the reward and the termination signal from $(s_t, a^A_t)$ with the multi-layer perceptron, whose parameters W and b are learned with gradient descent.
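The planner described above can be sketched as a small multi-task network with a shared hidden layer and three output heads. This is only an illustrative sketch: the tanh activation, the softmax and sigmoid heads, the unweighted sum of the three losses, the weight initialization and all function names are assumptions, since only the multi-layer perceptron structure and the three prediction tasks are specified in the text.

```python
import numpy as np

def init_planner(state_dim, n_agent_actions, n_user_actions, hidden=80, seed=0):
    """Create weights (W, b) for a shared layer and three heads (hypothetical layout)."""
    rng = np.random.default_rng(seed)

    def layer(n_in, n_out):
        return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)

    return {
        "h": layer(state_dim + n_agent_actions, hidden),
        "user_action": layer(hidden, n_user_actions),
        "reward": layer(hidden, 1),
        "term": layer(hidden, 1),
    }

def planner_forward(params, state, agent_action_onehot):
    """Predict user-action distribution, reward and termination probability."""
    x = np.concatenate([state, agent_action_onehot])
    W, b = params["h"]
    h = np.tanh(x @ W + b)                      # assumed activation
    W, b = params["user_action"]
    logits = h @ W + b
    logits -= logits.max()                      # numerical stability
    p_user = np.exp(logits) / np.exp(logits).sum()
    W, b = params["reward"]
    reward = float((h @ W + b)[0])              # regression head
    W, b = params["term"]
    p_term = float(1.0 / (1.0 + np.exp(-(h @ W + b)[0])))
    return p_user, reward, p_term

def planner_loss(p_user, reward, p_term, user_action, true_reward, terminated):
    """Assumed joint loss: cross-entropy + squared error + binary cross-entropy."""
    eps = 1e-12
    ce_action = -np.log(p_user[user_action] + eps)
    mse_reward = (reward - true_reward) ** 2
    bce_term = -(terminated * np.log(p_term + eps)
                 + (1 - terminated) * np.log(1.0 - p_term + eps))
    return float(ce_action + mse_reward + bce_term)

# Toy call with a 5-dimensional state, 3 agent actions and 4 user actions.
params = init_planner(state_dim=5, n_agent_actions=3, n_user_actions=4)
p_user, reward, p_term = planner_forward(params, np.zeros(5), np.eye(3)[1])
example_loss = planner_loss(p_user, reward, p_term,
                            user_action=2, true_reward=-1.0, terminated=0)
```

During planning, a user action would be drawn from `p_user` and the dialogue would end when the predicted termination probability crosses a threshold; both choices are left open here.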
The interaction procedure of the RL planner agent is as follows: for each dialogue session, users and the planner alternately interact with the policy model. First, users interact with the policy model and accumulate real dialogue experience tuples, which are then used to train both the policy model and the planner. Afterwards, the planner interacts with the policy model for k sessions, generating k times as much simulated dialogue experience, and these tuples are used to update the policy model. In this way, the RL planner agent combines learning and planning for dialogue policy learning. It makes better use of real dialogue experience, and it helps exploration of states not previously experienced, since the planner model is represented by neural networks that can generalize to unseen states and actions. A detailed summary of training the RL planner agent can be found in Algorithm 1.
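The alternation between learning and planning can be sketched as a simple loop. The code below is a schematic outline rather than a reproduction of Algorithm 1: the four callables and the buffer handling are hypothetical placeholders, and the toy example only counts transitions and updates.

```python
def train_with_planning(run_real_dialogue, run_simulated_dialogue,
                        update_policy, update_planner, n_epochs, k):
    """Alternate one real dialogue with k planner-simulated dialogues per epoch.

    All four callables are hypothetical hooks:
      run_real_dialogue()      -> list of transitions from a real user
      run_simulated_dialogue() -> list of transitions from the planner
      update_policy(batch)     -> one policy update on a list of transitions
      update_planner(buffer)   -> one planner update on the real buffer
    """
    real_buffer, sim_buffer = [], []
    for _ in range(n_epochs):
        # Learning: real experience trains both the policy and the planner.
        real = run_real_dialogue()
        real_buffer.extend(real)
        update_policy(real)
        update_planner(real_buffer)
        # Planning: k simulated sessions train the policy only.
        for _ in range(k):
            sim = run_simulated_dialogue()
            sim_buffer.extend(sim)
            update_policy(sim)
    return real_buffer, sim_buffer

# Toy run with stub dialogues: 3 epochs, k = 4 planning sessions per epoch.
policy_updates = []
real, sim = train_with_planning(
    run_real_dialogue=lambda: [("s0", "a0", -1, "s1", False),
                               ("s1", "a1", 40, None, True)],
    run_simulated_dialogue=lambda: [("s0", "a0", -1, "s1", False)],
    update_policy=policy_updates.append,
    update_planner=lambda buffer: None,
    n_epochs=3,
    k=4,
)
```

With k = 0 the inner loop never runs and the procedure reduces to plain model-free training on real experience.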

Simulation Experiments
In this work, we consider a task-completion dialogue system that assists users in booking movie tickets. To give a systematic analysis of the sample efficiency of the proposed RL planner agent, we conduct extensive experiments with simulated users, since such experiments are cheaper and repeatable.

Dataset
The raw conversational data for the movie-ticket booking task was collected via Amazon Mechanical Turk, with annotations provided by domain experts. In total, we have labeled 280 dialogues, and the average number of turns per dialogue is approximately 11. The annotated data includes 11 dialogue acts and 29 slots. Most of the slots are informable slots, which users can use to constrain the search, and some are requestable slots, whose values users can ask the agent for. For example, numberofpeople cannot be a requestable slot, since the user arguably knows how many tickets he or she wants to buy. The detailed data schema can be found in Appendix A.

User Simulator
We adapted a publicly available user simulator (Li et al., 2016b) to the task-completion dialogue setting with the dataset described in Section 4.1. During training, the simulator provides the agent with a reward signal at the end of the dialogue. A dialogue is considered successful only when a movie ticket is booked successfully and the information provided by the agent satisfies the user's constraints. At the end of each dialogue, the agent receives a positive reward of 2 · max_turn for success, or a negative reward of −max_turn for failure; in our experiments we set max_turn to 20. Furthermore, at each turn, the agent receives a reward of −1, so that shorter dialogues are encouraged. The details of the user simulator can be found in Appendix B.
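The reward scheme above can be written as a small function. The sketch below returns the undiscounted total reward of a finished dialogue; the function name and arguments are hypothetical, and it assumes that the per-turn penalty of −1 is applied on every turn, including the last one.

```python
def dialogue_return(booked, constraints_satisfied, num_turns, max_turn=20):
    """Undiscounted total reward of one finished dialogue (hypothetical helper)."""
    # Success requires both a booked ticket and satisfied user constraints.
    success = booked and constraints_satisfied
    terminal_reward = 2 * max_turn if success else -max_turn
    # Assumption: a penalty of -1 is charged on each of the num_turns turns.
    return terminal_reward - num_turns

short_success = dialogue_return(True, True, num_turns=10)    # 40 - 10
long_failure = dialogue_return(True, False, num_turns=20)    # -20 - 20
```

Under these assumptions a successful 10-turn dialogue scores 30, while a 20-turn dialogue that books a ticket violating the user's constraints scores −40.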

Agent Comparison
We benchmarked the proposed RL planner agents against their model-free baseline, the RL agent:
• The RL agent is a standard model-free reinforcement learning model (DQN), where a neural network is applied to approximate the Q-function.

Implementation
For all the agents, we set the hidden layer size to 80 for both the policy model and the planner. RMSprop was applied to optimize the parameters. We set the batch size to 16 for both the policy model and the planner. During training, we used the ε-greedy strategy for exploration, and γ was set to 0.95. For each simulation epoch, we simulated one dialogue and stored its real state transition tuples in an experience replay buffer. For RL planner (k) agents, k times as much simulated dialogue experience was additionally generated into the experience replay buffer. At the end of each simulation epoch, the policy model was updated with all the transition tuples in the buffer in a batch manner. The planner was trained only on real dialogue experience tuples. The experience replay strategy plays a critical role in the success of deep reinforcement learning. In all our experiments, before training, we use a rule-based agent to run N (N = 100) dialogues to fill the experience replay buffer. This procedure can be viewed as an implicit form of imitation learning that initializes the RL agent and/or the planner. Afterwards, the agent accumulates all the state transition tuples and flushes the replay buffer once it reaches a success rate threshold, e.g. the performance of the rule-based agent.
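For readers unfamiliar with ε-greedy exploration, a minimal sketch is given below. The function name and the example Q-values are illustrative assumptions; the value of ε used in our experiments is not restated here.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        # Explore: uniform choice over all action indices.
        return rng.randrange(len(q_values))
    # Exploit: index of the largest Q-value (first one on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])

greedy_action = epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.0)
random_action = epsilon_greedy([0.1, 0.7, 0.3], epsilon=1.0)
```

With ε = 0 the agent always takes the action with the highest estimated Q-value, and with ε = 1 it acts uniformly at random; intermediate values trade off exploration against exploitation.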

Experiments with Planners
To investigate the room for improvement of the proposed RL planner (k) agent, we use two reference points: the baseline RL agent, trained on the real experience replay buffer, serves as the lower bound of performance, and the RL planner-ground (k) agent, trained with an extra k times as much real dialogue experience in the replay buffer, can be viewed as the upper bound. Figure 3 shows the performance curves of the three agents with planning steps k = 9. The large gap between the RL agent and the RL planner-ground (k = 9) agent shows that the RL agent needs to acquire more (k = 9 times) real dialogue examples to close this gap, whereas in reality it is quite costly to obtain so many real dialogue experiences. The proposed RL planner (k = 9) agent, however, can make up much of this gap with the same number of real dialogue examples as the RL agent, because it provides an extra 9 times as much simulated dialogue experience to train the policy model. It significantly improves the sample efficiency of the model and moves close to the upper bound set by the RL planner-ground (k = 9) agent.

Experiments with Boosted Planners
One problem with the proposed RL planner agent is that the world model of the planner is far from a perfect model of the real world at the early stage of learning, so it may generate many poor experiences that misguide the policy model of the agent. To mitigate this issue, the planner model can be pre-trained on a labeled dataset and then trained further through on-line interaction with users, so that it adapts to user behaviors. In our experiments, we call this agent the RL boosted-planner agent.
Figure 4 shows the performance curves of four agents with different planner models. The RL agent has no planner model; the planner model of the RL planner agent is learned from scratch; the RL boosted-planner agents are equipped with a pre-trained planner model. The planner model of the RL boosted-planner (k = 9, no train) agent is kept fixed during the interaction with users, while the planner model of the RL boosted-planner (k = 9) agent is trained to adapt to the users during the interactions. The RL planner agent and all RL boosted-planner agents use k = 9 planning steps.
We find that the RL boosted-planner agent learns faster than the RL planner agent. A further observation is that, although the RL boosted-planner (k = 9, no train) agent learns faster than the RL planner (k = 9) agent at the early stage, it cannot adapt to user behaviors, since its planner model is not trained during the interactions with users; as a result, its performance cannot reach the level of the RL planner (k = 9) agent. Furthermore, we investigate the impact of the number of planning steps k on sample efficiency.
Figure 5 shows the performance curves of four agents with planning steps k = 0, 1, 4, 9, where the k = 0 case involves no planning and is trained only on the real dialogue experience replay buffer, i.e. it is the baseline RL agent. There are clear gaps among the different planning steps: with a larger k, the agent needs fewer dialogue examples to reach a fixed success rate (for example 60%) at the early stage and is thus much more sample-efficient, which is an important benefit for on-line human-in-the-loop dialogue policy learning. However, a much larger k (i.e., k = 14 in our experiment) increases the amount of internal simulation and also risks generating many poor examples when the planner model is not yet good enough, which may lead the policy model to a local minimum.
Table 1 shows the test performance of five agents chosen at three different training epochs, {100, 200, 300}; each number was averaged over 5 runs, and each run generated 2000 simulated dialogues. At these training epochs, the RL planner agent is better than the RL agent, the RL boosted-planner agent is better than the RL planner agent, and for the same agent, a larger number of planning steps k may help. Appendix C shows one sample dialogue generated by the planner of the RL boosted-planner (k = 4) agent interacting with its policy model.

Human-in-the-loop Experiments
To further verify the sample efficiency of the proposed RL planner agent in a real-world setting, we compare it against its model-free counterpart (the RL agent) in an on-line human experiment: dialogue policy learning with real users. We only benchmarked three agents: RL, RL planner (k = 4), and RL boosted-planner (k = 4).

Experiment Settings and Results
In the human experiments, one agent was randomly selected to interact with a user; to avoid systematic bias, the user was not aware of which agent was selected. At the beginning of each dialogue session, the user was presented with a goal sampled from the user-goal corpus and was then instructed to converse with the agent to complete the given task. Over the course of the conversation, the dialogue agent gathers information about the user's desires and ultimately books the movie tickets. The environment then assesses a binary outcome (success or failure) at the end of the conversation, based on (1) whether a movie is booked, and (2) whether the movie satisfies the user's constraints. All three of the above agents were trained with real users recruited from the authors' affiliation; each curve was averaged over 2 runs, and each run was trained with 150 dialogues. In total, we collected 900 dialogue sessions. For all the agents, at the beginning of each training epoch, the user interacts with the agent to generate one real training dialogue example; for the RL planner (k) agent and the RL boosted-planner (k) agent, the planner model then rolls out k times as many simulated dialogue examples into the experience replay buffer, and the agents are trained on the experience buffer for one epoch. Figure 6 shows the learning curves of these agents trained with real users in terms of success rate. We find that:
• The RL planner agent learns much faster than the RL agent, indicating that the RL planner agent is much more sample-efficient and significantly reduces the number of dialogues needed to train a good dialogue agent in the real world.
• The RL boosted-planner agent is better than the RL planner agent, suggesting that a pre-trained world model can further accelerate the training of the RL planner agent.

Conclusion
In this paper, we propose to integrate planning into dialogue policy learning based on the Dyna-Q framework, and present an RL planner agent that consists of a planner and a policy learner. The planner is trained on limited real user experience and endows the agent with a learned simulated user, which can plan multiple steps ahead, reason about the future, and generate large amounts of simulated experience to supplement the limited real user experience, avoiding long sequences of trial-and-error in the real environment. The policy learner then trains a dialogue policy model on these hybrid experiences. Our human-in-the-loop experiments show that this data-efficient agent makes it possible to directly learn a dialogue policy via RL with real users in an on-line fashion, and that it can be deployed in real-world dialogue scenarios.
Through extensive analysis under simulation, we find that a boosted planner model and a reasonably large number of planning steps k can further improve the sample efficiency of the agent. These promising results suggest several interesting directions for future research. First, the efficiency of on-line human-in-the-loop dialogue policy learning for a single-domain task-completion dialogue system motivates us to investigate its effectiveness in a more complex, multi-domain task-completion dialogue setting, where the action and state spaces are much larger, which may complicate the planner model. Second, it would be valuable to explore planning within hierarchical reinforcement learning algorithms. Third, the boosted planner model demonstrates a strong ability to adapt dialogue policy learning to different users, which motivates us to systematically investigate its use for dialogue personalization.

Figure 1: Illustration of the Task-Completion dialogue system with planner

Figure 3: Learning curves of the RL planner (k = 9) agent with its ground-truth RL planner-ground (k = 9) agent, and RL agent.

Figure 6: Human-in-the-loop Dialogue Policy Learning Curves for three different agents: x-axis is the number of training epochs.