Learning Efficient Dialogue Policy from Demonstrations through Shaping

Training a task-oriented dialogue agent with reinforcement learning is prohibitively expensive, since it requires a large volume of interactions with users. Human demonstrations can be used to accelerate learning, but how to effectively leverage demonstrations to learn dialogue policy remains underexplored. In this paper, we present S²Agent, which efficiently learns dialogue policy from demonstrations through policy shaping and reward shaping. We use an imitation model to distill knowledge from demonstrations; based on this model, policy shaping estimates feedback on how the agent should act in policy space. Reward shaping is then incorporated to explicitly bonus state-actions similar to the demonstrations in value space, encouraging better exploration. The effectiveness of the proposed S²Agent is demonstrated in three dialogue domains and a challenging domain adaptation task, with both user simulator evaluation and human evaluation.


Introduction
With the flourishing of conversational assistants in daily life (such as Google Assistant, Amazon Alexa, Apple Siri, and Microsoft Cortana), task-oriented dialogue systems that can serve users on certain tasks have attracted increasing research effort. Dialogue policy optimization is one of the most critical tasks in dialogue modeling. The most straightforward approach is the rule-based method, which relies on a set of expert-defined rules for dialogue modeling. Though rule-based dialogue systems perform reasonably in some scenarios, handcrafting such rules is time-consuming and does not scale.
Recently, dialogue policy learning has been formulated as a reinforcement learning (RL) problem and tackled with deep RL models (Lipton et al., 2018; Peng et al., 2017). RL-based methods have shown great potential for building robust dialogue systems automatically. However, due to their interactive nature, RL-based agents demand an environment to operate in. As illustrated in Figure 1, RL-based dialogue agents need to interact with human users and update their policy in an online fashion, requiring the agents to have good online performance from the start of training. In addition, one of the biggest challenges of RL approaches is the reward sparsity issue, which makes exploration in large action spaces inefficient. As a consequence, training RL-based agents requires a prohibitively large number of interactions to achieve acceptable performance, which may incur a significant amount of expense (Pietquin et al., 2011; Peng et al., 2018b). Several attempts have been made to improve learning efficiency and tackle reward sparsity. Different types of heuristics have been proposed in the form of intrinsic rewards to guide exploration more efficiently (Mohamed and Rezende, 2015; Peng et al., 2017, 2018a).
When building a dialogue system, it is typically affordable to recruit experts to gather some demonstrations of the expected agent behavior. We therefore aim to address the aforementioned challenges from a different perspective and assume access to human-provided demonstrations. In this paper, we investigate how to efficiently leverage these demonstrations to alleviate reward sparsity and improve policy learning quality. Previous work used a simple technique termed Replay Buffer Spiking (RBS) to pre-fill the experience replay buffer with human demonstrations, which yields good performance, especially at the beginning of training. (Hester et al., 2018) proposed Deep Q-learning from Demonstrations (DQfD), which combines temporal difference updates with a supervised classification loss on the actions in demonstrations to improve learning efficiency in gaming domains. However, whether and how human demonstrations can be leveraged effectively in dialogue scenarios remains underexplored.
Hence, in this paper, we propose a new strategy for leveraging human demonstrations to learn dialogue policy efficiently. Our dialogue agent, termed S²Agent (Agent with policy Shaping and reward Shaping), learns dialogue policy from demonstrations through policy shaping and reward shaping. Policy shaping (Griffith et al., 2013) incorporates human feedback to advise the policy to behave like experts: it estimates feedback on a state-action pair from human demonstrations and utilizes that feedback to reconcile the policy of any RL-based agent. This method speeds up learning in gaming domains but has not yet been studied in dialogue, and directly applying policy shaping to dialogue faces several challenges. The original policy shaping uses a tabular method to estimate feedback, which limits its feasibility for complex problems like dialogue with large state-action representations. To deal with this issue, we propose to estimate feedback with deep neural networks, which represent the state-action space with function approximation and distill knowledge from human demonstrations. In addition, policy shaping calibrates the agent's behavior in policy space and is inherently not designed to tackle reward sparsity. Considering this, we further introduce reward shaping to bonus state-action pairs that are similar to the demonstrations; it can be viewed as a shaping mechanism explicitly in value space that guides policy exploration towards actions human experts would likely take. Our contributions in this work are two-fold: • We propose a novel S²Agent that can effectively leverage human demonstrations to improve learning efficiency and quality through policy shaping and reward shaping.
• We experimentally show that S²Agent can efficiently learn a good policy with limited demonstrations on three single-domain dialogue tasks and a challenging domain adaptation task, using both simulator and human evaluations.
Related Work

Dialogue policy learning. Deep reinforcement learning (RL) methods have shown great potential in building robust dialogue systems automatically (Young et al., 2013; Su et al., 2016; Williams et al., 2017; Peng et al., 2017, 2018a; Lipton et al., 2018; Li et al., 2020; Lee et al., 2019). However, RL-based approaches are rarely used in real-world applications, since these algorithms often require too many experiences for learning due to sparse and uninformative rewards. Much progress has been made towards mitigating this sample complexity problem by incorporating prior knowledge. (Su et al., 2017) utilize a corpus of demonstrations to pre-train RL-based models, accelerating learning from scratch. (Chen et al., 2017b) attempt to accelerate RL-based agents by introducing extra rewards from a virtual rule-based teacher; however, this requires extra effort to design a rule-based dialogue manager. (Hester et al., 2018) improve RL learning by combining demonstration, temporal difference (TD), supervised, and regularization losses. (Chen et al., 2017a) introduced a similar approach, called companion teaching, to incorporate human teacher feedback into policy learning. Nevertheless, companion teaching assumes a human teacher who directly gives the correct action during the policy learning process, while also training an action prediction model for reward shaping based on human feedback.
Policy shaping. Policy shaping is an algorithm for introducing prior knowledge into policy learning. (Griffith et al., 2013) formulate human feedback on the actions of an agent policy as policy feedback and propose the Advise algorithm to estimate the human's Bayes optimal feedback policy and combine it with the agent's policy, showing significant improvements in two gaming environments. (Misra et al., 2018) use policy shaping to bias a search procedure towards semantic parses that are more compatible with the text, achieving excellent performance.
Reward shaping. Reward shaping leverages prior knowledge to provide the learning agent with an extra intermediate reward F in addition to the environmental reward R, making the system learn from the composite signal R + F (Ng et al., 1999). However, an MDP with reward shaping is not guaranteed to have an optimal policy identical to that of the original problem unless the shaping is potential-based (Ng et al., 1999; Marthi, 2007). (Su et al., 2015) propose using RNNs to predict turn-level rewards and using the predictions as informative reward shaping potentials. (Peng et al., 2018a) use inverse reinforcement learning to recover reward functions from demonstrations for reward shaping. However, the rewards estimated by these methods inevitably contain noise and fail to conform to a potential-based reward function that would guarantee the optimal policy. Inspired by (Brys et al., 2015), we instead estimate a potential-based reward function directly from the demonstrations.

Approach
Our S²Agent is illustrated in Figure 1 and consists of four modules: 1) a dialogue policy model, which selects the best next action based on the current dialogue state; 2) an imitation model, formulated as a classifier that takes dialogue states as input and predicts the associated dialogue actions, aiming to distill behaviors from human demonstrations; 3) a policy shaping module, which provides feedback on how the policy should behave like the demonstrations and reconciles a final action from the actions of the policy model and the imitation model, attempting to generate more reliable exploration trajectories; and 4) a reward shaping module, which encourages demonstration-like state-actions by providing extra intrinsic reward signals.

Policy Model
We consider dialogue policy learning as a Markov Decision Process (MDP) and improve the policy with a Deep Q-Network (DQN) (Mnih et al., 2015). In each turn, the agent observes the dialogue state s and executes an action a with ε-greedy exploration, which selects a random action with probability ε and otherwise adopts the greedy policy a = argmax_{a'} Q(s, a'; θ), where Q(s, a'; θ) approximates the value function, implemented as a multi-layer perceptron (MLP) parameterized by θ.
The agent then receives the reward r, perceives the next user response a_u, and updates the state to s'. The tuple (s, a, r, s') is stored in the experience replay buffer D_a. This loop continues until the dialogue terminates. The parameters of Q(s, a; θ) are updated by minimizing the following squared loss with stochastic gradient descent:

L(θ) = E_{(s,a,r,s') ∼ D_a} [ (r + γ max_{a'} Q'(s', a'; θ') − Q(s, a; θ))² ]   (1)

where γ ∈ [0, 1] is a discount factor and Q'(·) is the target value function, which is only periodically updated (line 26 in Algorithm 1). Differentiating the loss function with respect to θ, we derive the following gradient:

∇_θ L(θ) = E_{(s,a,r,s') ∼ D_a} [ (r + γ max_{a'} Q'(s', a'; θ') − Q(s, a; θ)) ∇_θ Q(s, a; θ) ]   (2)

As shown in lines 25-26 of Algorithm 1, in each iteration we update Q(·) using minibatch Deep Q-learning.
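As a concrete illustration, the Q-learning update above can be sketched as follows. This is a minimal NumPy sketch that substitutes a linear Q-function for the paper's MLP; the learning rate and dimensions are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def q_values(W, s):
    """Linear Q-function Q(s, .) = W @ s, a stand-in for the paper's MLP."""
    return W @ s

def dqn_step(W, W_target, batch, gamma=0.9, lr=0.01):
    """One Q-learning step over (s, a, r, s_next, done) tuples.

    Descends the squared TD loss (r + gamma * max_a' Q'(s', a') - Q(s, a))^2;
    for the linear Q, the gradient w.r.t. row a of W is td_error * s.
    """
    for s, a, r, s_next, done in batch:
        target = r if done else r + gamma * np.max(q_values(W_target, s_next))
        td_error = target - q_values(W, s)[a]
        W[a] += lr * td_error * s  # gradient step on the selected action's row
    return W
```

With the target network frozen, repeated calls move Q(s, a) toward the bootstrapped target, mirroring the minibatch update in lines 25-26 of Algorithm 1.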

Imitation Model
We assume access to a corpus of human-human dialogues, either from log files or provided by recruited experts, which we term human demonstrations D_e. D_e consists of a set of state-action pairs [(s_1, a_1), (s_2, a_2), ..., (s_n, a_n)]. In theory, if D_e were large enough to cover all possible states, the agent could respond perfectly by looking up the corresponding action in D_e. In practice, however, D_e is usually limited and cannot cover all states. Hence, we propose a supervised learning model (denoted the Imitation Model) to parameterize the relation between states and actions, expecting it to generalize to unseen states. We formulate the task as a classification problem: the model takes a dialogue state s_i as input and is trained with a cross-entropy loss between the demonstrated action a_i and the predicted action a. Multiple models, such as RNNs or CNNs, could be used for this purpose, but for simplicity we choose an MLP.
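A minimal sketch of such an imitation model is shown below: a hand-rolled NumPy MLP with one tanh hidden layer of 50 units (matching the implementation details reported later), trained by SGD on the cross-entropy loss. The learning rate and weight initialization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class ImitationModel:
    """MLP classifier mapping dialogue states to demonstrated actions."""

    def __init__(self, state_dim, n_actions, hidden=50, lr=0.1):
        self.W1 = rng.normal(0, 0.1, (hidden, state_dim))
        self.W2 = rng.normal(0, 0.1, (n_actions, hidden))
        self.lr = lr

    def policy(self, s):
        """pi_e(a|s): softmax over the action logits."""
        h = np.tanh(self.W1 @ s)
        z = self.W2 @ h
        e = np.exp(z - z.max())
        return e / e.sum()

    def train_step(self, s, a):
        """One SGD step on the cross-entropy loss -log pi_e(a|s)."""
        h = np.tanh(self.W1 @ s)
        p = self.policy(s)
        grad_z = p.copy()
        grad_z[a] -= 1.0                     # dL/dz for softmax cross-entropy
        grad_h = self.W2.T @ grad_z          # backprop into the hidden layer
        self.W2 -= self.lr * np.outer(grad_z, h)
        self.W1 -= self.lr * np.outer(grad_h * (1 - h**2), s)
```

Trained on the (state, action) pairs in D_e, `policy(s)` yields the distribution π_e(a|s) that the policy shaping module samples from.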

Policy Shaping
Incorporating human feedback into RL can accelerate learning (Griffith et al., 2013; Cederborg et al., 2015). Policy shaping is a representative approach that estimates the human's Bayes optimal feedback policy and then combines it with the policy of an underlying RL model. The feedback policy is computed as:

π_f(a|s) = C^{Δ_{s,a}} / (C^{Δ_{s,a}} + (1 − C)^{Δ_{s,a}})   (3)

where Δ_{s,a} is the difference between the number of positive and negative feedback signals, derived from the number of occurrences of (s, a) in the human demonstrations, and C is the probability that feedback from the demonstrations is consistent (a parameter controlling the assumed noise in the demonstrations). For example, C = 0.7 means that with probability 0.7 the feedback from the demonstrations is considered reliable; if C = 0.5, policy shaping is meaningless since it treats every action equally. However, Δ_{s,a} is difficult to estimate from demonstrations in dialogue scenarios, since the state and action spaces are large and sparse. To deal with this issue, we propose to use the aforementioned Imitation Model to estimate feedback from demonstrations. Specifically, we sample N times from the imitation model policy π_e(a|s) to form a committee a_1, a_2, ..., a_N denoting N votes. We then count the votes for each action to obtain c_a as the positive feedback from the human demonstrations, and use the expectation of the binomial distribution, N(1 − C), as the number of negative feedback signals. Thus, in dialogue, we use:

Δ_{s,a} = c_a − N(1 − C)   (4)

Finally, the policy is reconciled from the policy model and the imitation model by multiplying them together:

π(a|s) = π_a(a|s) π_e(a|s) / Σ_{a'} π_a(a'|s) π_e(a'|s)   (5)
Policy shaping operates in the policy space and can be viewed as a mechanism that biases the agent's learning towards the policy distilled from the demonstrations, improving learning efficiency. The reconciled policy in Eq. 5 allows the underlying RL model to surpass the imitation model π_e.
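The feedback estimation and reconciliation steps can be sketched as follows. This is a simplified illustration of Eqs. 3-5; the committee size and C value used below are examples, not the paper's exact implementation.

```python
import numpy as np

def feedback_policy(votes, n_actions, C=0.7):
    """Advise-style feedback from N committee votes sampled from pi_e.

    counts[a] plays the role of c_a (positive feedback); N * (1 - C) is the
    expected negative feedback, giving Delta_{s,a} = c_a - N * (1 - C), and
    the feedback probability C^Delta / (C^Delta + (1 - C)^Delta).
    """
    N = len(votes)
    counts = np.bincount(votes, minlength=n_actions)
    delta = counts - N * (1 - C)
    return C**delta / (C**delta + (1 - C)**delta)

def reconcile(pi_agent, pi_feedback):
    """Combine the RL policy with the feedback policy by elementwise
    multiplication and renormalization."""
    p = pi_agent * pi_feedback
    return p / p.sum()
```

Actions heavily voted for by the imitation-model committee get feedback probabilities near 1, so the reconciled distribution is pulled toward expert-like actions while still reflecting the RL policy.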
Algorithm 1: S²Agent learning algorithm
1: Initialize experience replay buffer D_a as empty.
2: Initialize Q_θ(s, a) and target network Q_θ'(s, a) with θ' = θ.
3: Initialize demonstration buffer D_e with human conversation data; train the imitation model on D_e and load π_e(a|s).
4: for n = 1 to N do
5:     the user starts a dialogue with user action a_u.
6:     initialize dialogue state s.
7:     while s is not terminal do
8:         with probability ε select a random action a.
9:         otherwise select a = argmax_a Q(s, a; θ).
10:        # policy shaping starts
11:        count the occurrences of each committee action and compute Δ_{s,a} with Eq. 4.
12:        obtain the shaped action distribution from the policy shaper following Eq. 3.
13:        reconcile the final action distribution as in Eq. 5 and sample action a.
14:        # policy shaping ends

Reward Shaping
Most reward functions in dialogue scenarios are manually defined: typically, a -1 for each turn, plus a large positive or negative reward at the end of a session indicating whether the dialogue succeeded. Such sparse rewards are one reason RL agents learn inefficiently: initially, the agents are forced to explore state-actions uniformly at random. To this end, we propose reward shaping to integrate priors into RL learning and alleviate reward sparsity.
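For concreteness, such a handcrafted sparse reward might look like the following sketch. The terminal bonus and penalty magnitudes here are illustrative assumptions, not values specified in the paper.

```python
def handcrafted_reward(done, success, success_bonus=40, failure_penalty=-20):
    """Sparse, manually defined reward: -1 per turn, plus a large
    terminal bonus (success) or penalty (failure) at the end of a session.
    The magnitudes are illustrative, not the paper's values."""
    r = -1  # per-turn penalty encourages shorter dialogues
    if done:
        r += success_bonus if success else failure_penalty
    return r
```

Every intermediate turn yields the same uninformative -1, which is exactly the sparsity that reward shaping is meant to mitigate.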
Reward shaping is a popular method for integrating prior knowledge into the reward function to improve policy exploration (Brys et al., 2015). It provides the learning agent with an extra, intermediate, task-related reward that enriches the original reward signal:

R'(s, a) = R(s, a) + F_D(s, a, s', a')   (6)

where F_D denotes the reward derived from demonstrations. However, modifying the reward function may change the original MDP and make the agent converge to a suboptimal point. (Wiewiora et al., 2003) proved that the MDP remains unchanged and the convergence property is maintained if F_D(·) is defined as:

F_D(s, a, s', a') = γ φ_D(s', a') − φ_D(s, a)   (7)

where φ_D(s, a) is a potential function over state-action pairs. Its definition is intuitive: we bonus policy paths that are consistent with the demonstrations. As such, φ_D(s, a) should be high when action a is demonstrated in a state s_d similar to s, and close to 0 when s is completely different from every such s_d. To achieve this, a multivariate Gaussian is used to compute the similarity between state-action pairs.
g(s, s_d) = exp(−½ (s − s_d)ᵀ Σ⁻¹ (s − s_d))   (8)

We search through the demonstrations to obtain the sample with the highest similarity:

φ_D(s, a) = max_{(s_d, a) ∈ D_e} g(s, s_d)   (9)

Using reward shaping to learn the policy has several advantages. It leverages demonstrations to bonus state-actions similar to the demonstrations, and the resulting reward is more informative and demonstration-guided than the hand-defined reward, which mitigates the reward sparsity issue to some degree.
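A minimal sketch of this demonstration-based potential and the resulting shaping reward follows. For simplicity it assumes a diagonal covariance Σ = σ²I, which is a simplifying assumption rather than the paper's exact choice.

```python
import numpy as np

def potential(s, a, demos, sigma=1.0):
    """phi_D(s, a): highest Gaussian similarity between state s and any
    demonstrated state in which action a was taken. demos is a list of
    (state_vector, action) pairs; sigma^2 * I stands in for Sigma."""
    sims = [np.exp(-0.5 * np.sum((s - s_d) ** 2) / sigma**2)
            for s_d, a_d in demos if a_d == a]
    return max(sims, default=0.0)  # 0 when action a is never demonstrated

def shaped_bonus(s, a, s_next, a_next, demos, gamma=1.0):
    """Potential-based shaping reward F_D = gamma * phi(s', a') - phi(s, a),
    which leaves the optimal policy unchanged (Wiewiora et al., 2003)."""
    return gamma * potential(s_next, a_next, demos) - potential(s, a, demos)
```

Transitions that move toward demonstrated state-action pairs receive a positive bonus, while the potential-based form keeps the shaped MDP's optimal policy identical to the original.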

Experiments and Results
We evaluate the proposed S²Agent with a user simulator on several public task-oriented datasets, including movie ticket booking, restaurant reservation, and taxi reservation. Additionally, to assess the generalization capability of the shaping mechanisms, we conduct domain adaptation experiments. Finally, human evaluation results are reported.

Dataset
The raw conversation data for the movie ticket booking task were collected through Amazon Mechanical Turk, and the data for the restaurant reservation and taxi calling scenarios are provided by the Microsoft Dialogue Challenge. The three datasets have been manually labeled based on a schema defined by domain experts. We extended and annotated the movie booking task with a payment scenario to simulate extending the dialogue system with new slots and values. All datasets contain 11 intents. The movie dataset contains 13 slots, and the other three contain 29 slots. Detailed information about the intents and slots is provided in Appendix A, Table 3.

Baseline Agents
To benchmark the performance of the shaping mechanism, we developed the following task-completion dialogue agents for comparison: • Imitation Model (IM) agent, implemented as a multi-layer perceptron and trained on the human demonstration data to predict actions given dialogue states.
• DQN agent is learned with Deep Q-Network.
• EAPC: Teaching via Example Action with Predicted Critique (Chen et al., 2017a) leverages real-time human demonstrations to improve policy learning. EAPC assumes the existence of human teachers during the learning process: it receives example actions from human teachers and, in the meantime, trains an action prediction model on the example actions as a critic for turn-level reward shaping. Since human teachers are not available in our case, we implement EAPC in the absence of teachers but use the same amount of human demonstrations to train a weak action prediction model. If the predicted action is identical to the action given by the policy model, the agent receives an extra positive reward, otherwise an extra negative reward. This method can be viewed as a variant of S²Agent with only reward shaping, using noisy reward estimates from the imitation model.
• DQfD (Hester et al., 2018) adds a supervised classification loss on human demonstrations to DQN, to ensure that the agent predicts the correct actions on human-demonstrated states. In the early learning phase, DQfD is trained only on the demonstrations to obtain a policy that mimics the human. Then, accumulated experiences mixed with the demonstrations are used to train DQfD.
• S²Agent is our proposed agent, trained with both policy shaping and reward shaping, as described in Algorithm 1.
• S²Agent w/o rs is a variant of S²Agent that learns policy with only policy shaping to reconcile the final action.
• S²Agent w/o ps is a variant of S²Agent that has only reward shaping to bonus state-actions similar to demonstrations.

Implementation Details

Imitation model agents for all domains are single-layer MLPs with 50 hidden units and tanh as the activation function. The IM agent is also used in policy shaping to reconcile the policy. All RL-based agents (DQN, DQfD, S²Agent) are MLPs with tanh activations; each policy network Q(·) has one hidden layer with 60 hidden nodes. All agents are trained with the same set of hyper-parameters. ε-greedy is utilized for policy exploration, and we set the discount factor γ = 0.9. The target network is updated at the end of each epoch. To mitigate warm-up issues, we build a naive but occasionally successful rule-based agent to provide experiences at the beginning of training. For a fair comparison, we pre-fill the experience replay buffer D_a with human demonstrations for all agent variants. The confidence factor C used in policy shaping is set to 0.7, and γ in Eq. 7 for reward shaping is set to 1.

User Simulator
Training RL-based dialogue agents requires an environment to interact with, and usually a large volume of interactions to achieve good performance, which is not affordable with real users. It is thus common practice to employ a user simulator to train RL-based agents (Jain et al., 2018; Schatzmann et al., 2007). We adopt a publicly available agenda-based user simulator for our experiment setup. During training, the simulator provides the agent with responses and rewards. The reward is defined as -1 for each turn, to encourage shorter dialogues, plus a large positive or negative reward at the end of a session depending on whether the dialogue succeeds. In addition to the success rate, the average number of turns and the average reward are also reported to evaluate each model.

Simulator Evaluation
Main Results. The main simulation results are shown in Table 1 and Figures 2, 3, and 4. With the shaping mechanisms, S²Agent learns much faster and performs consistently better than DQN and DQfD in all domains by a statistically significant margin. Figure 2 shows the learning curves of the different agents in each domain. First, the DQN agent performs better than the IM agent, which is not surprising since it interacts with the simulator and is optimized to solve user goals. The DQfD and EAPC agents leverage human demonstrations to mitigate reward sparsity, and their performance is consistently better than DQN's. S²Agent w/o ps uses reward shaping to alleviate reward sparsity by granting additional rewards for states consistent with the demonstrations; as a consequence, it performs better than DQN in all domains. Though EAPC has a similar reward shaping mechanism, its reward estimation relies heavily on the quality of the action prediction model, so EAPC performs slightly worse than S²Agent w/o ps. Policy shaping, in turn, reconciles the agent's actions with knowledge learned from human demonstrations, biasing the agent to explore the actions a human expert would take. As shown in Figure 2, S²Agent w/o rs learns the dialogue policy much faster than all baselines: in the Movie domain, it reaches nearly a 60% success rate in only 20 epochs, while the second-best agent, DQfD, achieves only a 20% success rate at epoch 20. Similar results are observed in the Restaurant and Taxi domains. Integrating both policy shaping and reward shaping, S²Agent achieves the best performance and is the most data-efficient; for example, in the Taxi domain it reaches approximately a 60% success rate at epoch 50, while the closest competitor reaches only around 40%.
The above observations also confirm that policy shaping and reward shaping operate in different dimensions (policy shaping improves learning by calibrating directly in the action space, reward shaping in the value space) and are mutually complementary. Note that the improvement from combining policy shaping and reward shaping in the Movie domain is not as significant as in Restaurant and Taxi. This is to a large degree attributable to the increased complexity of the Restaurant and Taxi datasets, which have more than twice as many slots as the Movie dataset; their state-action spaces are therefore much larger, posing greater challenges for exploration. In this situation, policy shaping and reward shaping benefit S²Agent to a large extent.
Results of training with a varying number of demonstrations. Intuitively, the number of human demonstrations has a large impact on policy learning. The imitation model agent might summarize a good expert policy when a large volume of human demonstrations is available; however, we hope the shaping mechanism can improve learning efficiency for RL-based agents even with limited human demonstrations. We therefore experiment with demonstration set sizes between 25 and 125 to assess the effect of the number of human demonstrations on learning efficiency and quality. Figure 3 shows the average performance of each agent during learning, which indicates learning speed and quality. Our proposed shaping mechanisms improve policy learning speed and quality and are robust to the number of demonstrations. Even with as few as 25 human demonstrations, S²Agent achieves a 5% higher success rate than DQfD and EAPC in the Movie domain and 10% higher in the Taxi domain. As the number of demonstrations increases, the gap between DQfD and S²Agent widens, showing that the shaping mechanisms continue to benefit from additional human demonstrations.
Results of domain extension. Typically, RL-based agents are built with a fixed ontology. However, a dialogue system should be able to evolve as it is used, handling new intents, slots, and unanticipated actions from users. To assess the ability to quickly adapt to a new environment, we extend the existing movie user simulator, denoted Movie-Ext, to simulate a domain adaptation scenario. Movie-Ext has an additional payment task requiring the agent to converse with users to first book a ticket and then complete the payment. Details about the extended intents and slots can be found in Appendix Table 3. All agents are continually optimized from the agents previously trained on the movie ticket booking task. Meanwhile, we additionally collect a small number of human demonstrations to update the IM agent. Figure 4 shows the learning curves of the different agents on the extended task. Both S²Agent and S²Agent w/o rs quickly adapt to the new environment and outperform the IM agent, achieving around a 50% success rate within only 150 epochs. Though DQfD explicitly leverages human demonstrations, it still lags behind S²Agent w/o rs, showing that shaping in the policy space is more effective than solely adding a supervised learning loss to Q-learning. Reward shaping also helps DQN explore a better policy. These observations confirm that S²Agent, with its shaping mechanisms, is capable of quickly adapting to a new environment.

Human Evaluation
User simulators do not necessarily reflect the complexity of human users (Dhingra et al., 2017). To further evaluate the feasibility of S²Agent in real scenarios, we deploy the agents in Table 1 to interact with real human users in the Movie and Movie-Ext domains. All evaluated agents are trained for 50 epochs (Movie) and 200 epochs (Movie-Ext). In each dialogue session, one of the agents is randomly selected to converse with a human user. Each user is assigned a goal sampled from the corpus and instructed to converse with the agent to complete the task. Users may terminate the task and end the session at any time if they believe the dialogue is unlikely to succeed, or simply because the agent repeats itself for several turns; such sessions are considered failures. At the end of each session, users give explicit feedback on whether the dialogue succeeded (i.e., whether the movie tickets were booked, and paid for, with all the user constraints satisfied). Additionally, users rate the quality/naturalness of the session on a scale from 1 (worst) to 5 (best). We collect 50 dialogue sessions for each agent. The results are listed in Table 2. S²Agent and S²Agent w/o rs perform consistently better than DQN and DQfD, which is consistent with our observations in the simulation evaluation. In addition, S²Agent achieves the best performance in terms of both success rate and user rating.

Conclusion
In this paper, we present a new strategy for learning dialogue policy from human demonstrations. Compared with previous work, our proposed S²Agent learns in a more efficient manner. Through policy shaping, S²Agent leverages knowledge distilled from the demonstrations to calibrate the actions of the underlying RL agent towards better trajectories, and through reward shaping it obtains extra rewards for state-actions similar to the demonstrations, alleviating reward sparsity for better exploration. The results of simulation and human evaluation show that the proposed agent is efficient and effective in both single-domain settings and a challenging domain adaptation setting.

Table 3 lists all annotated dialogue acts and slots in detail. Table 4 lists the training results of the Imitation Model on all datasets.