Agent-Aware Dropout DQN for Safe and Efficient On-line Dialogue Policy Learning

Hand-crafted rules and reinforcement learning (RL) are two popular choices for obtaining a dialogue policy. A rule-based policy is often reliable within a predefined scope but is not self-adaptable, whereas RL can evolve with data but often suffers from poor initial performance. We employ a companion learning framework to integrate the two approaches for on-line dialogue policy learning, in which a pre-defined rule-based policy acts as a “teacher” and guides a data-driven RL system by giving example actions as well as additional rewards. A novel agent-aware dropout Deep Q-Network (AAD-DQN) is proposed to address the problem of when to consult the teacher and how to learn from the teacher’s experiences. AAD-DQN, as a data-driven student policy, provides (1) two separate experience memories for student and teacher, and (2) an uncertainty estimated by dropout to control the timing of consultation and learning. Simulation experiments showed that the proposed approach can significantly improve both the safety and the efficiency of on-line policy optimization compared to other companion learning approaches as well as supervised pre-training using a static dialogue corpus.


Introduction
A task-oriented spoken dialogue system (SDS) is a system that can continuously interact with a human to accomplish a predefined task through speech. The dialogue manager, which maintains the dialogue state and decides how to respond, is the core of an SDS. In this paper, we focus on the dialogue policy.
In early research, spoken dialogue systems assumed observable dialogue states, and the dialogue policy was simply a set of hand-crafted rules mapping states to machine actions. This is referred to as a rule-based policy; it often has acceptable performance but cannot adapt itself. Rule-based policies remain popular in commercial dialogue systems.
However, in real-world scenarios, unpredictable user behavior and inevitable errors in automatic speech recognition and spoken language understanding make it difficult to maintain the true dialogue state and to make decisions. Hence, in recent years, there has been a research trend towards statistical dialogue management. A well-founded theory for this is the partially observable Markov decision process (POMDP) (Kaelbling et al., 1998), which can provide robustness to errors from the input module and automatic policy optimization by reinforcement learning. Most POMDP-based policy learning research is carried out using either a user simulator or employed users (Williams and Young, 2007). The trained policy is not guaranteed to work well in real-world scenarios. Therefore, on-line policy training has been of great interest (Gašić et al., 2011). Recently, Chen et al. (2017) proposed two qualitative metrics to measure on-line policy learning: safety and efficiency. Safety reflects whether the initial policy can satisfy the quality-of-service requirement in real-world scenarios during the on-line policy learning period. Efficiency reflects how long it takes for the on-line policy training algorithm to reach a satisfactory performance level.
Most traditional RL-based policy training suffers from poor initial performance, which causes the safety problem. In light of this, Chen et al. (2017) proposed a safe and efficient on-line policy optimization framework, companion teaching (CT), in which a human teacher is added to the classic POMDP. The teacher has two missions: one is to show example actions, the other is to act as a critic and give the student an extra reward, which can make policy learning more efficient. The example actions not only make the learning safer but can also be used directly in the training of the student policy. However, teaching by a human teacher is costly. Based on CT, a companion learning (CL) framework is proposed to integrate rule-based policy and RL-based policy, resulting in safe and efficient on-line policy learning. Here, the rule-based policy acts as a virtual teacher which replaces the human teacher in CT. There are a few differences between these two kinds of teachers. First, because it has no marginal cost once deployed, the rule teacher can be consulted at any time if needed. Second, the rule policy is not as good as the human teacher, so it is important to determine when and how much the student policy depends on the rule teacher. Here, we propose an agent-aware dropout Deep Q-Network (AAD-DQN) as the student statistical policy, which provides (1) two separate experience replay pools for student and teacher, and (2) an uncertainty estimated by dropout which can be used to control the timing of consultation and learning.
In summary, our main contributions are threefold: (1) A companion learning (CL) framework is proposed to integrate rule-based policy and RL-based policy. (2) An agent-aware dropout Deep Q-Network (AAD-DQN) is proposed as the statistical student policy. (3) Compared with other companion teaching approaches (Chen et al., 2017) as well as supervised pre-training using a static dialogue corpus (Fatemi et al., 2016), CL with AAD-DQN achieves better performance.

Related Work
Most previous studies of on-line policy learning have focused on the efficiency issue, such as Gaussian Process Reinforcement Learning (GPRL). In GPRL, the kernel function defines prior correlations of the objective function given different belief states, which can significantly speed up policy learning (Gašić and Young, 2014). Alternative methods include Kalman temporal difference reinforcement learning (Pietquin et al., 2011).
More recently, deep reinforcement learning (DRL) (Mnih et al., 2015) has been applied to dialogue policy optimization, including deep Q-Network (DQN) (Cuayáhuitl et al., 2015; Fatemi et al., 2016; Zhao and Eskenazi, 2016; Lipton et al., 2016) and policy gradient (PG) methods, e.g. REINFORCE (Williams and Zweig, 2016; Su et al., 2016; Williams et al., 2017) and Advantage Actor-Critic (A2C) (Fatemi et al., 2016). In order to speed up the learning of DQN, Lipton et al. (2016) proposed an efficient exploration technique based on Thompson sampling from a Bayesian neural network. Furthermore, they showed that using a few successful dialogues generated by a rule-based policy to pre-fill the replay buffer can benefit learning at the beginning. To improve the efficiency of PG methods, the policy network is initialized with supervised learning (SL) before RL training (Williams and Zweig, 2016; Williams et al., 2017; Su et al., 2016, 2017; Fatemi et al., 2016), which is similar to the idea in (Silver et al., 2016). However, combining RL with SL for dialogue policy optimization is not new. Henderson et al. (2008) were among the first to demonstrate the benefits of combining supervised and reinforcement learning. In the experiments, we will compare CL with these pre-training methods.
Although the improvement of efficiency can benefit the safety of learning process, no matter how efficient the algorithm is, an unsafe on-line learned policy can lead to bad user experience at the beginning of learning period and consequently fail to attract sufficient real users to continuously improve the policy. Therefore, it is important to address the safety issue. There are few works about the safety issue of on-line dialogue policy optimization. Williams (2008) proposed a method for integrating business rules and POMDPs. The rules act as the action mask, i.e. the rules nominate a set of one or more actions, and the POMDP chooses the optimal action.

Companion Learning for On-line Policy Optimization
In the CL framework, there are two agents: the student policy and the teacher policy. The teacher policy is the extra component compared with the classic statistical dialogue manager architecture (Young et al., 2013). The goal of on-line policy training is to optimize the student policy from data via interaction with users in real scenarios. The teacher guides the policy learning at each turn as a companion of the dialogue policy, hence the name companion learning (see footnote 2). The CL framework is described in Figure 1(a). At each turn, the input module (ASR and SLU) receives an acoustic input signal from the human user and the dialogue state tracker keeps the dialogue state up to date. The dialogue state is then transmitted to both the student policy and the teacher policy. The student policy first generates a candidate action $a^{stu}_t$; when it needs help from the teacher policy, it sends $a^{stu}_t$ to the teacher together with some auxiliary information. The teacher policy can then help the student policy in one or both of the following ways:
• Example Action (EA): The teacher generates an action $a^{tea}_t$ according to its own policy, which is used instead of $a^{stu}_t$. It corresponds to the left switch in Figure 1(a).
• Critic Advice (CA): The teacher does not explicitly show an action. Instead, it gives an extra reward $r^{int}_t$ to the student policy. It corresponds to the right switch in Figure 1(a).
The action from the control module is then transmitted to the output module, which generates the natural-language text and audio. At each turn, an extrinsic reward signal $r^{ext}_t$ is given to the student policy by the environment, i.e. the user. The extrinsic reward $r^{ext}_t$, together with the extra intrinsic reward $r^{int}_t$, is used to update the policy parameters $\theta$ using reinforcement learning algorithms.
Footnote 2: The name companion learning has another potential meaning, namely that the agents can learn from each other: the rules guide the RL training, and the optimised RL policy can provide some intuition for the revision of the rules. We will give some preliminary discussion of this point in section 5.3.
In the CL framework, there are two things that matter: one is when to consult the teacher, another is how to use the teacher's experiences. In this paper, an agent-aware dropout DQN (AAD-DQN) is proposed. As shown in Figure 1(b), the certainty information during the interaction is used to define a companion function, which controls how often to sample the teacher's experiences for updating parameters during the training phase (left), and when to use EA or CA teaching method during decision phase (right).
The rest of this section is organized as follows. The next subsection introduces the agent-aware experience replay in DQN. The definition of certainty in DQN and the companion function are presented in subsection 3.3. The rule-based teacher policy is described in subsection 3.4.

Agent-Aware Experience Replay in DQN
A Deep Q-Network (DQN) is a multi-layer neural network which maps a belief state $b_t$ to the Q-values of the possible actions $a_t$ at that state, $Q(b_t, a_t; \theta)$, where $\theta$ is the weight vector of the neural network. Neural networks for the approximation of value functions have long been investigated (Lin, 1993). However, these methods were previously quite unstable (Mnih et al., 2013). In DQN, Mnih et al. (2013, 2015) proposed two techniques to overcome this instability, namely experience replay and the use of a target network. At every turn, the transition consisting of the previous belief state $b_t$, the previous action $a_t$, the corresponding reward $r_t$ and the current belief state $b_{t+1}$ is put in a finite pool (Lin, 1993). In this paper, two pools $D^{stu}$ and $D^{tea}$ are used to store the student's experiences and the teacher's experiences respectively, as shown in Figure 1(b). When the teaching method EA is used in the $t$-th turn, $a_t = a^{tea}_t$ and the transition is put in $D^{tea}$; otherwise $a_t = a^{stu}_t$ and the transition is put in $D^{stu}$. When CA is used, $r_t = r^{ext}_t + r^{int}_t$; otherwise $r_t = r^{ext}_t$. Once either pool has reached its predefined maximum size, adding a new transition results in deleting the oldest transition in that pool. During training, a pool is first selected from $D^{tea}$ and $D^{stu}$. The probability of selecting $D^{tea}$ is $p^{tea}$, i.e. $D \sim \mathrm{Ber}(D^{tea}, D^{stu}; p^{tea})$ (see footnote 3). Then a minibatch of transitions is uniformly sampled from the selected pool, i.e. $(b_t, a_t, r_t, b_{t+1}) \sim U(D)$. We call this agent-aware experience replay.
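The two-pool replay described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the class name `AgentAwareReplay`, its method names, and the fallback to the other pool when the selected pool is empty are assumptions made for the example.

```python
import random
from collections import deque


class AgentAwareReplay:
    """Two bounded FIFO pools: D_tea for teacher and D_stu for student transitions."""

    def __init__(self, capacity, seed=None):
        # deque(maxlen=...) drops the oldest transition once a pool is full.
        self.pools = {"tea": deque(maxlen=capacity), "stu": deque(maxlen=capacity)}
        self.rng = random.Random(seed)

    def add(self, b_t, a_t, r_ext, b_next, used_ea=False, r_int=0.0):
        # With EA the executed action came from the teacher, so the transition
        # goes to D_tea; otherwise it goes to D_stu. With CA, the caller passes
        # a non-zero r_int, which is added to the extrinsic reward.
        pool = self.pools["tea" if used_ea else "stu"]
        pool.append((b_t, a_t, r_ext + r_int, b_next))

    def sample(self, batch_size, p_tea):
        # Bernoulli choice of the pool, then uniform sampling inside it.
        name = "tea" if self.rng.random() < p_tea else "stu"
        if not self.pools[name]:
            # Assumption (not in the paper): use the other pool if this one is empty.
            name = "stu" if name == "tea" else "tea"
        pool = self.pools[name]
        k = min(batch_size, len(pool))
        return name, self.rng.sample(list(pool), k)
```

Keeping the pools separate means the share of teacher experience in training is set by $p^{tea}$ alone, not by how many teacher transitions happen to be stored.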
In addition to experience replay, a target network with weight vector $\theta^-$ is used. This target network is similar to the Q-network except that its weights are only copied from the Q-network every $K$ steps and remain fixed during all other steps. The loss function for the Q-network at each iteration takes the following form:
$$L(\theta) = \mathbb{E}_{D \sim \mathrm{Ber}(D^{tea}, D^{stu}; p^{tea}),\ (b_t, a_t, r_t, b_{t+1}) \sim U(D)} \left[ \left( r_t + \gamma \max_{a'} Q(b_{t+1}, a'; \theta^-) - Q(b_t, a_t; \theta) \right)^2 \right] \quad (1)$$
where $\gamma$ is the discount factor. The probability $p^{tea}$ controls how often the student learns from the teacher's experiences. As learning goes on, this probability decreases. More details are given in the next section.
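To make the role of the target network concrete, the following sketch computes the standard DQN temporal-difference target and the mean squared error for a minibatch. It assumes that `q_online` and `q_target` are callables returning a list of Q-values for a belief state, and that `gamma` is the discount factor; these names are hypothetical, and terminal-state handling is omitted for brevity.

```python
def td_targets(batch, q_target, gamma=0.99):
    """y_t = r_t + gamma * max_a' Q(b_{t+1}, a'; theta^-) for each transition."""
    return [r + gamma * max(q_target(b_next)) for (_, _, r, b_next) in batch]


def dqn_loss(batch, q_online, q_target, gamma=0.99):
    """Mean squared TD error of the online network over a minibatch."""
    targets = td_targets(batch, q_target, gamma)
    errors = [(y - q_online(b)[a]) ** 2 for (b, a, _, _), y in zip(batch, targets)]
    return sum(errors) / len(errors)
```

In a full training loop, `q_target` would be refreshed from `q_online` every $K$ steps and held fixed in between.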

Companion Strategy
It's important for the student to estimate an appropriate point to end the reliance on the teacher. If the reliance is ended too early, the student itself may not reach an acceptable performance, resulting in the sharp drop of performance, which is the safety problem. However, if the student always relies on the teacher, it's hard to improve its performance to surpass the teacher's performance, which is the efficiency problem.
We take some inspiration from the learning process of a call center service agent. Consider how a new call center agent gets started. At first, an experienced agent tells the new agent some basic rules, and the new agent works by often consulting these rules. The new agent's confidence about how to make decisions gradually increases with continued practice. Eventually, the agent is so confident about their own decisions that they no longer need to consult the rules, and they even discover better ways of responding, through interaction with users, that were not initially included in the rules. Similarly, we can use the uncertainty/certainty of the Q-network to determine the teaching time.
Footnote 3: Ber is short for Bernoulli.
There are several methods to estimate the uncertainty/certainty in deep neural networks, e.g. Bayesian neural networks (Blundell et al., 2015), dropout (Gal and Ghahramani, 2016), and bootstrap (Osband et al., 2016). Here we use dropout to estimate the certainty of the Q-network. We call this Q-network DropoutQNetwork. Dropout is a technique used to avoid over-fitting in neural networks. It was introduced by Hinton et al. (2012) and studied more extensively by Srivastava et al. (2014). When dropout is used in training, the elements of the output $h$ of each hidden layer are randomly set to zero with probability $p$, i.e. $h = h \odot z$, where $\odot$ is the element-wise product, $z$ is a binary vector and each element $z_i \sim \mathrm{Ber}(1 - p)$. $h$ is then scaled by $\frac{1}{1-p}$ and fed to the next layer. At test time dropout is disabled, i.e. the output $h$ of each hidden layer is fed directly to the next layer. Although dropout was suggested as an ad-hoc technique, it was recently shown theoretically that dropout training in deep neural networks is an approximate Bayesian inference in deep Gaussian processes (Gal and Ghahramani, 2016). A direct result of this theory is a tool for modelling uncertainty with dropout neural networks. To obtain the uncertainty, dropout is also enabled at the test phase, as it is in the training phase. For each input instance (i.e. dialogue belief state) $b_t$, $N$ stochastic forward passes through the network are performed, and the outputs $q_i = [q_{i1}, \cdots, q_{iM}]$ are used to compute the mean and the variance. Generally, the variance can be used to measure the uncertainty of the output. However, it is not a normalized criterion, and it is hard to set a threshold below which we should be confident in the output.
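The dropout mechanics and the mean/variance estimate described above can be illustrated with plain Python lists. This is a sketch under simplifying assumptions: `forward` stands for any callable that runs one dropout-enabled pass and returns a list of Q-values, and the function names are hypothetical.

```python
import random


def dropout(h, p, rng, train=True):
    """Inverted dropout: zero each element with probability p, scale survivors by 1/(1-p)."""
    if not train:
        return list(h)  # dropout disabled: pass the activations through unchanged
    # Each element is kept with probability 1 - p, i.e. z_i ~ Ber(1 - p).
    return [x / (1.0 - p) if rng.random() >= p else 0.0 for x in h]


def mc_mean_var(forward, b_t, n_passes):
    """Per-action mean and variance over n_passes stochastic forward passes."""
    outs = [forward(b_t) for _ in range(n_passes)]
    m = len(outs[0])
    mean = [sum(q[j] for q in outs) / n_passes for j in range(m)]
    var = [sum((q[j] - mean[j]) ** 2 for q in outs) / n_passes for j in range(m)]
    return mean, var
```

The variance returned here is on the scale of the Q-values, which illustrates why a fixed confidence threshold on it is hard to choose.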
Instead, we propose a novel method to measure the certainty of the decision of the student policy at the $t$-th turn. For each stochastic forward pass, the action $a_{ti} = \arg\max_j q_{ij}$ is regarded as a vote. After $N$ passes (see footnote 5), there is a committee $\{a_{t1}, \cdots, a_{tN}\}$ consisting of $N$ votes. The action $a^{stu}_t$ that should be taken in the belief state $b_t$ is the one with the largest percentage of the votes, and the corresponding percentage is defined as the certainty $c_t$. The process is described in Algorithm 1.
Algorithm 1 The Decision Procedure of Student Policy $\pi^{stu}(b_t, N)$
Require: The number of repetitions $N$ and the belief state $b_t$
1: Initialize the probability vector $p = [p_1, \cdots, p_M]$ as the zero vector, where $M$ is the number of actions.
2: for $i = 1, N$ do

At the end of the $e$-th dialogue, the average certainty over all turns is computed, i.e. $C_e = \frac{1}{T_e} \sum_{t=0}^{T_e} c_t$, where $T_e$ is the number of turns in the $e$-th dialogue. Generally, the variance of $C_e$ between successive dialogues is high. To smooth the estimate, we use the moving average of $C_e$ over the previous $W$ dialogues to represent the certainty of the student at the current dialogue.
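The voting procedure can be sketched as follows. This is a minimal illustration of the description in the text, not a transcription of Algorithm 1: `stochastic_q` is assumed to be a callable that runs one dropout-enabled forward pass and returns a list of Q-values, and both function names are hypothetical.

```python
from collections import Counter


def student_decision(stochastic_q, b_t, n_passes):
    """Return (action, certainty) from n_passes dropout-enabled forward passes."""
    votes = []
    for _ in range(n_passes):
        q = stochastic_q(b_t)
        # Each pass votes for its greedy action, arg max_j q_ij.
        votes.append(max(range(len(q)), key=q.__getitem__))
    action, count = Counter(votes).most_common(1)[0]
    # The certainty c_t is the share of votes won by the chosen action.
    return action, count / n_passes


def dialogue_certainty(certainties):
    """Average certainty C_e over the recorded turns of one dialogue."""
    return sum(certainties) / len(certainties)
```

Because the certainty is a vote share, it always lies between $1/N$ and 1, which makes a fixed threshold easier to set than one on the raw variance.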
As training goes on, $C_e$ grows until it converges. If $C_e$ is greater than a threshold $C_{th}$ in all of $W$ successive dialogues, as shown in Figure 2, it is assumed that the student has reached a point where it is steadily confident enough in its own decisions. Therefore, the teaching, both EA and CA, should be ended from then on. Before the end of the teaching, CA is done in all turns. However, if EA is always done, the disappearance of the teacher may cause a dramatic change in the hybrid decision policy, which results in a sharp drop in performance. To deal with this issue, a monotonically increasing function of the relative certainty, $P^{tea}(\Delta C_e)$, is proposed to control the frequency of EA teaching. $\Delta C_e$ represents the distance between $C_e$ and $C_{th}$, i.e. $\Delta C_e = \max(0, C_{th} - C_e)$. The effect of $P^{tea}(\Delta C_e)$ is that the closer $C_e$ is to $C_{th}$, the less likely EA teaching is to be executed. Besides controlling how often the student directly consults the teacher, another mission of $P^{tea}(\Delta C_e)$ is to control how often the teacher's experiences are replayed, i.e. the probability $p^{tea}$ described in section 3.2. Implementation details of $P^{tea}(\Delta C_e)$ are described in Appendix C. The full procedure of companion learning with logic rules is described in Algorithm 2.
Footnote 5: The dialogue state can be repeated $N$ times to form a mini-batch, so that a single forward pass yields the $N$ outputs simultaneously.
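The stopping criterion and the relative certainty can be sketched as below. The exact form of $P^{tea}$ is given in Appendix C and is not reproduced here; `p_tea_linear` is a purely hypothetical stand-in that only satisfies the stated requirement of increasing monotonically with $\Delta C_e$. All function names are assumptions made for the example.

```python
def relative_certainty(c_e, c_th):
    """Delta C_e = max(0, C_th - C_e)."""
    return max(0.0, c_th - c_e)


def teaching_should_end(history, c_th, window):
    """True once the certainties of the last `window` dialogues all exceed C_th."""
    recent = history[-window:]
    return len(recent) == window and all(c > c_th for c in recent)


def p_tea_linear(delta_c, c_th):
    """Hypothetical P_tea: grows linearly with Delta C_e, clipped to [0, 1]."""
    return min(1.0, delta_c / c_th)
```

With any such function, EA is sampled less and less often as the student's certainty approaches the threshold, so the hand-over from teacher to student is gradual, not abrupt.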

Teacher Policy: Logic Rules
Rule-based policy is popular in commercial dialogue systems (Williams, 2008). The policy, i.e. the dialogue plan/flow, is designed by a domain expert, whose knowledge of the task domain and business rules is encoded in the rules. There are many methods to represent the decision rules, e.g. propositional logic, first-order logic, and decision trees. Here, we use ordered propositional logic rules, which can be easily translated into IF-THEN rules. When making a decision, these rules are executed in a pre-defined order. If the conditions of any rule are satisfied, the decision process terminates and the output is the corresponding action. In this paper, three hand-crafted logic rules, R1, R2, and R3, were used as the teacher:
• R1: confirm the most likely value in slots where the most likely value has probability between 0.1 and 0.6;
• R2: offer a restaurant if there is at least one slot in which the belief of the most likely value is more than the belief of the special value "none";
• R3: request values for a slot which is uniformly selected from a pre-defined slot list.

Algorithm 2 Companion learning with logic rules (only the following steps are recoverable)
9: for $t = 0, T_e$ do
10: Set intrinsic reward $r^{int}_t \leftarrow 0$
11: Get system action and the corresponding certainty, i.e. $a^{stu}_t, c_t \leftarrow \pi^{stu}(b_t, N)$
12: $C_e \leftarrow C_e + c_t$
13: Get action from the rule-based policy, i.e. $a^{tea}_t \leftarrow \pi^{tea}(b_t)$
20: if teaching is True then
23: $C_e \leftarrow \frac{1}{T_e} C_e$, and store $C_e$ in $C$
24: Give the action $a_t$ to the environment, observe the extrinsic reward $r^{ext}_t$ and update the dialogue belief state $b_{t+1}$
25: if EA is True then
Sub-procedure returning teaching and $p^{tea}$ (final steps only):
11: $p^{tea} \leftarrow P^{tea}(\Delta C_e)$
12: end if
13: return teaching, $p^{tea}$
The corresponding pseudo-code is presented in Appendix B.
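As an illustration of how ordered IF-THEN rules of this kind can be executed, the sketch below implements R1 to R3 over a simplified belief state. It is not the Appendix B pseudo-code. The belief representation (a dict mapping each slot to a dict of value probabilities that includes "none"), the action tuples, and the inclusive reading of "between 0.1 and 0.6" are all assumptions made for the example.

```python
import random

NONE = "none"


def top_value(dist):
    """Most likely non-'none' value of a slot and its probability."""
    candidates = {v: p for v, p in dist.items() if v != NONE}
    if not candidates:
        return None, 0.0
    value = max(candidates, key=candidates.get)
    return value, candidates[value]


def rule_policy(belief, request_slots, rng=random):
    # R1: confirm a slot whose most likely value has probability in [0.1, 0.6].
    for slot, dist in belief.items():
        value, prob = top_value(dist)
        if value is not None and 0.1 <= prob <= 0.6:
            return ("confirm", slot, value)
    # R2: offer if some slot's most likely value beats the special value "none".
    for slot, dist in belief.items():
        value, prob = top_value(dist)
        if value is not None and prob > dist.get(NONE, 0.0):
            return ("offer",)
    # R3: otherwise request a slot chosen uniformly from the pre-defined list.
    return ("request", rng.choice(request_slots))
```

The rules are tried in order and the first one whose condition holds determines the action, which mirrors the early-termination behaviour described above.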

Evaluation Metrics of On-line Policy Optimization
Most previous work on the evaluation of RL-based dialogue policy optimization focuses on the final performance (FP) when the system converges to a steady level. However, for on-line policy optimization, it is also important to measure the learning process. In addition to FP, we propose two quantitative metrics: safety loss and efficiency loss.

Safety Loss
In the on-line training process, until the performance of the system reaches the acceptable performance $S_a$, the interaction between users and the system is unsafe and hinders continued training. The safety of the system is therefore defined as the system's ability to maintain performance above the acceptable performance $S_a$. We quantify the safety loss of the system by summing up the performance gap between the acceptable performance and the system performance $S_e$ in every episode during on-line learning, over all $E$ dialogues. The safety loss has an intuitive interpretation as the area of the region below the threshold and above the training curve. This metric is similar to the integral of absolute error (IAE) (Shinners, 1998) metric commonly adopted in the evaluation of control systems (Gaing, 2004; Jesus and Tenreiro Machado, 2008).
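One plausible reading of this definition is $\sum_{e=1}^{E} \max(0, S_a - S_e)$. The clipping at zero is an assumption, inferred from the "area below the threshold and above the training curve" interpretation, and the paper's exact formula may differ (for example in normalization). A minimal sketch with a hypothetical function name:

```python
def safety_loss(performance, s_a):
    """Sum over episodes of the shortfall of S_e below the acceptable level S_a."""
    return sum(max(0.0, s_a - s_e) for s_e in performance)
```

Episodes in which the system already performs above $S_a$ contribute nothing, so the loss only accumulates while the system is unsafe.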

Efficiency Loss
Another important issue of on-line learning is efficiency. Efficiency indicates the speed at which the system reaches a specific performance level. In reality, we can tolerate a system that makes mistakes at the beginning, but it should improve at a significant speed until it reaches the ideal performance $S_i$. Therefore, later failures should weigh more than early failures when evaluating efficiency. Similar to the integral of time multiplied by absolute error (ITAE) (Shinners, 1998) metric, we propose a metric called efficiency loss. We multiply the performance gap between the ideal performance and the current performance by the episode index, thus giving later failures a greater penalty. More illustrations of safety loss and efficiency loss are given in Appendix D.
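Following the description above, one plausible form of the efficiency loss is $\sum_{e=1}^{E} e \cdot \max(0, S_i - S_e)$. The clipping at zero and the absence of a normalizing constant are assumptions, and the function name is hypothetical.

```python
def efficiency_loss(performance, s_i):
    """Episode-index-weighted shortfall of S_e below the ideal level S_i."""
    return sum(
        e * max(0.0, s_i - s_e)
        for e, s_e in enumerate(performance, start=1)
    )
```

For instance, under this reading a shortfall of 0.5 in the second episode costs twice as much as the same shortfall in the first, which is the intended penalty on slow learners.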

Experiments
Our experiments have three objectives: (1) Comparing our proposed dropout DQN in Algorithm 1 with some baselines when there is no teacher. (2) Comparing CL with two other baselines when the teacher gets involved, and investigating the benefits of our proposed agent-aware experience replay. (3) Visually analyzing the differences in behavior between the rule-based teacher policy and the optimized student policy.
An agenda-based user simulator (Schatzmann et al., 2007a) with an error model (Schatzmann et al., 2007b) was implemented to emulate the behavior of the human user, and the rule-based policy described in section 3.4, which has a success rate of 0.695, was used as the teacher in our experiments. The user's purpose in interacting with the SDS is to find restaurant information in the Cambridge (UK) area (Henderson and Thomson, 2014). This domain has 7 slots, of which 4 can be used by the system to constrain the database search. The summary action space consists of 16 summary actions. More details are described in Appendix A.
For the reward, an extrinsic reward of −0.05 is given to the student policy at each turn. At the end of the dialogue, a reward of +1 is given for dialogue success. The maximal extra reward δ is 0.05. For each set-up, 10000 dialogues are used for training, and the moving dialogue success rate is recorded with a window size of 1000. The final results are the average of 40 runs.
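The reward scheme and the moving success rate can be sketched as follows. This is an illustration under the stated settings (−0.05 per turn, +1 on success, window of 1000); the function names are hypothetical, and the undiscounted sum is an assumption.

```python
def dialogue_return(num_turns, success, turn_reward=-0.05, success_reward=1.0):
    """Undiscounted sum of extrinsic rewards over one dialogue."""
    return num_turns * turn_reward + (success_reward if success else 0.0)


def moving_success_rate(outcomes, window=1000):
    """Success rate over the most recent `window` dialogues, for each dialogue index."""
    rates, running = [], 0
    for i, ok in enumerate(outcomes):
        running += int(ok)
        if i >= window:
            # Drop the outcome that has just left the window.
            running -= int(outcomes[i - window])
        rates.append(running / min(i + 1, window))
    return rates
```

As a rough consistency check, a policy with success rate 0.767 and 5.10 turns on average (the reference values quoted for the optimised student policy) would have an expected return of about 0.767 − 0.05 × 5.10 ≈ 0.512, close to the reported reward of 0.5124.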

Policy Learning without Teaching
In this section, four policies without teaching are compared:
• DQN: A vanilla deep Q-Network (Mnih et al., 2015) which has two hidden layers, each with 128 nodes.
• A2C: An advantage actor-critic policy which consists of an actor network and a critic network (Fatemi et al., 2016).
• Dropout DQN 1 and Dropout DQN 32: Both have a dropout layer after each hidden layer. The dropout rate is 0.2. Their difference is that the number of stochastic forward passes $N$ in Algorithm 1 is 32 for Dropout DQN 32 and 1 for Dropout DQN 1. Dropout DQN 1 makes its decision according to a single output of the Q-network, similar to vanilla DQN. Dropout DQN 1 was first proposed in (Gal and Ghahramani, 2016), where it was shown to obtain more efficient exploration.
The learning curves are shown in Figure 3 and the evaluation results are given in Table 1. The exploration benefit of dropout can be observed, as claimed in (Gal and Ghahramani, 2016). However, Dropout DQN 1 seems to suffer from premature and sub-optimal convergence, while our proposed Dropout DQN 32, whose decision is based on multiple votes (Algorithm 1), improves efficiency and achieves better final performance. Moreover, Dropout DQN 32 also performs much better than the policy gradient method A2C. For the following experiments, the number of stochastic forward passes $N$ in Algorithm 1 is 32.

Policy Learning with Teaching
In this section, four methods of teaching by the rule-based policy are compared:
• EA: 500 dialogues are taught with EA at the beginning (Chen et al., 2017).
• A2C PreTrain: At the beginning, 500 dialogues are collected with the rule-based policy. These examples are used to pre-train the actor network with supervised learning. After the pre-training, the policy is continuously optimized with the A2C algorithm (Fatemi et al., 2016).
• CL AAD: Full CL with AAD-DQN described in section 3.
• CL D: CL without agent-aware experience replay, i.e. the teacher's experiences and the student's experiences are put in one pool and are uniformly sampled for the experience replay in equation (1).
As can be seen in Figure 4, there is a big dip in the performance of A2C PreTrain. One possible explanation is that, because the rule-based policy is sub-optimal, the pre-training drives the student policy to a local minimum. The RL training must first make it escape from this local minimum, which results in a temporary loss in performance.
Figure 4: Comparison of four methods with teaching by rule-based teacher.
Comparing the CL methods (CL D and CL AAD) with EA in Figure 4 and in Table 1, we can conclude that CL can significantly boost the safety of the learning process. Moreover, beyond safety, CL AAD can also boost efficiency, which is a benefit of the agent-aware experience replay.

Comparison of Optimized Student Policy and Rule-based Teacher Policy
To interpret what the student has learnt, we further compare the rules with an optimized student policy that has a 76.7% success rate. The rule-based policy is used to collect 5000 dialogues, while in each turn the decision made by the student policy is also recorded. Figure 5 is a confusion matrix. The x-axis denotes the student's decision and the y-axis denotes the rules' decision. The numbers on the left are the statistics for each action in the 5000 dialogues. Each element in the matrix denotes the normalized number of turns in which the rule chooses the action in the corresponding row and the student chooses the action in the corresponding column. As shown in Figure 5, offer and confirm are the two action types used most frequently. In more than half of the turns in which the rule-based policy chooses offer 1, the student policy chooses a different action. Furthermore, from the element in row offer 1 and column request area, we can see that in this situation the student policy prefers the action request area. Inspired by this disagreement, we designed a new rule:
• R4: request values for slot area when there is only one other slot constraint for the database query.
Figure 5: Confusion matrix between the decisions of the rule-based policy and the decisions of the optimised student policy. The x-axis denotes the student's decision and the y-axis denotes the rules' decision.
Similarly, as can be seen in Figure 5, in a considerable proportion of the turns in which the rule-based policy chooses confirm area, confirm pricerange, or confirm name, the student policy chooses the action offer 2. This may mean that for the slots area, pricerange, and name, when there are values for the database query, the system should offer a restaurant instead of confirming the slot-value constraints. Therefore, the rule R1 in section 3.4 was revised as follows:
• R1*: For slot food, confirm the most likely value if it has probability between 0.1 and 0.6. For slots area, pricerange and name, confirm the most likely value if its belief is smaller than the belief of the special value "none" and larger than 0.1.
Table 2: Evaluation results of different ordered rules. As a reference, the performance of the optimised student policy is success rate 0.767, #turn 5.10 and reward 0.5124.
The rule R1* can both boost the success rate and decrease the dialogue length (comparing row 3 with row 1 of Table 2). The combination of R4 and R1* takes the respective advantages of both (comparing row 4 with rows 1, 2 and 3). The performance of the final ordered rules is comparable to the performance of the optimized student policy. It is worth noting that the primary rules R1, R2, and R3 in section 3.4 do not distinguish between different slots. However, the new rules R4 and R1* are both slot-specific, and such rules are difficult to design at the outset.

Conclusion
This paper has proposed a companion learning framework to unify rule-based policy and RL-based policy. Here, the rule-based policy acts as a teacher, which either directly shows an example action or gives an extra reward. Based on the uncertainty estimated using a dropout Q-network, a companion strategy is proposed to control when the student policy directly consults the rules and how often the student policy learns from the teacher's experiences. Simulation experiments showed that our proposed framework can significantly improve both the safety and the efficiency of on-line policy optimization. Additionally, we visually analyzed the differences in behavior between the rule-based teacher policy and the optimized student policy, which gave us some inspiration for refining the rules.