Affordable On-line Dialogue Policy Learning

The key to building an evolvable dialogue system in real-world scenarios is to ensure affordable on-line dialogue policy learning, which requires the on-line learning process to be safe, efficient and economical. In reality, due to the scarcity of real interaction data, a dialogue system usually improves slowly. Moreover, a poor initial dialogue policy easily leads to bad user experience and fails to attract users to contribute training data, so the learning process is unsustainable. To depict this accurately, we propose two quantitative metrics to assess the safety and efficiency issues. To solve the unsustainable learning problem, we propose a complete companion teaching framework that incorporates guidance from a human teacher. Since human teaching is expensive, we compare various teaching schemes, answering the questions of how and when to teach, to utilize the teaching budget economically and make the on-line learning process affordable.


Introduction
A task-oriented dialogue system is designed to interact with human users to accomplish tasks in several predefined domains (Young et al., 2013; Daubigney et al., 2012). The dialogue manager is the core component of a typical dialogue system; it controls the flow of the dialogue through a state tracker and a policy module (Levin et al., 1997). The state tracker maintains the internal state of the system, while the policy module decides the response to the user according to that state (Sun et al., 2014a; Thomson and Young, 2010). (* Both authors contributed equally to this work.) Approaches to constructing a policy module fall into two categories: rule-based and statistical. Rule-based policies are usually hand-crafted by domain experts, which makes them inconvenient and difficult to optimize (Williams and Young, 2007; Wang and Lemon, 2013). In recent mainstream statistical studies, the Partially Observable Markov Decision Process (POMDP) framework has been applied to model dialogue management with unobservable states, where policy training can be formulated as a Reinforcement Learning (RL) problem, enabling the policy to be optimized automatically (Kaelbling et al., 1998; Arnold, 1998; Young et al., 2013).
Though RL-based approaches have the potential to improve as they interact with more human users, and can outperform rule-based approaches, they are rarely used in real-world applications, especially in on-line scenarios, because the training process is unsustainable. The main causes of unsustainable on-line dialogue policy learning are two-fold:

• Safety issue: a policy trained from scratch may lead to a terrible user experience in the early training period, and thus fail to attract enough users for more dialogues to further train the policy.

• Efficiency issue: if policy learning does not progress efficiently, it will exhaust users' patience before the policy reaches a desirable performance level.
Prior works have mainly focused on improving efficiency, for example with Gaussian Processes RL (Gašić et al., 2010) and deep RL (Fatemi et al., 2016). For deep RL approaches, recent research on the student-teacher RL framework has shown prominent acceleration of the policy learning process (Torrey and Taylor, 2013; Williams and Zweig, 2016; Amir et al., 2016). In such a framework, the teacher agent instructs the student agent by suggesting what actions should be taken next (Clouse, 1996).
For the safety issue, Chen et al. (2017) developed several teaching strategies answering "how" the human teacher should guide the learning process.
However, those previous teaching methods leave "when" to teach out of consideration. They simply exhaust the entire budget continuously from the beginning, which is wasteful and imposes a heavy workload on the human teacher. Affordable dialogue policy learning with human teaching should require a lighter workload and utilize the teaching budget economically.
Furthermore, for safety and efficiency evaluation, previous works have compared training and testing curves by inspection, or evaluated policy performance after a certain number of training dialogues, both of which are subjective and error-prone (Chen et al., 2015a; Chen et al., 2017).
Our contribution addresses the above problems. We propose a complete companion teaching framework and develop various teaching schemes, which combine different teaching strategies and teaching heuristics to answer the questions of "how" and "when" to teach, to achieve affordable dialogue policy learning (section 2). Specifically, a novel failure prognosis based teaching heuristic is proposed, in which MultiTask Learning (MTL) is utilized to predict the dialogue success reward (section 3). To avoid the drawbacks of traditional subjective measurements, we propose two evaluation metrics, Risk Index (RI) and Hitting Time (HT), to quantify the safety and efficiency of on-line policy learning respectively (section 4). Simulation experiments show that, with the proposed companion teaching schemes, sustainable and affordable on-line dialogue policy learning can be achieved (section 5).

Companion Teaching Framework
The companion teaching framework is an on-line policy training framework with three intelligent participants: the machine dialogue manager, the human user, and the human teacher (Chen et al., 2017). Under this framework, the human teacher can accompany the dialogue manager to guide policy learning with a limited teaching budget. By investigating the real work mode of a call center, this framework makes the reasonable assumption that the human teacher has access to the extracted dialogue states from the dialogue state tracker as well as the system's dialogue act, and can also reply in the same format.
However, there are two major problems with the previous framework. First, the system judges whether a dialogue session succeeds by several simple rules and then decides whether to feed a success reward signal to the dialogue manager for reinforcement learning. Such system-made success feedback lacks flexibility and credibility, and it can mislead policy learning; a more suitable judge is the user or the human teacher. Second, the previous framework only answers in which way the human teacher can guide on-line dialogue policy learning; another essential question, when the human teacher should give guidance, remains undiscussed. In this paper, we propose a complete companion teaching framework, depicted in Figure 1. At each turn, the input module receives a speech input from the human user and produces possible utterances u_t of the speech in text. The dialogue state tracker then extracts the dialogue state s_t from the possible utterances. This dialogue state is shared with the policy model and, if needed, the human teacher. Once the final response a_t has been determined, the output module translates this dialogue act into natural language and replies to the human user. At the end of each session, the success signal is fed back by the user or the human teacher as an important part of the system reward. The human teacher can take the initiative, or be activated by a student-initiated heuristic, to give dialogue guidance with strategies corresponding to different configurations of the switches in the illustration. We call the combination of a strategy and a heuristic a teaching scheme.

Teaching Strategies
The teacher can choose among three teaching strategies corresponding to different configurations of the switches in the wiring diagram shown in Figure 1. The left switch is a Single-Pole, Double-Throw (SPDT) switch, which controls whether the answer is made by the system (connected to 1) or given by the teacher as an example (connected to 2). The right switch is a simple on-off switch, which represents whether there is an extra reward signal from the teacher (ON) or not (OFF). The strategy related to the right switch is called teaching via Critic Advice (CA), also known as turn-level reward shaping (Thomaz and Breazeal, 2008; Judah et al., 2010). When the switch at position 3 is turned on, the teacher gives the policy model an extra turn-level reward to distinguish the student's good actions from bad ones. The left switch corresponds to teaching via Example Action (EA), in which the teacher gives an example action for the student to take according to the student's state.
The third strategy, proposed by Chen et al. (2017), takes advantage of both EA and CA and is named teaching via Example Action with Predicted Critique (EAPC). With this strategy, the human teacher gives example actions; meanwhile, a weak action predictor is trained on this teaching information to provide the extra reward even in the teacher's absence.

Teaching Heuristics
The strategies only answer how the human teacher can offer companion teaching to the system. However, the timing of teaching should not be ignored if the limited teaching budget is to be used well. Exhausting the whole budget at the early training stage, named the Early teaching heuristic (Early), is simple and straightforward but wastes teaching opportunities on unnecessary cases. Thus, it is imperative to design effective heuristics to decide when the teacher should give the student a hand.
In addition to early teaching, teaching heuristics can be broadly divided into two categories: teacher-initiated heuristics and student-initiated heuristics (Amir et al., 2016). However, teacher-initiated approaches require the teacher's constant long-term attention to monitor the dialogue process (Torrey and Taylor, 2013; Amir et al., 2016), which is costly and impractical for real applications. Therefore, in this paper we only discuss student-initiated heuristics, shown as the line with a stopwatch in Figure 1, in which the student agent decides when to ask for the teacher's help.
Previous works have presented several effective heuristics based on state importance, I(s), which is determined by the Q-values of the RL agent:

I(s) = max_a Q(s, a) − min_a Q(s, a)

Torrey and Taylor (2013) proposed the State Importance based Teaching heuristic (SIT), which makes the student ask for advice only when the current state is important:

I(s_t) > t_si    (1)

where t_si is a fixed threshold for importance. Clouse (1996) proposed a State Uncertainty based Teaching heuristic (SUT), which asks for advice when the student is uncertain about which action to take:

I(s_t) < t_su    (2)

where t_su is a given threshold for uncertainty. Though teaching effort can be conserved by applying it only to important or uncertain states, advice may still be wasted if the dialogue is likely to succeed without teaching. In this paper, we propose a novel Failure Prognosis based Teaching heuristic (FPT) for on-line policy learning to reduce such unnecessary advice; the details are given in section 3. For comparison, we also investigate the Random teaching heuristic (Rand), in which the student seeks advice with a fixed probability p_r.
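Both trigger conditions depend only on the Q-values of the current state. A minimal sketch of the two heuristics follows; the function names are illustrative, and the default thresholds are taken from the experimental configurations in Table 1:

```python
import numpy as np

def state_importance(q_values):
    """Spread of the Q-values over actions: I(s) = max_a Q(s,a) - min_a Q(s,a)."""
    q = np.asarray(q_values, dtype=float)
    return float(q.max() - q.min())

def sit_trigger(q_values, t_si=5.0):
    """SIT: ask for advice only when the current state is important."""
    return state_importance(q_values) > t_si

def sut_trigger(q_values, t_su=10.0):
    """SUT: ask for advice when the student is uncertain, i.e. all actions
    look similarly valuable, so the importance spread is small."""
    return state_importance(q_values) < t_su
```

With a large Q-value spread, SIT fires and SUT does not; with nearly equal Q-values, the opposite holds.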
Failure Prognosis based Companion Teaching

The core idea of the FPT heuristic is to predict whether the ongoing dialogue will end in success and to ask for advice only when the current prediction is a failure. The proposed approach utilizes MultiTask Learning (MTL) in the policy model to estimate the future dialogue success reward and is compatible with various RL algorithms. In this paper, we implement the policy model with a Deep Q-Network (DQN), in which a neural network function approximator, named the Q-network, is used to estimate the action-value function (Mnih et al., 2013).

Multitask Deep Q-Network
The goal of the policy model is to interact with the human user, choosing an action at each turn to maximize future rewards. We define the dialogue state shared by the dialogue state tracker at the t-th turn as s_t, and the action taken by the policy model under the current policy π_θ with parameters θ at the t-th turn as a_t, where a_t ∼ π_θ(·|s_t). In an ideal dialogue environment, once the policy model emits an action a_t, the human user gives explicit feedback, such as a normal response or feedback on whether the dialogue is successful, which is converted into a reward signal r_t delivered to the policy model immediately; the policy model then transitions to the next state s_{t+1}. The reward r_t is composed of two parts:

r_t = r_t^turn + r_t^succ    (3)

where r_t^turn is the turn penalty reward and r_t^succ is the dialogue success reward. Typically, r_t^turn is fixed for each turn as a negative constant R_turn, while r_t^succ equals a predefined positive constant R_succ only when the dialogue terminates with successful user feedback, and zero otherwise.
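The two-part reward can be sketched as a per-turn computation; the concrete default values R_turn = −1 and R_succ = 30 match the experimental setup in section 5, and the parameter names are illustrative:

```python
def turn_reward(is_terminal, user_reported_success, r_turn=-1.0, r_succ=30.0):
    """r_t = r_t^turn + r_t^succ: a fixed per-turn penalty, plus the success
    reward only at a terminal turn that received positive user feedback."""
    r_succ_t = r_succ if (is_terminal and user_reported_success) else 0.0
    return r_turn + r_succ_t
```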
In the DQN algorithm, all transitions (s_t, a_t, r_t, s_{t+1}) are stored in a replay memory D, and the objective is to minimize the mean squared error between the Q-network Q(s, a; θ) and the Q-learning target Q_e. The loss function L(θ) is defined as:

L(θ) = E_{e∼D} [ (Q_e − Q(s_t, a_t; θ))^2 ]    (4)

During training, Q_e is estimated with old fixed parameters θ^− and sampled transitions e = (s_t, a_t, r_t, s_{t+1}) ∼ D:

Q_e = r_t + γ max_{a'} Q(s_{t+1}, a'; θ^−)

where γ is the discount factor.
The value Q(s, a) estimated by the original Q-learning algorithm is essentially a combination of the future turn penalty reward Q_turn(s, a) and the future dialogue success reward Q_succ(s, a). For a task-oriented dialogue system, the prediction of Q_succ(s, a) is the more important, because it reflects the possibility of the dialogue being successful. If these two rewards are estimated separately, the objective for Q_succ(s, a) can be optimized explicitly, and we gain more insight into the estimated future. We found in practice that optimizing these two objectives with MultiTask Learning (MTL) converges faster and more stably than two separate models, likely because MTL learns related tasks in parallel using shared representations, which helps each task to be learned better (Caruana, 1997). The structure of the proposed MTL-DQN is depicted in Figure 2.
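The described architecture, one shared hidden layer feeding two task-specific heads whose outputs sum to the full Q-value, can be sketched as follows. The layer sizes and single-layer heads are illustrative simplifications, not the exact MTL-DQN configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

class MTLQNet:
    """Minimal multitask Q-network sketch: one shared hidden layer and two
    heads estimating Q_turn and Q_succ separately; Q(s,a) is their sum."""

    def __init__(self, state_dim, n_actions, hidden=32):
        self.w_shared = rng.normal(0.0, 0.1, (state_dim, hidden))
        self.w_turn = rng.normal(0.0, 0.1, (hidden, n_actions))
        self.w_succ = rng.normal(0.0, 0.1, (hidden, n_actions))

    def forward(self, state):
        h = np.tanh(state @ self.w_shared)   # shared representation
        q_turn = h @ self.w_turn             # future turn-penalty reward
        q_succ = h @ self.w_succ             # future success reward
        return q_turn, q_succ

    def q_values(self, state):
        q_turn, q_succ = self.forward(state)
        return q_turn + q_succ               # Q(s,a) = Q_turn + Q_succ
```

Keeping the two heads separate is what later lets the success head alone serve as the on-line task success predictor.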

Failure Prognosis
In the proposed multitask DQN, we define the on-line task success predictor T(s_t) as:

T(s_t) = Q_succ(s_t, a_t)

where a_t is the action taken under state s_t. It is reasonable to assume that the dialogue is going to fail if T(s_t) is relatively small. Based on this task success predictor, we propose a novel student-initiated heuristic, named the Failure Prognosis based Teaching heuristic (FPT).
The key to the proposed heuristic is to define failure prognosis quantitatively. A straightforward way is to set a ratio threshold α and declare a failure prognosis when T(s_t) < αR_succ. However, this assumes that the numerical scale of Q_succ is consistent throughout the training period, which is not always the case, and the student's noisy estimates of Q_succ in the early training period would make the learning process unstable. To smooth the teaching, we use a turn-level sliding window near the current state and take its average value in place of the fixed R_succ. Thus, at the t-th turn, the failure prognosis for the student holds if and only if:

T(s_t) < α · (1/w) Σ_{i=1}^{w} T(s_{t−i})    (5)

where w is the size of the sliding window.
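A minimal sketch of the FPT trigger with the sliding-window average is shown below. The closure keeps the last w predictor values; the defaults for α and w follow Table 1, and the exact windowing details are an assumption:

```python
from collections import deque

def make_fpt_trigger(alpha=1.2, window=25):
    """Failure Prognosis based Teaching (FPT) sketch: flag a likely failure
    when the success predictor T(s_t) drops below alpha times its own
    sliding-window average over the last `window` turns."""
    history = deque(maxlen=window)

    def trigger(t_succ):
        is_failure = len(history) > 0 and \
            t_succ < alpha * (sum(history) / len(history))
        history.append(t_succ)
        return is_failure

    return trigger
```

Each dialogue turn feeds its predictor value in; the trigger returns True only once the value falls below the scaled running average.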

Quantitative Measurements for Safety and Efficiency
The performance of different teaching strategies and heuristics should be measured along both the safety and the efficiency dimension. However, the measurements along these two dimensions in previous work are subjective and error-prone (Chen et al., 2017). In particular, to assess the degree of safety of various teaching strategies and heuristics, one could only observe training curves, and for two interleaving curves it is impossible to tell which training process is safer. It is thus imperative to establish quantitative measurements for both safety and efficiency evaluation. In this paper, we propose two scalar criteria: Risk Index (RI) and Hitting Time (HT).

Risk Index
The Risk Index is a nonnegative index designed to indicate how risky the training process could be, for evaluating the safety issue over the whole on-line dialogue policy learning process. We expect the system to satisfy a quality-of-service requirement in the early training period; specifically, we hope it can keep a relatively acceptable success rate. It is straightforward to set a success rate threshold for the training process; in a real application scenario, this threshold can be obtained by an appropriate user study. If the success rate of a training process stays above this threshold all the time, we consider the training process absolutely safe, and its RI should equal zero.
On the other hand, if the success rate of a training process rises and falls and is sometimes below the threshold, the process is risky. The riskiness consists of two parts:

• Disruptiveness: sometimes the success rate during a certain period falls far below the threshold, which can be very disruptive. To quantify the disruptiveness, we consider the function

dis(t') = threshold − %succ(t')

over the training process, where %succ(t') is the success rate over a period of a certain length centered at time t'. The higher the value of dis(t'), the riskier the training process is during that period.
• Persistence: we should also take into account the duration of time spent at high risk. Let δ_risk(t) be the indicator of whether threshold ≥ %succ(t), i.e. δ_risk(t) = 1 if threshold ≥ %succ(t) and 0 otherwise. Then the persistence can be quantified as the total risky time Σ_{t=1}^{T} δ_risk(t). The longer the danger persists over the training process, the larger the persistence, and the riskier the process is.
Our Risk Index integrates these two components of riskiness:

RI = Σ_{t=1}^{T} dis(t) · δ_risk(t)

a nonnegative scalar which measures the integrated riskiness of an on-line training process of total length T. The RI has an intuitive interpretation as the area of the region below the threshold line and above the training curve. Straightforwardly, a high RI indicates poor safety.
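The Risk Index, interpreted as the area below the threshold line and above the training curve, can be computed directly from a sampled success-rate curve. A sketch, with the 65% threshold used later in the experiments as a default:

```python
import numpy as np

def risk_index(success_rates, threshold=0.65):
    """Risk Index sketch: sum of (threshold - success_rate) over all points
    where the curve is at or below the threshold; zero if the curve never
    dips below the threshold line."""
    curve = np.asarray(success_rates, dtype=float)
    dis = threshold - curve        # disruptiveness at each step
    risky = dis >= 0.0             # persistence indicator delta_risk(t)
    return float(np.sum(dis[risky]))
```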

Hitting Time
To measure efficiency, we propose the Hitting Time, which shows how fast the system learns and reaches satisfactory performance. The difficulty in designing such a criterion lies in the dramatic and undamped fluctuation of the test curves, which is inherent in the instability of the dialogue task. Therefore, many popular criteria for evaluating dynamic performance in control theory, such as "settling time" and "rise time", cannot be applied to evaluate efficiency here.
To compute the Hitting Time for a fluctuating testing curve, we first fit it to an empirical learning curve of sigmoid shape:

f(t) = a / (1 + exp(−(t − b) / c))

where the parameter a is the stationary performance, predicted as the asymptotic goal of the system, b relates to the initial performance, and c relates to the climbing speed. This empirical model forces the fitted learning curve into an "S" shape that approaches a as t grows. We then observe when the fitted learning curve hits the target performance τ; this time (measured in sessions) is the Hitting Time. It can be calculated analytically as:

HT = b − c · ln(a/τ − 1)

Ideally, the ultimate success rate a should be very close under different settings given sufficient training. However, if the success rate stays very poor during the given sessions, the fitted a will be very low, possibly even less than the target performance τ. In this situation a is meaningless and HT becomes a complex number, which indicates that the real hitting time is far larger than the given number of sessions T. We denote the HT in this case as ULT (Unacceptably Large Time).
In this way, we overcome the fluctuation and let HT tell us how much time the system takes to reach and surpass the target success rate.
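Assuming the sigmoid parameterization f(t) = a / (1 + exp(−(t − b)/c)) for the fitted learning curve, the Hitting Time has the closed form sketched below; when the fitted asymptote a does not exceed the target τ, the logarithm has a non-positive argument and we report ULT instead:

```python
import math

def hitting_time(a, b, c, tau):
    """Hitting Time sketch for a fitted sigmoid learning curve
    f(t) = a / (1 + exp(-(t - b) / c)). Solving f(HT) = tau gives
    HT = b - c * ln(a / tau - 1); if a <= tau, the curve never reaches
    the target and we return ULT (Unacceptably Large Time)."""
    if a <= tau:
        return "ULT"
    return b - c * math.log(a / tau - 1.0)
```

Plugging HT back into f recovers τ exactly, which is a quick consistency check on the fitted parameters.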

Experiments and Results
Three objectives are set for our experiments: (1) Observing the effect of multitask DQN; (2) Contrasting the performances of different teaching schemes (strategies and heuristics) under the companion teaching framework; (3) Observing the safety and efficiency issues under sparse user feedback scenarios.

Experimental Setup
Our experiments are conducted on the Dialogue State Tracking Challenge 2 (DSTC2) dataset, which covers the restaurant information domain (Henderson et al., 2014). The human user is emulated by an agenda-based user simulator with an error model (Schatzmann et al., 2007), while the human teacher is emulated by a policy model pre-trained to a success rate of about 0.78 with the multitask DQN approach without teaching. A rule-based tracker is used for dialogue state tracking (Sun et al., 2014b). The semantic parser is implemented according to the SVM-based method proposed by Henderson et al. (2012). The natural language generator is implemented and modified based on the RNNLG toolkit (Wen et al., 2015a).

Table 1: Teaching heuristics and their configurations (chosen empirically).

Heuristic   Configuration
Early       none
Rand        p_r = 0.6
SIT         t_si = 5
SUT         t_su = 10
FPT         α = 1.2, w = 25

In our experiments, all dialogues are limited to twenty turns. "Dialogue success" is judged by the user simulator according to whether all user goals are satisfied. For policy learning, we set a small per-turn penalty of one to encourage short interactions, i.e. R_turn = −1, a large dialogue success reward of thirty to reward successful interactions, i.e. R_succ = 30, and a discount factor γ of one. Table 1 summarizes the heuristics studied in our experiments, together with their configurations.

Observing the Effect of MTL-DQN
The MTL-DQN described in section 3.1 estimates Q_turn and Q_succ separately. In our experiments it was implemented with one shared hidden layer and two task-specific hidden layers, using MXNet (Chen et al., 2015b). Figure 3 shows a typical failure in dialogue policy training. The policy in the example has not been trained well: it tends to ask the user to repeat over and over again when the confidence score of the user utterance is not high, which causes the user to terminate the dialogue impatiently.

Figure 3: An example of a failed dialogue during training without teaching. The labels "Score" and "FP" represent the confidence score of the user utterance and the failure prognosis value of the current turn, respectively.

This kind of failure can be predicted and corrected in advance. By equation 5, the third turn is estimated to be a failure prognosis, which can be a sign for the teacher to intervene and correct the following actions to avoid dialogue failure. Moreover, the explicit separate estimation of Q_turn and Q_succ provides a better understanding of the state at the current turn. For example, although the first and second turns have similar Q-values (Q_turn + Q_succ), the latter turn is predicted to have fewer future turns and less chance of leading to dialogue success. See appendix A for an additional successful example.

Comparing Different Teaching Schemes
Our complete companion teaching framework allows us to teach dialogue systems with different teaching schemes, each consisting of a strategy and a heuristic. In our experiments, we compared 18 schemes formed from three teaching strategies (CA, EA and EAPC) and six teaching heuristics (Early, Rand, SIT, SUT, FPT and SUT&FPT). The SUT&FPT heuristic means the student only asks for advice when equations 2 and 5 are both satisfied. For comparison, we use No Teaching (NoTeaching) as the baseline.
To verify the effects of different companion teaching schemes, we conducted a set of experiments to examine their performance along the safety and efficiency dimensions. During training, the teacher can only teach within a limited budget of 1000 turns. All the training curves shown in this paper are moving-average curves with a window of 250 dialogues, averaged over eight runs with an acceptable standard error.

Safety Evaluation
To compare the effects of different teaching schemes along the safety dimension, we use the Risk Index (RI) of section 4.1 to quantitatively measure each training process. We set the empirical safety threshold to 65%. The results are shown in Table 2.
As the RIs imply, schemes using the EAPC strategy are much safer than those using other strategies. As for teaching heuristics, FPT, SUT and SUT&FPT are the three relatively safer heuristics when combined with different strategies. One exception is that Early teaching looks more suitable for CA; a possible explanation is that when the teacher gives critique earlier, the student minds its behavior earlier, which increases safety. Figure 4 shows the training curves of on-line policy learning under different teaching schemes.

Efficiency Evaluation

Table 3 contains the HTs of the learning processes under all 18 teaching schemes. Intuitively, each number in the table reflects the number of sessions at which the model achieves the target success rate. As the table shows, not every teaching scheme improves learning efficiency: if the teacher intervenes at an improper time, the guidance can distract or confuse the system even when it is correct. However, teaching when a potential failure exists (FPT) consistently improves learning efficiency. EAPC+SUT&FPT is the teaching scheme that leads to the most efficient learning process in our experiments. Figure 6 gives some example test curves and fitted empirical learning curves of the learning processes under EAPC with various heuristics.

Teacher's Workload
We also examine the teacher's workload under all teaching schemes, since economical use of the teaching budget is one of our goals. The total teaching budget is 1000 turns for every teaching scheme (see the abbreviations of schemes in sections 2.1 and 2.2). Figure 5 illustrates the cumulative usage of the teaching budget for the 18 teaching schemes. It shows that Early teaching is the most costly heuristic, so its teaching budget is soon used up, while SIT looks a bit lazy at the beginning and consumes the budget slowly. When the teaching strategy is EA or EAPC, FPT-based schemes do not use up the full teaching budget in our experiments. Combining SUT and FPT, the workload is relatively lighter than with the other heuristics. Through proper teaching schemes, we can thus make better use of the teaching budget and reduce the teacher's workload.

Safety and Efficiency Issues under Sparse User Feedback Scenarios
In real application scenarios, the user rarely provides feedback at the end of the dialogue, so the safety and efficiency issues are even more serious. To observe the effectiveness of different teaching schemes under sparse user feedback, we conducted experiments with the user feedback rate set empirically to 30%, using teaching schemes consisting of the EAPC strategy and different heuristics, since EAPC is much safer and more efficient than the other teaching strategies. Table 4 records the RIs and HTs of these learning processes under sparse user feedback.
We can see that when the user feedback rate drops from 100% to 30%, the RIs and HTs increase dramatically. The NoTeaching baseline is very risky and inefficient (its hitting time is not even reachable within 10,000 sessions of learning). However, with a teaching scheme such as EAPC+FPT, both safety and efficiency can be improved considerably.

Conclusions and Future Work
This paper addressed the safety and efficiency issues of sustainable on-line dialogue policy learning with different teaching schemes, which answer both "how" and "when" to teach, within a complete companion teaching framework. To evaluate the policy learning process precisely, we proposed two measurements, Risk Index (RI) and Hitting Time (HT), to quantify the degree of safety and efficiency. In particular, through multitask learning, we optimize the predicted remaining turns and the dialogue success reward explicitly, based on which we developed a novel Failure Prognosis based Teaching (FPT) heuristic to better utilize the fixed teaching budget and make teaching affordable.
Experiments showed that different teaching schemes have different effects along the safety and efficiency dimensions, and they also require different teacher workloads. Among the 18 compared teaching schemes, FPT-based heuristics combined with the EAPC strategy achieved promising performance on RI and HT while requiring a relatively light workload. This result indicates that a proper teaching scheme under the companion teaching framework can guarantee a sustainable and affordable on-line dialogue policy learning process.
There are several directions for future work. We plan to deploy the proposed framework in real-world scenarios with real human teachers, to verify the results presented in this paper and to discover further challenges of on-line dialogue system development. Furthermore, the current study focuses on dialogue success rate, which is a simplification of human satisfaction evaluation; future work should take more qualities into consideration to achieve a better user experience.