Guided Dialog Policy Learning: Reward Estimation for Multi-Domain Task-Oriented Dialog

Dialog policy decides what and how a task-oriented dialog system will respond, and plays a vital role in delivering effective conversations. Many studies apply Reinforcement Learning to learn a dialog policy with a reward function that requires elaborate design and pre-specified user goals. With the growing need to handle complex goals across multiple domains, such manually designed reward functions cannot cope with the complexity of real-world tasks. To this end, we propose Guided Dialog Policy Learning, a novel algorithm based on Adversarial Inverse Reinforcement Learning for joint reward estimation and policy optimization in multi-domain task-oriented dialog. The proposed approach estimates the reward signal and infers the user goal from the dialog sessions. The reward estimator evaluates state-action pairs so that it can guide the dialog policy at each dialog turn. Extensive experiments on a multi-domain dialog dataset show that the dialog policy guided by the learned reward function achieves remarkably higher task success than state-of-the-art baselines.


Introduction
Dialog policy, deciding the next action that the dialog agent should take at each turn, is a crucial component of a task-oriented dialog system. Among many models, Reinforcement Learning (RL) is commonly used to learn dialog policy (Fatemi et al., 2016; Peng et al., 2017; Yarats and Lewis, 2018; Lei et al., 2018; He et al., 2018), where users are modeled as a part of the environment and the policy is learned through interactions with users. Since it is too expensive to learn directly from real users, as RL requires a large number of samples to train, most existing studies use data-driven approaches to build a dialog system from conversational corpora (Zhao and Eskenazi, 2016; Dhingra et al., 2017; Shi and Yu, 2018), where a common strategy is to build a user simulator and then learn the dialog policy through simulated interactions between the agent and the simulator. A typical reward function for policy learning consists of a small negative penalty at each turn to encourage shorter sessions, and a large positive reward when the session ends successfully, i.e., when the agent completes the user goal.

Table 1: An example of the multi-domain task-oriented dialog between the user (U) and the system (S). The dialog proceeds successfully because the system informs the user that no matching hotel exists (the first turn), identifies the new user goal about parking (the second turn), and shifts the topic to the restaurant domain (the third turn), demonstrating that it well understands the user's demand.
However, specifying an effective reward function is challenging in task-oriented dialog. On the one hand, the short dialogs resulting from the negative constant rewards are not always efficient. The agent may end a session too quickly to complete the task properly. For example, it is inappropriate to book a 3-star hotel without confirming with the user at the first turn in Table 1. On the other hand, an explicit user goal is essential to evaluate task success in the reward design, but user goals are hardly available in real situations (Su et al., 2016). In addition, the user goal may change as the conversation proceeds. For instance, the user introduces a new requirement for the parking information at the second turn in Table 1.
Unlike a handcrafted reward function that only evaluates task success at the end of a session, a good reward function should be able to guide the policy dynamically to complete the task during the conversation. We refer to this as the reward sparsity issue. Furthermore, the reward function is often manually tweaked until the dialog policy exhibits the desired behavior. With the growing need for systems to handle complex tasks across multiple domains, a more sophisticated reward function has to be designed, which poses a serious challenge: manually trading off those different factors.
In this paper, we propose a novel model for learning task-oriented dialog policy. The model includes a robust dialog reward estimator based on Inverse Reinforcement Learning (IRL). The main idea is to automatically infer the reward and goal that motivates human behaviors and interactions from the real human-human dialog sessions. Different from conventional IRL that learns a reward function first and then trains the policy, we integrate Adversarial Learning (AL) into the method so that the policy and reward estimator can be learned simultaneously in an alternate way, thus improving each other during training. To deal with reward sparsity, the reward estimator evaluates the generated dialog session using state-action pairs instead of the entire session, which provides reward signals at each dialog turn and guides dialog policy learning better.
To evaluate the proposed approach, we conduct our experiments on a multi-domain, multi-intent task-oriented dialog corpus. The corpus involves large state and action spaces and multiple decisions in one turn, which makes it more challenging for the reward estimator to infer the user goal. Furthermore, we experiment with two different user simulators. The contributions of our work are threefold:
• We build a reward estimator via Inverse Reinforcement Learning (IRL) to infer an appropriate reward from multi-domain dialog sessions, in order to avoid the manual design of a reward function.
• We integrate Adversarial Learning (AL) to train the policy and estimator simultaneously, and evaluate the policy using state-action pairs to better guide dialog policy learning.
• We conduct experiments on the multi-domain, multi-intent task-oriented dialog corpus, with different types of user simulators. Results show the superiority of our model over the state-of-the-art baselines.
Related Work

Multi-Domain Dialog Policy Learning
Some recent efforts have been devoted to multi-domain task-oriented dialog systems, where users converse with the agent across multiple domains. A natural way to handle multi-domain dialog is to learn multiple independent single-domain sub-policies (Wang et al., 2014; Gašić et al., 2015; Cuayáhuitl et al., 2016). Multi-domain dialog completion has also been addressed by hierarchical RL, which decomposes the task into several sub-tasks in terms of temporal order (Peng et al., 2017) or space abstraction, but the hierarchical structure can be very complex, and constraints between different domains must be considered if an agent conveys multiple intents.

Reward Learning in Dialog Systems
Handcrafted reward functions for dialog policy learning require elaborate design. Several reward learning algorithms have been proposed to find better rewards, including supervised learning on expert dialogs (Li et al., 2014), online active learning from user feedback (Su et al., 2016), multi-objective RL to aggregate measurements of various aspects of user satisfaction (Ultes et al., 2017), etc. However, these methods still require some knowledge about user goals or annotations of dialog ratings from real users. Boularias et al. (2010) and Barahona and Cerisara (2014) learn the reward from dialogs using linear programming based on IRL, but do not scale well to real applications.
Recently, Liu and Lane (2018) use adversarial rewards as the only source of reward signal, training a Bi-LSTM as a discriminator that operates on the entire session to predict task success.

Adversarial Inverse Reinforcement Learning
IRL aims to infer the reward function R underlying expert demonstrations sampled from humans or the optimal policy π*. This is similar to the discriminator network in AL, which evaluates how realistic a sample looks. Finn et al. (2016) draw a strong connection between GANs and maximum entropy causal IRL (Ziebart et al., 2010) by replacing the estimated data density in AL with the Boltzmann distribution in IRL, i.e., p(x) ∝ exp(−E(x)). Several approaches (Ho and Ermon, 2016; Fu et al., 2018) obtain promising results on automatic reward estimation in large, high-dimensional environments by combining AL with IRL. Inspired by this, we apply AIRL to complex, multi-domain task-oriented dialog, which faces new issues such as a discrete action space and language understanding.

Guided Dialog Policy Learning
We propose Guided Dialog Policy Learning (GDPL), a flexible and practical method for joint reward learning and policy optimization in multi-domain task-oriented dialog systems.

Overview
The overview of the full model is depicted in Fig. 1. The framework consists of three modules: a multi-domain Dialog State Tracker (DST) at the dialog act level, a dialog policy module for deciding the next dialog act, and a reward estimator for policy evaluation. Specifically, given a set of collected human dialog sessions D = {τ_1, τ_2, ...}, each dialog session τ is a trajectory of state-action pairs {s^u_0, a^u_0, s_0, a_0, s^u_1, a^u_1, s_1, a_1, ...}. The user simulator μ(a^u, t^u | s^u) posts a response a^u according to the user dialog state s^u, where t^u denotes a binary terminal signal indicating whether the user wants to end the dialog session. The dialog policy π_θ(a|s) decides the action a according to the current state s and interacts with the simulator μ. During the conversation, DST records the action from one dialog party and returns the state to the other party for deciding what action to take in the next step. Then, the reward estimator f_ω(s, a) evaluates the quality of the response from the dialog policy by comparing it with sampled human dialog sessions from the corpus. The dialog policy π and the reward estimator f are MLPs parameterized by θ and ω, respectively. Note that our approach does not need any human supervision during training, and modeling a user simulator is beyond the scope of this paper.
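The interaction loop above can be sketched in a few lines of plain Python. Every class below is a hypothetical toy stand-in for the paper's modules (simulator, DST, policy, reward estimator), intended only to show how the pieces connect; it is not the authors' implementation.

```python
# Toy sketch of the GDPL interaction loop at the dialog-act level.
# All classes are hypothetical stand-ins for the paper's modules.

class ToySimulator:
    """Posts a user act; ends the dialog after a few turns."""
    def __init__(self, goal_turns=3):
        self.goal_turns = goal_turns
        self.t = 0
    def respond(self, user_state):
        self.t += 1
        terminal = self.t > self.goal_turns
        return [("restaurant", "request", "address", "?")], terminal

class ToyDST:
    """Records acts from both parties and exposes flat states."""
    def reset(self):
        self.history = []
    def update(self, act):
        self.history.append(act)
    def user_state(self):
        return tuple(map(len, self.history)) or (0,)
    def system_state(self):
        return tuple(map(len, self.history))

class ToyPolicy:
    def select_action(self, state):
        return [("restaurant", "inform", "address", "#1")]

class ToyRewardEstimator:
    def score(self, state, act):
        return 0.1 * len(act)       # stands in for f(s, a)

def collect_session(simulator, policy, dst, estimator, max_turns=40):
    """Roll out one dialog session, scoring each turn with f(s, a)."""
    trajectory = []
    dst.reset()
    for _ in range(max_turns):
        user_act, terminal = simulator.respond(dst.user_state())
        dst.update(user_act)                 # track user constraints/requests
        if terminal:
            break
        state = dst.system_state()           # s_t built from the tracked dialog
        sys_act = policy.select_action(state)        # a_t ~ pi(a|s)
        trajectory.append((state, sys_act, estimator.score(state, sys_act)))
        dst.update(sys_act)
    return trajectory
```

The key point of the sketch is that the estimator scores every turn, so the policy receives a reward signal throughout the session rather than only at the end.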
In the subsequent subsections, we will first explain the state, action, and DST used in our algorithm. Then, the algorithm is introduced at the session level, followed by its decomposition to the state-action pair level.

Multi-Domain Dialog State Tracker
A dialog state tracker keeps track of the dialog session to update the dialog state (Williams et al., 2016). It records informable slots about the constraints from users and requestable slots that indicate what users want to inquire about. DST maintains a separate belief state for each slot. Given a user action, the belief state of its slot type is updated according to its slot value (Roy et al., 2000). Action and state in our algorithm are defined as follows:
Action: Each system action a or user action a^u is a subset of the dialog act set A, as there may be multiple intents in one dialog turn. A dialog act is an abstract representation of an intention (Stolcke et al., 2000), which in the multi-domain setting can be represented as a quadruple composed of domain, intent, slot type, and slot value (e.g., [restaurant, inform, food, Italian]). In practice, dialog acts are delexicalized in the dialog policy: we replace the slot value with a count placeholder and refill it with the true value according to the entity selected from the external database, which allows the system to operate on unseen values.
State: At dialog turn t, the system state s_t = [a^u_t; a_{t-1}; b_t; q_t] consists of (I) the user action at the current turn, a^u_t; (II) the system action at the last turn, a_{t-1}; (III) all belief states b_t from DST; and (IV) embedding vectors q_t of the number of query results from the external database.
As our model works at the dialog act level, DST can be simply implemented by extracting the slots from actions.
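Such a slot-extracting DST can be sketched as a small belief update over dialog acts. The act format [domain, intent, slot, value] follows the paper's example; the dictionary layout and the `update_belief` helper are illustrative assumptions, not the paper's code.

```python
# Sketch of the act-level belief update described above: informable slots
# store the user's constraints, requestable slots flag what the user asked for.

def update_belief(belief, user_acts):
    """Update per-slot belief states from a list of user dialog acts."""
    for domain, intent, slot, value in user_acts:
        key = (domain, slot)
        if intent == "inform":
            belief.setdefault("informable", {})[key] = value
        elif intent == "request":
            belief.setdefault("requestable", set()).add(key)
    return belief

belief = {}
update_belief(belief, [("restaurant", "inform", "food", "Italian"),
                       ("restaurant", "request", "address", "?")])
```

At the dialog act level no language understanding is needed, which is why the paper can treat DST as simple slot extraction.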

Session Level Reward Estimation
Based on maximum entropy IRL (Ziebart et al., 2008), the reward estimator maximizes the log likelihood of observed human dialog sessions to infer the underlying goal:

J_f(ω) = E_{τ∼D}[log p_ω(τ)],  p_ω(τ) = exp(R_ω(τ)) / Z_ω,

where f models human dialogs as a Boltzmann distribution (Ziebart et al., 2008), R_ω stands for the return of a session, i.e., the γ-discounted cumulative rewards, and Z_ω is the corresponding partition function.
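The Boltzmann session model can be made concrete on a toy example with an enumerable set of candidate sessions (real dialog spaces are not enumerable; this only shows the shape of the objective). The function name and the three example returns are illustrative assumptions.

```python
import math

# Toy illustration of the Boltzmann session model p(tau) = exp(R(tau)) / Z
# over an enumerable set of candidate sessions.

def session_log_likelihood(returns, observed_idx):
    """log p(tau_observed) under p(tau) = exp(R(tau)) / Z."""
    z = sum(math.exp(r) for r in returns)      # partition function Z
    return returns[observed_idx] - math.log(z)

returns = [2.0, 0.0, -1.0]     # R(tau) for three candidate sessions
ll = session_log_likelihood(returns, 0)
```

Raising the return of observed human sessions raises their log likelihood, which is exactly what the estimator's objective does.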
The dialog policy is encouraged to mimic human dialog behaviors. Based on the principle of maximum entropy, it maximizes the expected entropy-regularized return E_π[R] + H(π) (Ziebart et al., 2010) by minimizing the KL-divergence between the policy distribution and the Boltzmann distribution:

J_π(θ) = −KL(π_θ(τ) || p_ω(τ)) = E_{τ∼π}[R_ω(τ) − log π_θ(τ)] − log Z_ω,

where the term log Z_ω is independent of θ, and H(·) denotes the entropy of a model. Intuitively, maximizing entropy resolves the ambiguity of language: many optimal policies can explain a set of natural dialog sessions. With the aid of the likelihood ratio trick, the gradient for the dialog policy is

∇_θ J_π(θ) = E_{τ∼π}[(R_ω(τ) − log π_θ(τ)) ∇_θ log π_θ(τ)].

In the fashion of AL, the reward estimator aims to distinguish real human sessions from sessions generated by the dialog policy. Therefore, it minimizes the KL-divergence with the empirical distribution, while maximizing the KL-divergence with the policy distribution:

J_f(ω) = −KL(p(τ) || p_ω(τ)) + KL(π_θ(τ) || p_ω(τ)).

Similarly, H(p) and H(π_θ) are independent of ω, so the gradient for the reward estimator yields

∇_ω J_f(ω) = E_{τ∼D}[∇_ω R_ω(τ)] − E_{τ∼π}[∇_ω R_ω(τ)].
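The reward-estimator gradient has a simple interpretation: push up the reward of human state-action pairs and push down that of policy-generated pairs. For a linear reward f(s, a) = w · φ(s, a) this is just a difference of feature expectations. The featurizer φ and all example values below are hypothetical.

```python
# Sketch of the reward-estimator gradient: grad_w J_f = E_human[phi] - E_policy[phi]
# for a linear reward f(s, a) = w . phi(s, a).

def reward_gradient(w, human_feats, policy_feats):
    """Feature-expectation difference; for a linear reward it is independent of w."""
    dim = len(w)
    grad = [0.0] * dim
    for feats, sign in ((human_feats, +1.0), (policy_feats, -1.0)):
        for phi in feats:
            for i in range(dim):
                grad[i] += sign * phi[i] / len(feats)
    return grad

w = [0.0, 0.0]
human = [[1.0, 0.0], [1.0, 1.0]]     # features of human state-action pairs
policy = [[0.0, 1.0], [0.0, 1.0]]    # features of policy-generated pairs
g = reward_gradient(w, human, policy)
```

Following this gradient increases the reward along dimensions where human behavior differs from the current policy, which is the adversarial signal described above.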

State-Action Level Reward Estimation
So far, the reward estimation uses the entire session τ, which can be very inefficient because of reward sparsity and may have high variance due to the different lengths of sessions. Here we decompose a session τ into state-action pairs (s, a) in the reward estimator to address these issues. The objectives for the dialog policy and the reward estimator then become, respectively,

J_π(θ) = E_{τ∼π}[ Σ_{t=0}^{T−1} γ^t ( f_ω(s_t, a_t) − log π_θ(a_t|s_t) ) ],   (1)
J_f(ω) = E_{τ∼D}[ Σ_{t=0}^{T−1} γ^t f_ω(s_t, a_t) ] − E_{τ∼π}[ Σ_{t=0}^{T−1} γ^t f_ω(s_t, a_t) ],   (2)

where T is the number of dialog turns. Since the reward estimator evaluates a state-action pair, it can guide the dialog policy at each dialog turn with the predicted reward r̂_ω(s, a) = f_ω(s, a) − log π_θ(a|s). Moreover, the reward estimator f_ω can be decomposed into a reward approximator g_ω and a shaping term h_ω following (Fu et al., 2018) to recover an interpretable and robust reward from real human sessions. Formally,

f_ω(s_t, a_t, s_{t+1}) = g_ω(s_t, a_t) + γ h_ω(s_{t+1}) − h_ω(s_t),

where we replace the state-action pair (s_t, a_t) with the state-action-state triple (s_t, a_t, s_{t+1}) as the input of the reward estimator. Note that, different from the objective in (Fu et al., 2018) that learns a discriminator in the form D_ω(s, a) = p_ω(s, a) / (p_ω(s, a) + π(a|s)), GDPL directly optimizes f_ω, which avoids the unstable or vanishing gradient issue of the vanilla GAN (Arjovsky et al., 2017).
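The AIRL-style decomposition and the shaped per-turn reward can be illustrated with lookup tables standing in for the learned networks g and h. The table contents, state names, and probability value below are hypothetical.

```python
import math

# Sketch of the decomposition f(s, a, s') = g(s, a) + gamma * h(s') - h(s)
# and the per-turn policy reward r_hat = f - log pi(a|s).
# g and h are hypothetical lookup tables standing in for learned MLPs.

def estimator_f(g, h, s, a, s_next, gamma=0.99):
    return g[(s, a)] + gamma * h[s_next] - h[s]

def shaped_reward(g, h, s, a, s_next, pi_prob, gamma=0.99):
    """r_hat(s, a) = f(s, a, s') - log pi(a|s)."""
    return estimator_f(g, h, s, a, s_next, gamma) - math.log(pi_prob)

g = {("s0", "inform"): 1.0}
h = {"s0": 0.5, "s1": 1.0}
r = shaped_reward(g, h, "s0", "inform", "s1", pi_prob=0.5)
```

The −log π(a|s) term is the entropy bonus from Eq. (1): less probable actions receive a larger bonus, which keeps the policy from collapsing prematurely.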
In practice, we apply Proximal Policy Optimization (PPO) (Schulman et al., 2017), a simple and stable policy-based RL algorithm that uses a constant clipping mechanism as a soft constraint, for dialog policy optimization:

J_π(θ) = E_t[ min( β_t Â_t, clip(β_t, 1 − ε, 1 + ε) Â_t ) ],   (3)
J_V(θ) = −E_t[ ( V_θ(s_t) − (r̂_t + γ V_θ(s_{t+1})) )^2 ],   (4)
Â_t = Σ_{l≥0} (γλ)^l δ_{t+l},  δ_t = r̂_t + γ V_θ(s_{t+1}) − V_θ(s_t),

where V_θ is the approximate value function, β_t = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) is the ratio of the probability under the new and old policies, Â is the estimated advantage, δ is the TD residual, and λ and ε are hyper-parameters. In summary, a brief script of the GDPL algorithm is shown in Algorithm 1.
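The two PPO pieces named above, generalized advantage estimation from TD residuals and the clipped surrogate, can be sketched in plain Python. This is a generic toy, not the paper's implementation; the function names are our own.

```python
# Sketch of the PPO pieces: GAE advantages from TD residuals and the
# clipped surrogate objective for one state-action pair.

def gae(rewards, values, gamma=0.99, lam=0.95):
    """A_t = sum_l (gamma*lam)^l * delta_{t+l}, delta_t = r_t + gamma*V_{t+1} - V_t."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        v_next = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * v_next - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def clipped_objective(ratio, advantage, eps=0.2):
    """min(beta*A, clip(beta, 1-eps, 1+eps)*A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

The clip keeps the new policy close to the old one: a large ratio cannot inflate a positive advantage, and a small ratio cannot hide a negative one.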

Data and Simulators
We use MultiWOZ, a multi-domain, multi-intent task-oriented dialog corpus that contains 7 domains, 13 intents, 25 slot types, 10,483 dialog sessions, and 71,544 dialog turns, in our experiments. Among all the sessions, 1,000 each are used for validation and test. During the data collection process, a user is asked to follow a pre-specified user goal but is encouraged to change the goal during the session, and the changed goal is also stored, so the collected dialogs are much closer to reality. The corpus also provides the ontology that defines all the entity attributes for the external database.
We apply two user simulators as the interaction environment for the agent. One is the agenda-based user simulator (Schatzmann et al., 2007), which uses heuristics, and the other is a data-driven neural model, namely the Variational Hierarchical User Simulator (VHUS) derived from Gür et al. (2018). Both simulators initialize a user goal when the dialog starts, provide the agent with a simulated user response at each dialog turn, and work at the dialog act level. Since the original corpus only annotates the dialog acts on the system side, we use the annotations on the user side from ConvLab (Lee et al., 2019) to implement the two simulators.

Evaluation Metrics
Evaluation of a task-oriented dialog mainly consists of the cost (dialog turns) and the task success (inform F1 & match rate). Inform F1 and match rate are defined as follows.
Inform F1: This evaluates whether all the requested information (e.g., the address or phone number of a hotel) has been informed. We compute the F1 score so that a policy that greedily answers all the attributes of an entity only gets a high recall but a low precision.
Match rate: This evaluates whether the booked entities match all the indicated constraints (e.g., Japanese food in the center of the city) for all domains. If the agent fails to book an entity in a domain, it obtains a 0 score for that domain. The metric ranges from 0 to 1 for each domain, and the average over all domains is the score of a session.
Finally, a dialog is considered successful only if all the information is provided (i.e., inform recall = 1) and the entities are correctly booked (i.e., match rate = 1). Dialog success is either 0 or 1 for each session.
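The three metrics can be sketched as small functions over slot sets. Representing requested/informed slots as sets of (domain, slot) pairs and booking outcomes as a per-domain dictionary is our own simplification of the paper's evaluation.

```python
# Sketch of the evaluation metrics defined above.

def inform_f1(informed, requested):
    """Precision, recall, F1 over sets of (domain, slot) pairs."""
    tp = len(informed & requested)
    if tp == 0:
        return 0.0, 0.0, 0.0
    precision = tp / len(informed)
    recall = tp / len(requested)
    return precision, recall, 2 * precision * recall / (precision + recall)

def match_rate(booked_ok):
    """booked_ok: {domain: True/False}; average over domains."""
    return sum(booked_ok.values()) / len(booked_ok)

def task_success(inform_recall, match):
    """Success requires inform recall = 1 and match rate = 1."""
    return int(inform_recall == 1.0 and match == 1.0)
```

Note how a greedy policy that informs every attribute keeps recall at 1 but loses precision, exactly the behavior the F1 score is meant to penalize.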

Implementation Details
Both the dialog policy π(a|s) and the value function V(s) are implemented as two-hidden-layer MLPs. The reward estimator f(s, a) is split into two networks, g(s, a) and h(s), according to the proposed algorithm, each a one-hidden-layer MLP. All MLPs use ReLU activations. We use Adam as the optimization algorithm. The hyper-parameters of GDPL used in our experiments are shown in Table 2.
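The network shape described above (stacked linear layers with ReLU between hidden layers) can be sketched without any ML framework. The toy weights in the test are hypothetical; real training would of course use a deep-learning library with Adam.

```python
# Sketch of a small MLP forward pass in plain Python: linear layers with
# ReLU between hidden layers and a linear output, as described above.

def relu(xs):
    return [max(0.0, x) for x in xs]

def linear(w, b, xs):
    """One linear layer: w is a list of weight rows, b the bias vector."""
    return [sum(wi * x for wi, x in zip(row, xs)) + bi
            for row, bi in zip(w, b)]

def mlp(x, layers):
    """Apply (W, b) pairs; ReLU between hidden layers, linear output."""
    for i, (w, b) in enumerate(layers):
        x = linear(w, b, x)
        if i < len(layers) - 1:
            x = relu(x)
    return x
```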

Baselines
First of all, we introduce three baselines that use handcrafted reward functions. Following (Peng et al., 2017), the agent receives a positive reward of 2L for success at the end of each dialog, or a negative reward of −L for failure, where L is the maximum number of turns in each dialog and is set to 40 in our experiments. Furthermore, the agent receives a reward of −1 at each turn so that a shorter dialog is encouraged.
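With L = 40 as in our experiments, the handcrafted baseline reward can be written down directly. The function name is ours; the reward scheme is exactly the one described above.

```python
# Sketch of the handcrafted baseline reward: -1 per turn, plus +2L on
# success or -L on failure at the end of the dialog (L = 40 here).

def handcrafted_reward(dialog_ended, success, L=40):
    reward = -1.0                      # per-turn penalty for shorter dialogs
    if dialog_ended:
        reward += 2 * L if success else -L
    return reward
```

This is the sparse signal the learned reward estimator is meant to replace: it says nothing about intermediate turns beyond the constant penalty.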

GP-MBCM (Gašić et al., 2015): Multi-domain
Bayesian Committee Machine for dialog management based on Gaussian processes, which decomposes the dialog policy into several domain-specific policies.
ACER (Wang et al., 2017): Actor-Critic RL policy with Experience Replay, a sample efficient learning algorithm that has low variance and scales well with large discrete action spaces.
PPO (Schulman et al., 2017): The same as the dialog policy in GDPL. Then, we also compare with another strong baseline that involves reward learning.
ALDM (Liu and Lane, 2018): Adversarial Learning Dialog Model that learns dialog rewards with a Bi-LSTM encoding the dialog sequence as the discriminator to predict the task success. The reward is only estimated at the end of the session and is further used to optimize the dialog policy.
For a fair comparison, each method is pre-trained for 5 epochs via simple imitation learning on the state-action pairs.

Main Results
The performance of each approach that interacts with the agenda-based user simulator is shown in Table 3. GDPL achieves extremely high task success on account of its substantial improvement in inform F1 and match rate over the baselines. Since the reward estimator of GDPL evaluates state-action pairs, it can always guide the dialog policy during the conversation, thus leading the dialog policy to a successful strategy; this also indirectly demonstrates that the reward estimator has learned a reasonable reward at each dialog turn. Surprisingly, GDPL even outperforms humans in completing the task, and its average number of dialog turns is close to that of humans, though GDPL is inferior in terms of match rate. Humans almost always manage to make a reservation in each session, which contributes to high task success. However, it is also interesting that humans have low inform F1, which may explain why the task is not always completed successfully. In fact, human dialogs have high recall (86.75%) but low precision (54.43%) when answering the requested information. This is possibly because, during data collection, human users forget to ask for all the required information of the task, as reported in (Su et al., 2016).
ACER and PPO obtain high performance in inform F1 and match rate as well. However, they perform poorly on overall task success, even though they are provided with a designed reward that already knows the real user goals. This is because they only receive the reward about success at the last turn and fail to understand what the user needs or to detect changes of user goals.

Table 4: KL-divergence between different dialog policies and the human dialogs, KL(π_turns || p_turns), where π_turns denotes the discrete distribution over the number of dialog turns of simulated sessions between the policy π and the agenda-based user simulator, and p_turns is that of the real human-human dialogs.

ALDM obtains a limited gain on task success by encoding the entire session in its reward estimator. This demonstrates that learning effective rewards can help the policy capture user intent shifts, but the reward sparsity issue remains unsolved, which may explain why the gain is limited and why ALDM even produces longer dialogs than the others. In conclusion, the dialog policy benefits from the guidance of the reward estimator at each dialog turn. Moreover, GDPL can establish an efficient dialog thanks to the learned rewards that infer human behaviors. Table 4 shows that GDPL has the smallest KL-divergence to the human distribution over the number of dialog turns among all methods, which implies that GDPL behaves most like a human. All the approaches generate many more short dialogs (fewer than 3 turns) than humans, but GDPL generates far fewer long dialogs (more than 11 turns) than the other baselines except GP-MBCM. Most of the long dialog sessions fail to reach task success.
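The turn-distribution comparison in Table 4 amounts to a KL-divergence between two discrete histograms over dialog-turn counts. The add-one smoothing and bin cap below are our own assumptions; the paper does not state how empty bins are handled.

```python
import math

# Sketch of KL(pi_turns || p_turns) between discrete distributions over
# dialog-turn counts, with add-one smoothing (a hypothetical choice).

def kl_divergence(p, q):
    """KL(p || q) for two aligned discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def turn_distribution(turn_counts, max_turns=20):
    bins = [1.0] * (max_turns + 1)      # add-one smoothing over turn bins
    for t in turn_counts:
        bins[min(t, max_turns)] += 1.0
    total = sum(bins)
    return [b / total for b in bins]

policy_turns = [4, 5, 5, 6, 7]          # hypothetical simulated sessions
human_turns = [5, 6, 6, 7, 8]           # hypothetical human sessions
kl = kl_divergence(turn_distribution(policy_turns), turn_distribution(human_turns))
```

A smaller value means the policy's dialog lengths look more like human ones, which is how Table 4 ranks the methods.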
We also observe that GP-MBCM tries to provide many dialog acts to avoid the negative penalty at each turn, which results in a very low inform F1 and short dialogs. However, as explained in the introduction, a shorter dialog is not always better: the dialogs generated by GP-MBCM are too short to complete the task successfully. GP-MBCM is a typical case of focusing too much on the cost of the dialog, due to the handcrafted reward function, while failing to realize the true target of helping users accomplish their goals.

Ablation Study
Ablation results are shown in Table 3. GDPL-sess sums up all the per-turn rewards and delivers them only at the last turn, giving no reward before the dialog terminates, while GDPL-discr uses the discriminator form of (Fu et al., 2018) in the reward estimator. GDPL performs better than GDPL-sess on task success and is comparable regarding dialog turns, so we conclude that GDPL does benefit from the guidance of the reward estimator at each dialog turn and well addresses the reward sparsity issue. GDPL also outperforms GDPL-discr, which means that directly optimizing f_ω improves the stability of AL.

Interaction with Neural Simulator
When the agent interacts with VHUS, which often gives unreasonable responses, it is more laborious for the dialog policy to learn a proper strategy. All the methods suffer a significant performance drop when interacting with VHUS. ALDM even performs worse than ACER and PPO. In comparison, GDPL remains comparable with ACER and PPO, obtains a better match rate, and even achieves higher task success. This indicates that GDPL has learned a more robust reward function than ALDM. Compared with the other approaches, GDPL is also more scalable to the number of domains and achieves the best performance on all metrics. PPO suffers from the increasing number of domains and shows remarkable drops on all metrics. This demonstrates the limited capability of the handcrafted reward function to handle complex tasks across multiple domains.

Goal across Multiple Domains
ALDM also has a serious performance degradation with 2 domains, but it is interesting to find that ALDM performs much better with 3 domains than with 2 domains. We further observe that ALDM performs well on the taxi domain, which mostly appears in dialogs with 3 domains. The taxi domain has the fewest slots for constraints and requests, which makes it easier to learn a reward for that domain, thus leading ALDM to a local optimum. In general, our reward estimator has higher effectiveness and scalability.

Human Evaluation
For human evaluation, we hire Amazon Mechanical Turkers to state their preferences between GDPL and other methods. Because all the policies work at the dialog act level, we generate the texts from dialog acts using hand-crafted templates to make the dialogs readable.

Table 7: Return distribution of GDPL on each metric. The first row counts the dialog sessions that get the full score of the corresponding metric, and the results of the rest of the sessions are included in the second row.

Given a certain user goal, Turkers first read two simulated dialog sessions: one from the interaction between GDPL and the agenda-based user simulator, the other from another baseline with the same simulator. Then they are asked to judge which dialog is better (win, draw, or lose) according to different subjective assessments. In addition to Task Success, we examine two further measures concerning Dialog Cost in the human evaluation: Efficiency, such as dialog turn cost or response delay, and Quality, such as redundant information or inappropriate replies (Walker et al., 1997). Since the text is generated by templates for all policies, we do not evaluate language generation here (including grammar, diversity, etc.). We randomly sample 300 user goals from the test set, 100 for each baseline, and each session pair is evaluated by 3 Turkers. Table 6 presents the results of the human evaluation. GDPL outperforms the three baselines significantly in all aspects (sign test, p-value < 0.01), except for quality compared with ACER. Among all the baselines, GDPL obtains the most preference against PPO. Note that the difference between PPO and GDPL lies only in the reward signal. This again demonstrates the advantage of reward learning in GDPL over the handcrafted reward function. The agreement between the objective ratings in Table 3 and the human preferences here also indicates that the automatic metrics used in our experiments are reliable in reflecting user satisfaction to some extent.

Reward Evaluation
To provide an insight into the learned reward function itself, Table 7 provides a quantitative evaluation of the learned rewards by showing the distribution of the return R = Σ_t γ^t r_t according to each metric. It should be noted that some sessions have no inform F1 score because the user does not request any information, and similarly for match rate. It can be observed that the learned reward function has good interpretability: the reward is positive when the dialog gets a full score on a metric, and negative otherwise.

Fig. 3 gives an illustration of the learned reward at each dialog turn as a qualitative evaluation, on the following sample dialog:

U: I'm looking for a cheap Italian restaurant in the center of the city.
S: There are 3 restaurants available for you.
U: Can I get the address of one?
S: The first result is Pizza Hut City Centre, address is Regent Street City Centre.
U: Sounds good. Can you book me a table for 2 at 11:45 on Wednesday?
S: You're all set. Your reference number is 7GAWK763.
U: I also need a train on Thursday.
S: Where are you departing from and heading for?
U: From Bishops Stortford to Cambridge.
S: Is there a specific time you would like to travel at?
U: I would prefer to leave after 21:00.
S: There is a train leaving at 21:29. Would you like me to make a reservation?
U: Yes, I need the train ID, please.
S: The ID is TR6336. Is there any other help you need?
U: That will be all. Have a good day! Bye.
S: Glad to be of help. Good day to you too.

In the beginning, the agent is unaware of the user goal and thus starts with a low reward. As the dialog proceeds, the agent collects enough information from the user, then books the restaurant successfully, and the reward increases remarkably at the third turn. The reward continues to grow stably after the topic shifts to the train domain. Again, the agent offers the correct train ID given sufficient information. Since the user has been informed of all the requested information, and the restaurant and train are both booked successfully, the user leaves the session satisfied at last, and the reward rises to the top as well. In brief, the learned reward can well reflect the current state of the dialog. It is also noticeable that the dialog policy manages to express multiple intents during the session.
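The return used in Table 7 is the standard discounted sum of the per-turn rewards; a one-line sketch makes it concrete.

```python
# Sketch of the discounted return R = sum_t gamma^t * r_t over a session's
# per-turn rewards, as used for the return distribution in Table 7.

def discounted_return(rewards, gamma=0.99):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```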

Discussion
In this paper, we propose a guided policy learning method for joint reward estimation and policy optimization in multi-domain task-oriented dialog. The method is based on Adversarial Inverse Reinforcement Learning. Extensive experiments demonstrate the effectiveness of our proposed approach, which achieves higher task success and better user satisfaction than state-of-the-art baselines. (Refer to the appendix for the dialog acts.)
Though the action space A of the dialog policy is defined as the set of all dialog acts, it should be noted that GDPL can be equipped with NLU modules that identify the dialog acts expressed in utterances, and with NLG modules that generate utterances from dialog acts. In this way, the framework can be constructed in an end-to-end scenario.
The agenda-based user simulator is powerful in providing simulated interactions for dialog policy learning; however, it needs careful design and lacks generalization. While training a neural user simulator is quite challenging due to the high diversity of user modeling and the difficulty of defining a proper reward function, GDPL may offer a solution for multi-agent dialog policy learning, where the user is regarded as another agent and trained together with the system agent. We leave this as future work.