Autonomous Sub-domain Modeling for Dialogue Policy with Hierarchical Deep Reinforcement Learning

Solving composite tasks, which consist of several inherent sub-tasks, remains a challenge in dialogue research. Current studies tackle this issue by manually decomposing composite tasks into several sub-domains, which requires considerable human effort. This paper proposes a dialogue framework that autonomously models meaningful sub-domains and learns the policy over them. Our experiments show that our framework outperforms the baseline without sub-domains by 11% in terms of success rate, and is competitive with a framework that uses manually defined sub-domains.


Introduction
Modeling a composite dialogue (Peng et al., 2017), which consists of several inherent sub-tasks, is in high demand due to the complexity of human conversation. For instance, a composite dialogue of making a hotel reservation involves several sub-tasks, such as looking for a hotel that meets the user's constraints, booking the room, and paying for the room. The completion of a composite dialogue requires the fulfillment of all involved sub-tasks. In this paper, we focus on the development of a dialogue agent that can discover inherent sub-tasks autonomously from a composite domain, learn a policy to fulfill each sub-task, and learn a policy among these sub-tasks to solve the composite task. Composite dialogues are different from multi-domain dialogues. In multi-domain dialogue systems (Cuayáhuitl et al., 2016; Gasic et al., 2016), each dialogue typically involves one domain, and consequently, its fulfillment does not require a policy across domains.
To develop a dialogue agent that can handle a composite task, standard flat reinforcement learning (RL), which is often used for dialogues with a simple task (Young et al., 2013; Gašić and Young, 2014; Williams et al., 2017; Casanueva et al., 2017), might be inappropriate. Flat RL methods, such as DQN (Mnih et al., 2015), can suffer from the curse of dimensionality, that is, the number of parameters to be learned grows exponentially with the size of any compact encoding of the system state. A composite task has a larger state space and action set, longer trajectories, and sparser rewards than a simple task, so flat RL is unable to learn reliable value functions (Kulkarni et al., 2016) for it. Hierarchical reinforcement learning (HRL) (Dietterich, 2000; Parr and Russell, 1997) is a technique to model complex dialogues (Cuayáhuitl, 2009). Peng et al. (2017), among others, used the options framework (Sutton et al., 1999) to solve the above problems in composite dialogues and showed its superiority over flat RL. In their work, however, each option (i.e. sub-task) and its properties (e.g. starting and terminating conditions, and valid action set) had to be manually defined. Such handcrafted options ease policy learning in a composite task, but require considerable human effort.
To solve the above problems, we propose to model sub-domains autonomously without any human intervention. The modeled sub-domains imitate the intentions to fulfill sub-tasks in a dialogue, and consequently can be reused by similar yet different domains. Challenges in achieving such autonomous sub-domain modeling include (i) how to discover meaningful sub-domains and their properties (i.e. starting conditions, terminating conditions, and the policies), and (ii) how to achieve a coherent interaction among these sub-domains so that the dialogue agent can accomplish a dialogue goal efficiently. To tackle these challenges, we propose a unified framework that integrates option discovery (Bacon et al., 2017; Machado et al., 2017) with HRL to learn the optimal policies over options. With an evaluation involving a task of reserving a hotel room, we confirm that our framework achieves a significant improvement over flat RL by 11% in terms of success rate, and is competitive with the framework with manually defined options.

Hierarchical Policy Management
A composite task can be decomposed into a sequence of sub-domains, which are also called options. The composite task is accomplished when all of these sub-domains are fulfilled. Following the options framework (Sutton et al., 1999), our dialogue agent handles the composite task with two levels of policies in a hierarchical structure, as shown in Figure 1. In this hierarchical policy framework, S denotes the dialogue state space, Ω the option space, and A the action set. For a dialogue state s ∈ S, the top-level policy π_Ω determines which option ω ∈ Ω should be chosen. Then, the policy π_ω determines which primitive action a ∈ A should be chosen in option ω for s. As shown by the example in Figure 2, a primitive action is an action lasting for one time step, while an option is an action lasting several time steps. For each s, a dialogue action, which is a primitive action, is returned to the user. The dialogue system then receives an extrinsic reward r_e and a new belief state s'. An optimal policy π* maximizes the expected discounted return $G_t = \mathbb{E}_{\pi, P}\left[\sum_{k=0}^{\infty} \gamma^k r_{e,t+k+1} \mid s_t\right]$ at every time step t, where P is a transition probability kernel, γ ∈ [0, 1] is a discount factor, and r_{e,t} is the extrinsic reward obtained at step t.

Figure 2 shows an example of the execution of our hierarchical dialogue policy in a dialogue domain about hotel room reservation. This domain comprises two sub-domains, i.e., searching for a hotel and booking a hotel room. In this example, we assume that the dialogue system has prior knowledge regarding these sub-domains. In this paper, we propose a dialogue framework that can autonomously discover such sub-domains.
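As a concrete illustration of this two-level control flow, the following minimal Python sketch shows how a top-level policy, an intra-option policy, and a termination function interact during one dialogue episode. The callables pi_omega, pi_w, beta, and the env object are hypothetical stand-ins, not part of the actual implementation.

```python
def run_episode(env, pi_omega, pi_w, beta, max_turns=30):
    """Execute one dialogue with a two-level hierarchical policy.

    pi_omega(s)    -> option w          (top-level policy over options)
    pi_w(s, w)     -> primitive action  (intra-option / low-level policy)
    beta(s, w)     -> True/False        (termination function of option w)
    """
    s = env.reset()
    ret, t = 0.0, 0
    while t < max_turns:
        w = pi_omega(s)                    # top level picks an option (sub-domain)
        while t < max_turns:
            a = pi_w(s, w)                 # low level picks a primitive dialogue action
            s, r_e, done = env.step(a)     # extrinsic reward r_e from the user / simulator
            ret += r_e
            t += 1
            if done:
                return ret                 # dialogue ended (success or failure)
            if beta(s, w):                 # option terminates: control returns to the top level
                break
    return ret
```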

Option-Critic Architecture
The option-critic (OC) architecture (Bacon et al., 2017) is a gradient-based approach for simultaneously learning intra-option policies π_ω and termination functions β_ω. It learns options gradually from its interactions with the environment. It uses the option value function Q_Ω(s, ω), defined as
$$Q_\Omega(s, \omega) = \sum_{a} \pi_\omega(a \mid s)\, Q_U(s, \omega, a),$$
where $Q_U(s, \omega, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, U(\omega, s')$ is the value of executing an action in the context of a state-option pair, and $U(\omega, s') = (1 - \beta_\omega(s'))\, Q_\Omega(s', \omega) + \beta_\omega(s') \max_{\omega'} Q_\Omega(s', \omega')$ is the utility from s' onwards, given that we arrive in s' using ω. We parameterize π_ω by θ and β_ω by ϑ. The learning algorithm of OC involves two steps:
• option evaluation: updating Q_Ω and Q_U with temporal difference errors; and
• option improvement: updating θ with ∂Q_Ω/∂θ and ϑ with ∂Q_Ω/∂ϑ.
To obtain the policy π_Ω over options, we combine OC with intra-option Q-learning (Sutton et al., 1999). Hereinafter, this combination is denoted as HRL-OC.
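To make these quantities concrete, here is a minimal numpy sketch of how Q_Ω and U can be computed from tabular estimates of Q_U, the intra-option policies, and the termination functions. The array shapes and variable names are our own assumptions for illustration, not the paper's implementation.

```python
import numpy as np

# Hypothetical tabular estimates:
#   q_u[s, w, a]  ~ Q_U(s, w, a), value of action a for the state-option pair (s, w)
#   pi_w[s, w, a] ~ intra-option policy pi_w(a | s)
#   beta[s, w]    ~ termination probability beta_w(s)
n_states, n_options, n_actions = 50, 3, 44
q_u  = np.random.rand(n_states, n_options, n_actions)
pi_w = np.random.dirichlet(np.ones(n_actions), size=(n_states, n_options))
beta = np.random.rand(n_states, n_options)

# Q_Omega(s, w) = sum_a pi_w(a|s) * Q_U(s, w, a)
q_omega = np.einsum('swa,swa->sw', pi_w, q_u)

# U(w, s') = (1 - beta_w(s')) * Q_Omega(s', w) + beta_w(s') * max_w' Q_Omega(s', w')
u = (1.0 - beta) * q_omega + beta * q_omega.max(axis=1, keepdims=True)
```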
HRL-OC optimizes the options and their policies to maximize the cumulative extrinsic reward. It focuses less on discovering meaningful options (Bacon et al., 2017), which may result in sub-domains that do not reflect the natural structure of a successful conversation. To tackle this issue, we use proto-value functions (PVFs), which are capable of capturing the geometry of the state space, to discover meaningful sub-domains.

Proto-Value Functions as Options
Figure 2: An example of the execution of our hierarchical dialogue policy in the hotel reservation domain. At time t = 0 and t = k, given the belief state s_t, the top-level policy π_Ω takes options ω_0 and ω_1, respectively. ω_0 lasts for k turns until its intra-option policy takes the terminate action, while ω_1 lasts for n − k turns. The low-level options correspond to hotel_searching and hotel_booking.

Proto-value functions (PVFs) are learned representations that approximate the state-value function in RL (Mahadevan, 2007). Machado et al. (2017) further demonstrated that PVFs implicitly define options. PVF-based option discovery extracts options from the topological structure of the state space and is capable of providing dense intrinsic rewards for each option. The discovery process is given below.
Given a set of sampled state transitions, we construct an adjacency matrix W between belief states using a Gaussian kernel. Then, we apply eigendecomposition to the combinatorial graph Laplacian of W. Each eigenvector (i.e. PVF) e_ω corresponds to an option with intrinsic reward function $r_i^\omega(s, s') = e_\omega[s'] - e_\omega[s]$ for a state transition from s to s'. Since our dialogue system has continuous belief states, we interpolate the values of the eigenvectors to novel states using the Nyström approximation (Mahadevan, 2007). The number of generated intrinsic reward functions is equal to the number of dialogue states in W, but we use only those from the eigenvectors with the smallest eigenvalues.
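The discovery pipeline can be summarized with the sketch below: a Gaussian-kernel adjacency matrix over sampled belief states, the combinatorial Laplacian and its eigendecomposition, and eigenvector-difference intrinsic rewards. This is a minimal illustration under our own assumptions (kernel bandwidth, function and variable names); it covers only the sampled states, whereas novel states would require the Nyström interpolation mentioned above.

```python
import numpy as np

def discover_pvf_options(states, sigma=1.0, n_options=3):
    """Build PVFs from sampled belief states and derive intrinsic rewards.

    states: (N, d) array of sampled belief states (e.g. N = 1,000 unique states).
    Returns the n_options eigenvectors with the smallest non-trivial eigenvalues
    and a function giving the intrinsic reward of a transition under option k.
    """
    # Adjacency matrix W with a Gaussian kernel between belief states.
    d2 = ((states[:, None, :] - states[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(w, 0.0)

    # Combinatorial graph Laplacian L = D - W and its eigendecomposition.
    lap = np.diag(w.sum(1)) - w
    eigvals, eigvecs = np.linalg.eigh(lap)        # eigenvalues in ascending order

    # Skip the trivial constant eigenvector; keep the smallest non-trivial ones (PVFs).
    pvfs = eigvecs[:, 1:1 + n_options]

    def intrinsic_reward(k, s_idx, s_next_idx):
        # Reward for moving from state s to s' under option k: r_i = e_k[s'] - e_k[s]
        return pvfs[s_next_idx, k] - pvfs[s_idx, k]

    return pvfs, intrinsic_reward
```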
An option ω, which corresponds to an eigenvector e_ω, can be interpreted as a desire to reach a belief state s that has the highest value of e_ω[s] (Machado et al., 2017). In our experiment, such a state usually represents a dialogue goal or a state where the user's inherent sub-domain changes (e.g. the user starts the booking sub-domain once she finds a hotel satisfying her requirements).

Policy Learning with Intrinsic Rewards
To realize a dialogue framework that can discover effective and meaningful sub-domains, we feed PVFs into HRL-OC and then follow HRL-OC's learning procedure. Here, PVFs act as an internal evaluator of the dialogue policy. We formulate the r(s, a) in Q_U to be $r(s, a) = \alpha r_i^\omega + (1 - \alpha) r_e$. Hereinafter, this model is denoted as HRL-OC PVF. We can regard HRL-OC as HRL-OC PVF with α = 0.
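In code form, the mixed reward used in the Q_U update looks roughly as follows (the function name and argument names are placeholders of ours).

```python
def mixed_reward(r_intrinsic, r_extrinsic, alpha=0.2):
    # alpha = 0 recovers plain HRL-OC (extrinsic reward only).
    return alpha * r_intrinsic + (1.0 - alpha) * r_extrinsic
```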
We also introduce alternative dialogue frameworks by applying the intrinsic rewards from PVFs directly to HRL algorithms. We train each policy π_ω in HRL using a specific intrinsic reward function r_i^ω. We implemented hierarchical deep Q-networks (HRL-DQN; Kulkarni et al., 2016) and policy gradient-DQN (HRL-PG DQN), i.e., REINFORCE (Williams, 1992) as the top-level policy and a DQN as the low-level policy. These frameworks assess whether using only general-purpose intrinsic rewards, which are designed for exploration, is sufficient for maximizing extrinsic rewards.

Experimental Setup
We conducted three evaluations on (i) the effectiveness of our autonomous sub-domain modeling compared to manual sub-domain modeling, (ii) the performance difference between flat RL (i.e. without modeling) and HRL with autonomous modeling, and (iii) the impact of using PVFs in discovering meaningful sub-domains.

Dialogue Domain
Following the setting of previous work, we evaluated our proposed framework on the task of reserving a hotel room, which involves three sub-domains: searching for a hotel, booking, and payment. This domain has 13 constraint slots, that is, 5 slots in the hotel searching sub-domain (price, kind, area, stars, hasparking), 5 slots in the booking sub-domain (day, hour, duration, peopleno, surname), and 3 slots in the payment sub-domain (address, cardno, surname). Dialogue management over this dialogue domain is cast as a Markov Decision Process (MDP) with the following specification (see also the configuration sketch after the list).
• State: the belief state s ∈ S with 239 dimensions, which captures the distribution over the user's intents and requestable slots.
• Action set A: 44 dialogue actions, consisting of 8 slot-independent actions and 36 slot-dependent actions.
• Reward: -1 at each turn, and 0 or 20 (failed or successful dialogue) at the end of the dialogue.
• Discount factor γ: 0.95.
• Maximum number of turns: 30.
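For reference, the same specification can be written as a configuration dictionary; the field names below are ours, not from the paper's code.

```python
HOTEL_MDP_CONFIG = {
    "state_dim": 239,        # belief state over user intents and requestable slots
    "num_actions": 44,       # 8 slot-independent + 36 slot-dependent dialogue actions
    "turn_reward": -1,       # per-turn penalty
    "success_reward": 20,    # terminal reward for a successful dialogue
    "failure_reward": 0,     # terminal reward for a failed dialogue
    "gamma": 0.95,           # discount factor
    "max_turns": 30,
}
```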

User Simulator
We used an agenda-based user simulator (Schatzmann et al., 2007) with which the belief states perfectly capture the user intent. At the start of each dialogue, the simulated user randomly sets a goal that consists of searching for a hotel and either booking it or paying for it. The user proceeds to the booking or payment sub-domain only after achieving the goal of the hotel searching sub-domain. At the beginning of each sub-domain execution, the user's goal for that sub-domain is randomly generated using the database. The agenda is populated by converting all goal constraints into inform acts, and all goal requests into request acts. For instance, inform(price=moderate) indicates a user requirement, and request(address) indicates the user asking for the address of the hotel returned by the system. Furthermore, in different dialogue episodes, the simulated user might convey its requirements (i.e. slot values) within a sub-domain to the dialogue system in different orders.
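The following minimal sketch illustrates how such an agenda could be populated from a sampled goal. The goal format, act representation, and function name are our own assumptions for illustration, not the simulator's actual code.

```python
import random

def populate_agenda(goal):
    """Convert a sampled user goal into a list of dialogue acts.

    goal = {"constraints": {"price": "moderate", ...},
            "requests":    ["address", ...]}
    """
    agenda = []
    for slot, value in goal["constraints"].items():
        agenda.append(("inform", slot, value))   # e.g. inform(price=moderate)
    for slot in goal["requests"]:
        agenda.append(("request", slot, None))   # e.g. request(address)
    random.shuffle(agenda)  # requirements may be conveyed in a different order per episode
    return agenda
```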

Dialogue Frameworks
Implementation As the benchmarks without sub-domain modeling, we used flat RL algorithms (i.e. DQN and PG with REINFORCE). For the benchmark with manual modeling, we used the framework introduced by Budzianowski et al. (2017), which utilizes hierarchical Gaussian Process RL (HRL-GP).
All deep (flat and hierarchical) RL agents consist of 2 hidden layers (150 units in layer 1, and 75 (70 for PG) units in layer 2). We used the Adam optimizer, a mini-batch size of 32, and an ε-greedy strategy for exploration. In the HRL-DQN and HRL-PG DQN agents, the top-level and low-level policies have separate policy networks, each of which has 2 hidden layers as specified above. In these agents, the low-level policies share the same policy network. During execution, we pass the information of the option taken by the top-level policy to the low-level policy network. In the HRL-OC and HRL-OC PVF agents, the policy, the critic Q_Ω, and the termination networks share the same 2 hidden layers, but each of them has its own output layer.
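A minimal PyTorch sketch of the shared network layout described above for the HRL-OC agents is given below; the layer sizes follow the text, while the class name, head names, and default dimensions are our assumptions.

```python
import torch
import torch.nn as nn

class DialoguePolicyNet(nn.Module):
    """Two hidden layers (150 and 75 units) shared by the three OC output heads."""
    def __init__(self, state_dim=239, n_actions=44, n_options=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, 150), nn.ReLU(),
            nn.Linear(150, 75), nn.ReLU(),
        )
        # Separate output heads on top of the shared body (HRL-OC / HRL-OC PVF).
        self.intra_option_logits = nn.Linear(75, n_options * n_actions)  # pi_w
        self.q_omega = nn.Linear(75, n_options)                          # critic Q_Omega
        self.termination = nn.Linear(75, n_options)                      # beta_w

    def forward(self, s):
        h = self.body(s)
        return (self.intra_option_logits(h),
                self.q_omega(h),
                torch.sigmoid(self.termination(h)))
```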
For discovering PVFs, we generated state transition samples using hand-crafted rules. We sub-sampled 1,000 unique states using trajectory sampling and built W from them.

Prior Knowledge
In the manual sub-domain modeling, the agent has the following two types of prior knowledge:
• the sub-domain information, i.e. the decomposition of the composite task into sub-domains, together with the starting and terminating conditions of each sub-domain; and
• a valid action set for each sub-domain. All sub-domains share the same 8 slot-independent actions, but each of them has its own slot-dependent actions.
To assess the impact of each type of prior knowledge, we implemented an HRL-GP framework that uses both types of knowledge and its variant HRL-GP2 that uses only the sub-domain information. Both frameworks have separate policies to handle each sub-domain, but HRL-GP2 deals with a more complex situation since it has to select an action from the union of the action sets of all sub-domains, that is, 44 dialogue actions in total. Unlike HRL-GP and HRL-GP2, our frameworks with autonomous modeling (HRL-DQN, HRL-PG DQN, HRL-OC, HRL-OC PVF) have no access to any prior knowledge. They initially perceive dialogues as a single-domain problem and attempt to discover the meaningful sub-domains.
Evaluation We trained each policy in the frameworks for 30 iterations, each of which consists of 200 episodes. At the end of each iteration, we evaluated the performance of the models using 200 episodes. The metric used for evaluation is the average success rate (SR) of dialogues.

Success Rate
The experimental results are shown in Table 1 and Figure 3. First, the flat RL framework based on DQN achieved an SR of up to 66.9%, but it was unstable. The more stable flat framework, PG, obtained an SR of 62%. Our frameworks with autonomous modeling (HRL-OC and HRL-OC PVF) significantly outperformed flat RL. However, HRL-DQN and HRL-PG DQN performed worse than flat RL. This suggests that using only intrinsic rewards from PVFs is not adequate for constructing sub-domains that are effective in accumulating extrinsic rewards.
Since HRL-OC optimizes its options for maximizing the accumulated extrinsic reward, it has a better SR than HRL-DQN and HRL-PG DQN, which did not use any extrinsic rewards. The frameworks with manual modeling, i.e. HRL-GP and HRL-GP2, reached SRs of 84.8% and 75.9%, respectively. One of the frameworks with autonomous modeling (i.e. HRL-OC) achieved an SR of up to 73.4%. Note that, in HRL-OC, all primitive actions are available in each option, which is the same setting as HRL-GP2. Although HRL-OC does not have any prior knowledge about sub-domains in a dialogue, it is competitive with the framework with strong supervision on sub-domains. This indicates that HRL-OC is able to learn effective subgoals in a composite-task dialogue.
Figure 3 shows the learning curves of the different dialogue frameworks. Figure 3(a) shows that HRL-OC and HRL-OC PVF have steeper learning curves than HRL-GP in the first 1,000 episodes, which indicates that our frameworks can shorten the learning time. Figure 3(b) shows that using 2 or 3 options is optimal; using too many options is harmful because the agent requires more episodes to learn the optimal policy over options. Figure 3(c) shows the effect of the interpolation ratio α, which combines extrinsic and intrinsic rewards, on the SR. PVFs seem ineffective with respect to SR: setting α > 0 reduces the SR of HRL-OC PVF with 3 options.

Discovered Sub-domains
According to our observation, however, HRL-OC PVF with α = 0.2 discovered more meaningful sub-domains than HRL-OC. To assess the meaningfulness of the discovered sub-domains, we examined how similar these sub-domains are to those inherent in the user's agenda. We judge the similarity using the average turn distance between the turn at which the user simulator enters a sub-domain and the turn at which the agent switches sub-domains. The ideal case is a turn distance of 1, i.e., once a user enters a sub-domain, the agent responds by switching the active option in the next turn. Table 2 shows that, compared to HRL-OC, the integration of PVFs results in sub-domains whose boundaries are similar to those of the user's sub-domains. Table 3 shows that the integration of PVFs into HRL-OC makes the agent capable of changing the active sub-domain soon after the user enters a sub-domain. This indicates that PVFs can detect interesting belief states. In our further examination, PVFs successfully discovered states that indicate the dialogue goal, sub-task switching, and requests for alternatives from the sampled transitions.
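The turn-distance metric can be computed as in the sketch below; the function name and input format are ours, assuming one aligned pair of switch-turn lists per dialogue.

```python
def average_turn_distance(dialogues):
    """Mean |turn the user enters a sub-domain - turn the agent switches options|.

    dialogues: list of (user_switch_turns, agent_switch_turns) pairs, where the
    two lists hold aligned turn indices, one per sub-domain transition.
    """
    distances = [abs(u - a)
                 for user_turns, agent_turns in dialogues
                 for u, a in zip(user_turns, agent_turns)]
    return sum(distances) / len(distances)
```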

Discussion
Our experiments show that our proposed framework outperforms the baseline and is competitive with the framework with manually defined sub-domains. Even though the experiments were conducted using a simulator, the simulated user produces dialogue behavior realistic enough for training and testing. As mentioned in Section 4.2, the simulated user specifies its requirements within a sub-domain to the dialogue system in a random order. In addition, the simulator may also leave several slot values unspecified. Such behavior simulates a situation in which a human user forgets to specify some goal constraints.
In the experiments, the simulator has a constraint: it executes the inherent sub-domains in a fixed order. The fixed order of sub-domains, i.e. hotel searching followed by either booking or payment, can still simulate real-world conversations, since the activity of reserving a hotel room is commonly accomplished in this order. In other tasks, however, a fixed order of inherent sub-domains may not simulate real conversations well. Nevertheless, even when the order of the inherent sub-domains is not fixed, we suggest that our proposed framework could still discover options that imitate the inherent sub-domains. This holds when the inherent sub-domains are executed sequentially and the environment dynamics within each inherent sub-domain are invariant to the execution order of the sub-domains. Another challenging situation is when the inherent sub-domains are executed in an interleaved manner. This simulates a scenario in which a user frequently switches the active sub-domain before the current sub-domain is fulfilled. A further investigation is required to examine the options discovered in such a situation.

Conclusion
We proposed a framework that autonomously discovers sub-domains for a composite-task dialogue. Experimental results show that our framework with autonomous modeling is competitive with the framework with manually defined sub-domains. Our analysis also showed that the integration of PVFs leads to meaningful sub-domains.
For future work, we will consider adjustments to the PVF construction, such as the distance metric between states, the construction of the adjacency matrix, and the use of the successor representation (Dayan, 1993; Barreto et al., 2017). We also plan to further examine the discovered options when the inherent sub-domains are executed in different manners and orders. Finally, it is also interesting to investigate the effectiveness of reusing the learned options in other related dialogue domains.