Subgoal Discovery for Hierarchical Dialogue Policy Learning

Developing agents to engage in complex goal-oriented dialogues is challenging partly because the main learning signals are very sparse in long conversations. In this paper, we propose a divide-and-conquer approach that discovers and exploits the hidden structure of the task to enable efficient policy learning. First, given successful example dialogues, we propose the Subgoal Discovery Network (SDN) to divide a complex goal-oriented task into a set of simpler subgoals in an unsupervised fashion. We then use these subgoals to learn a multi-level policy by hierarchical reinforcement learning. We demonstrate our method by building a dialogue agent for the composite task of travel planning. Experiments with simulated and real users show that our approach performs competitively against a state-of-the-art method that requires human-defined subgoals. Moreover, we show that the learned subgoals are often human comprehensible.


Introduction
Suppose we want to plan a trip to a distant city using a dialogue agent. The agent must make choices at each leg, e.g., whether to fly or to drive, and whether to book a hotel. Each of these steps in turn involves making a sequence of decisions all the way down to lower-level actions. For example, booking a hotel involves identifying the location, specifying the check-in date and time, and negotiating the price.
The agent's decision process above has a natural hierarchy: a top-level process selects which subgoal to complete, and a low-level process chooses primitive actions to accomplish the selected subgoal. Within the reinforcement learning (RL) paradigm, such a hierarchical decision-making process can be formulated in the options framework (Sutton et al., 1999), where subgoals with their own reward functions are used to learn policies for achieving these subgoals. These learned policies are then used as temporally extended actions, or options, for solving the entire task.
Based on the options framework, researchers have developed dialogue agents for complex tasks, such as travel planning, using hierarchical reinforcement learning (HRL) (Cuayáhuitl et al., 2010). Recently, Peng et al. (2017b) showed that the use of subgoals mitigates reward sparsity and leads to more effective exploration for dialogue policy learning. However, these subgoals need to be human-defined, which limits the applicability of the approach in practice because the domain knowledge required to properly define subgoals is often unavailable.
In this paper, we propose a simple yet effective Subgoal Discovery Network (SDN) that discovers useful subgoals automatically for an RL-based dialogue agent. The SDN takes as input a collection of successful conversations, and identifies "hub" states as subgoals. Intuitively, a hub state is a region in the agent's state space that the agent tends to visit frequently on successful paths to a goal but not on unsuccessful paths. Given the discovered subgoals, HRL can be applied to learn a hierarchical dialogue policy which consists of (1) a top-level policy that selects among subgoals, and (2) a low-level policy that chooses primitive actions to achieve selected subgoals.
We present the first study of learning dialogue agents with automatically discovered subgoals. We demonstrate the effectiveness of our approach by building a composite task-completion dialogue agent for travel planning. Experiments with both simulated and real users show that an agent learned with discovered subgoals performs competitively against an agent learned using expert-defined subgoals, and significantly outperforms an agent learned without subgoals. We also find that the subgoals discovered by SDN are often human comprehensible.

Background
A goal-oriented dialogue can be formulated as a Markov decision process, or MDP (Levin et al., 2000), in which the agent interacts with its environment over a sequence of discrete steps. At each step $t \in \{0, 1, \ldots\}$, the agent observes the current state $s_t$ of the conversation (Henderson, 2015; Mrkšić et al., 2017; Li et al., 2017), and chooses action $a_t$ according to a policy $\pi$. Here, the action may be a natural-language sentence or a speech act, among others. Then, the agent receives a numerical reward $r_t$ and switches to the next state $s_{t+1}$. The process repeats until the dialogue terminates. The agent is to learn to choose optimal actions $\{a_t\}_{t=0,1,\ldots}$ so as to maximize the total discounted reward $r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$, where $\gamma \in [0, 1]$ is a discount factor. This learning paradigm is known as reinforcement learning, or RL (Sutton and Barto, 1998).
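As a small illustration of the objective above, the following sketch computes the total discounted reward of a finite dialogue. The function name and the example rewards are hypothetical and chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward list."""
    total = 0.0
    # Accumulate backwards using G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical three-turn dialogue: two per-turn penalties, then a final reward.
example = discounted_return([-1, -1, 10], gamma=0.5)
print(example)  # -1 + 0.5*(-1) + 0.25*10 = 1.0
```

With sparse rewards, most entries of the reward list carry no task-specific signal, which is the difficulty that subgoals are meant to ease.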
When facing a complex task, it is often more efficient to divide it into multiple simpler subtasks, solve them, and combine the partial solutions into a full solution for the original task. Such an approach may be formalized as hierarchical RL (HRL) in the options framework (Sutton et al., 1999). An option can be understood as a subgoal, which consists of an initiation condition (when the subgoal can be triggered), an option policy to solve the subgoal, and a termination condition (when the subgoal is considered finished).
When subgoals are given, there exist effective RL algorithms to learn a hierarchical policy. A major open challenge is the automatic discovery of subgoals from data. Addressing this challenge is the main innovation of this work and is covered in the next section.
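The three components of an option can be sketched as a small data structure. This is only an illustrative sketch: the names `Option`, `run_option`, and the toy integer states are assumptions, not part of the options framework's formal definition.

```python
from dataclasses import dataclass
from typing import Callable, Hashable

State = Hashable
Action = Hashable


@dataclass
class Option:
    can_start: Callable[[State], bool]  # initiation condition
    policy: Callable[[State], Action]   # option policy
    is_done: Callable[[State], bool]    # termination condition


def run_option(option, state, step, max_steps=100):
    """Follow the option policy until its termination condition holds.

    `step(state, action)` is a hypothetical environment transition function.
    """
    if not option.can_start(state):
        raise ValueError("option cannot be initiated in this state")
    trace = [state]
    for _ in range(max_steps):
        if option.is_done(state):
            break
        state = step(state, option.policy(state))
        trace.append(state)
    return trace


# Toy example: the state counts filled slots; the subgoal is done at 3 slots.
book_hotel = Option(
    can_start=lambda s: s == 0,
    policy=lambda s: "request_slot",
    is_done=lambda s: s >= 3,
)
trace = run_option(book_hotel, 0, step=lambda s, a: s + 1)
print(trace)  # [0, 1, 2, 3]
```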

Subgoal Discovery for HRL
Figure 1 shows the overall workflow of our proposed method of using automatic subgoal discovery for HRL. First, a dialogue session is divided into several segments. Then, at the end of each segment (a subgoal), we provide an intrinsic or extrinsic reward for the HRL algorithm to learn a hierarchical dialogue policy. Note that only the last segment has an extrinsic reward. The details of the segmentation algorithm and of how to use subgoals for HRL are presented in Section 3.1 and Section 3.3, respectively.

[Figure 1 diagram; labels: Dialogue Session, SDN, HRL, Intrinsic Reward, Extrinsic Reward]
Figure 1: The workflow for HRL with subgoal discovery. In addition to the extrinsic reward at the end of the dialogue session, HRL also uses intrinsic rewards induced by the subgoals (or the ends of dialogue segments). Section 3.2 details the reward design for HRL.

Subgoal Discovery Network
Assume that we have collected a set of successful state trajectories of a task, as shown in Figure 2. We want to find subgoal states, such as the three red states $s_4$, $s_9$ and $s_{13}$, which form the "hubs" of these trajectories. These hub states indicate the subgoals, and thus divide a state trajectory into several segments, each for an option. For example, suppose there are three state trajectories $(s_0, s_1, s_4, s_6, s_9, s_{10}, s_{13})$, $(s_0, s_2, s_4, s_7, s_9, s_{11}, s_{13})$ and $(s_0, s_3, s_4, s_8, s_9, s_{12}, s_{13})$. All three pass through the red states $s_4$, $s_9$ and $s_{13}$, which are therefore good candidates for subgoals.
Thus, discovering subgoals by identifying hubs in state trajectories is equivalent to segmenting state trajectories into options. In this work, we formulate subgoal discovery as a state trajectory segmentation problem, and address it using the Subgoal Discovery Network (SDN), inspired by the sequence segmentation model (Wang et al., 2017).
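The hub intuition in this example can be checked with a simple counting heuristic. The sketch below is not the SDN model; it merely lists the states shared by all three example trajectories. Skipping the common initial state is an assumption of the sketch, and the function name is hypothetical.

```python
from collections import Counter

trajectories = [
    ["s0", "s1", "s4", "s6", "s9", "s10", "s13"],
    ["s0", "s2", "s4", "s7", "s9", "s11", "s13"],
    ["s0", "s3", "s4", "s8", "s9", "s12", "s13"],
]


def shared_states(trajs, skip_start=True):
    """Return the states visited by every trajectory, ordered by state index."""
    counts = Counter()
    for traj in trajs:
        # Count each state at most once per trajectory.
        counts.update(set(traj[1:] if skip_start else traj))
    hubs = [s for s, c in counts.items() if c == len(trajs)]
    return sorted(hubs, key=lambda s: int(s[1:]))


hubs = shared_states(trajectories)
print(hubs)  # ['s4', 's9', 's13']
```

Exact state matching like this does not carry over to continuous dialogue states, which is one motivation for a learned segmentation model.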
The SDN architecture. SDN repeats a two-stage process of generating a state trajectory segment until a trajectory termination symbol is generated. First, given all previous states, it either uses an initial segment hidden state to start a new segment or emits a trajectory termination symbol to terminate the trajectory. Second, if the trajectory is not terminated, it keeps generating the next state in the current segment given the previous states until a segment termination symbol is generated. We illustrate this process in Figure 3.
We model the likelihood of each segment using an RNN, denoted as RNN1. During training, at each time step, RNN1 predicts the next state with the current state as input, until it reaches the option termination symbol #. Since different options are under different conditions, it is not plausible to apply a fixed initial input to each segment. Therefore, we use another RNN (RNN2) to encode all previous states to provide relevant information, and we transform this information into low-dimensional representations that serve as the initial inputs for the RNN1 instances. This is based on the causality assumption of the options framework (Sutton et al., 1999): the agent should be able to determine the next option given all previous information, and this should not depend on information related to any later state.
The low-dimensional representations are obtained via a global subgoal embedding matrix $M \in \mathbb{R}^{d \times D}$, where $d$ and $D$ are the dimensionality of RNN1's input layer and RNN2's output layer, respectively. Mathematically, if the output of RNN2 at time step $t$ is $o_t$, then from time $t$ the RNN1 instance has $M \cdot \mathrm{softmax}(o_t)$ as its initial input. $D$ is the number of subgoals we aim to learn. Ideally, the vector $\mathrm{softmax}(o_t)$ in a well-trained SDN is close to a one-hot vector. Therefore, $M \cdot \mathrm{softmax}(o_t)$ should be close to one column of $M$, and we can view $M$ as providing at most $D$ different "embedding vectors" as inputs for RNN1, indicating at most $D$ different subgoals. Even in the case where $\mathrm{softmax}(o_t)$ is not close to any one-hot vector, choosing a small $D$ helps avoid overfitting.
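To make the role of the embedding matrix concrete, the sketch below computes $M \cdot \mathrm{softmax}(o_t)$ in plain Python. The matrix entries, the logits, and the function names are hypothetical; in the model both $M$ and $o_t$ are learned.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def initial_input(M, o_t):
    """Return M . softmax(o_t), where M has d rows and D columns."""
    w = softmax(o_t)
    return [sum(m_ij * w_j for m_ij, w_j in zip(row, w)) for row in M]


# Hypothetical d = 2, D = 3 embedding matrix; each column is one subgoal embedding.
M = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 3.0]]

# When one logit dominates, softmax is nearly one-hot and the result is
# approximately the corresponding column of M (here the third column).
x = initial_input(M, [0.0, 0.0, 20.0])
```

When the softmax output is not peaked, the result is a convex combination of the columns of $M$, so the initial input always stays within the span of at most $D$ subgoal embeddings.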
Generally, for a state trajectory $s = (s_0, \ldots, s_T)$, we model its likelihood as follows:
$$L_S(s) = \sum_{\sigma \in \mathcal{S}(s),\ \mathrm{length}(\sigma) \le S} \ \prod_{i=1}^{\mathrm{length}(\sigma)} p\big(\sigma_i \mid \tau(\sigma_{1:i})\big), \tag{1}$$
where $\mathcal{S}(s)$ is the set of all possible segmentations for the trajectory $s$, $\sigma_i$ denotes the $i$-th segment in the segmentation $\sigma$, and $\tau$ is the concatenation operator, so that each segment is conditioned on the concatenated history up to its starting state. (For notational convenience, we include $s_0$ in the observational sequence, though $s_0$ is always conditioned upon.) $S$ is an upper limit on the number of segments. This parameter is important for learning subgoals in our setting since we usually prefer a small number of subgoals. This is different from Wang et al. (2017), where a maximum segment length is enforced.
We use maximum likelihood estimation with Eq. (1) for training. However, the number of possible segmentations in $\mathcal{S}(s)$ grows combinatorially with the trajectory length, so naive enumeration is intractable. Here, dynamic programming is employed to compute the likelihood in Eq. (1) efficiently: for a trajectory $s = (s_0, \ldots, s_T)$, if we denote the sub-trajectory $(s_i, \ldots, s_t)$ of $s$ as $s_{i:t}$, then its likelihood follows the recursion below:
$$L_m(s_{0:t}) = \begin{cases} I[t = 0], & m = 0, \\ \sum_{i=0}^{t} L_{m-1}(s_{0:i})\, p(s_{i:t} \mid s_{0:i}), & m > 0. \end{cases}$$
Here, $L_m(s_{0:t})$ denotes the likelihood of sub-trajectory $s_{0:t}$ with no more than $m$ segments and $I[\cdot]$ is an indicator function. $p(s_{i:t} \mid s_{0:i})$ is the likelihood of segment $s_{i:t}$ given the previous history, where RNN1 models the segment and RNN2 models the history as shown in Figure 3. With this recursion, we can compute the likelihood $L_S(s)$ for the trajectory $s = (s_0, \ldots, s_T)$ in $O(ST^2)$ time.
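The dynamic program can be sketched in a few lines of Python. This is a simplified illustration rather than the authors' implementation: it assumes every segment advances by at least one state and sums explicitly over segment counts from 1 to S, and `seg_prob(i, t)` is a hypothetical stand-in for the learned segment likelihood $p(s_{i:t} \mid s_{0:i})$.

```python
def trajectory_likelihood(T, S, seg_prob):
    """Total probability of splitting s_0..s_T into 1..S non-empty segments.

    seg_prob(i, t) returns p(s_{i:t} | s_{0:i}) for 0 <= i < t <= T.
    """
    # exact[m][t]: probability mass of covering s_{0:t} with exactly m segments.
    exact = [[0.0] * (T + 1) for _ in range(S + 1)]
    exact[0][0] = 1.0  # zero segments cover only the initial state s_0
    for m in range(1, S + 1):
        for t in range(1, T + 1):
            # Sum over the start index i of the last segment s_{i:t}.
            exact[m][t] = sum(exact[m - 1][i] * seg_prob(i, t) for i in range(t))
    return sum(exact[m][T] for m in range(1, S + 1))


# Toy segment model: every transition inside a segment has probability 0.5.
toy = lambda i, t: 0.5 ** (t - i)
print(trajectory_likelihood(T=3, S=2, seg_prob=toy))  # 3 segmentations * 0.125 = 0.375
```

The three nested loops over m, t, and i give the $O(ST^2)$ running time, in contrast to enumerating segmentations one by one.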
Learning algorithm. We denote by $\theta_s$ the model parameters, including the parameters of the embedding matrix $M$, RNN1 and RNN2. We then parameterize the segment likelihood function as $p(s_{i:t} \mid s_{0:i}) = p(s_{i:t} \mid s_{0:i}; \theta_s)$, and the trajectory likelihood function as $L_m(s_{0:t}) = L_m(s_{0:t}; \theta_s)$.

Hierarchical Dialogue Policy Learning
Before describing how we use a trained SDN model for HRL, we first present a short review of HRL for a task-oriented dialogue system. Following the options framework (Sutton et al., 1999), assume that we have a state set $S$, an option set $G$ and a finite primitive action set $A$.
The HRL approach we take learns two Q-functions (Peng et al., 2017b), parameterized by $\theta_e$ and $\theta_i$, respectively:
• The top-level $Q^*(s, g; \theta_e)$ measures the maximum total discounted extrinsic reward received by choosing subgoal $g$ in state $s$ and then following an optimal policy. These extrinsic rewards are the objective to be maximized by the entire dialogue policy.
• The low-level $Q^*(s, a, g; \theta_i)$ measures the maximum total discounted intrinsic reward received to achieve a given subgoal $g$, by choosing action $a$ in state $s$ and then following an optimal option policy. These intrinsic rewards are used to learn an option policy to achieve a given subgoal.
Suppose we have a dialogue session of $T$ turns: $\tau = (s_0, a_0, r_0, \ldots, s_T)$, which is segmented into a sequence of subgoals $g_0, g_1, \ldots \in G$. Consider one of these subgoals $g$, which starts and ends in steps $t_0$ and $t_1$, respectively.
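The top-level Q-function is learned using Q-learning, by treating subgoals as temporally extended actions:
$$Q^*(s_{t_0}, g; \theta_e) \leftarrow Q^*(s_{t_0}, g; \theta_e) + \alpha \cdot \big(q - Q^*(s_{t_0}, g; \theta_e)\big),$$
where
$$q = \sum_{t=t_0}^{t_1-1} \gamma^{t-t_0} r^e_t + \gamma^{t_1-t_0} \max_{g' \in G} Q^*(s_{t_1}, g'; \theta_e),$$
and $\alpha$ is the step-size parameter, $\gamma \in [0, 1]$ is a discount factor. In the above expression of $q$, the first term refers to the total discounted reward during fulfillment of subgoal $g$, and the second to the maximum total discounted reward after $g$ is fulfilled.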
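The two terms of the top-level target can be illustrated with a small scalar sketch. It is a tabular simplification rather than the neural parameterization used in the paper; the function names and numbers are hypothetical.

```python
def top_level_target(extrinsic_rewards, gamma, next_q_values):
    """Discounted extrinsic reward collected while pursuing subgoal g,
    plus the discounted best top-level value at the state where g ends."""
    n = len(extrinsic_rewards)  # number of turns spent on the subgoal
    during = sum((gamma ** k) * r for k, r in enumerate(extrinsic_rewards))
    # No bootstrap value if the dialogue ends together with the subgoal.
    after = max(next_q_values) if next_q_values else 0.0
    return during + (gamma ** n) * after


def q_update(old_q, target, alpha):
    """Move the current estimate toward the target with step size alpha."""
    return old_q + alpha * (target - old_q)


# Three turns on the subgoal, each with extrinsic reward -1; two candidate
# next subgoals with hypothetical values 4.0 and 8.0.
q = top_level_target([-1, -1, -1], gamma=0.5, next_q_values=[4.0, 8.0])
new_q = q_update(old_q=0.0, target=q, alpha=0.1)
print(q, new_q)  # -0.75 and approximately -0.075
```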
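The low-level Q-function is learned in a similar way, and follows the standard Q-learning update, except that intrinsic rewards for subgoal $g$ are used. Specifically, for $t = t_0, t_0 + 1, \ldots, t_1 - 1$:
$$Q^*(s_t, a_t, g; \theta_i) \leftarrow Q^*(s_t, a_t, g; \theta_i) + \alpha \cdot \big(q_t - Q^*(s_t, a_t, g; \theta_i)\big), \quad q_t = r^i_t + \gamma \max_{a' \in A} Q^*(s_{t+1}, a', g; \theta_i).$$
Here, the intrinsic reward $r^i_t$ is provided by the internal critic of the dialogue manager. More details are in Appendix A.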
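A tabular version of the low-level update, conditioned on the current subgoal, might look as follows. The dictionary-based table, the state and action names, and the `done` flag are assumptions of this sketch; the paper uses neural Q-functions.

```python
from collections import defaultdict


def low_level_update(Q, s, a, g, r_intrinsic, s_next, actions,
                     alpha, gamma, done=False):
    """One Q-learning step for the option policy of subgoal g."""
    # Best value achievable from the next state while still pursuing g.
    best_next = 0.0 if done else max(Q[(s_next, a2, g)] for a2 in actions)
    target = r_intrinsic + gamma * best_next
    Q[(s, a, g)] += alpha * (target - Q[(s, a, g)])
    return Q[(s, a, g)]


Q = defaultdict(float)  # unseen (state, action, subgoal) entries start at 0
actions = ["request_date", "inform_price"]
Q[("s1", "inform_price", "hotel")] = 2.0

v = low_level_update(Q, "s0", "request_date", "hotel",
                     r_intrinsic=-1.0, s_next="s1", actions=actions,
                     alpha=0.5, gamma=0.9)
print(v)  # 0.5 * (-1.0 + 0.9 * 2.0) = 0.4
```

Keying the table on the subgoal means the same state-action pair can have different values under different subgoals.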
In hierarchical policy learning, the combination of the extrinsic and intrinsic rewards is expected to help the agent to successfully accomplish a composite task as fast as possible while trying to avoid unnecessary subtask switches. Hence, we define the extrinsic and intrinsic rewards as follows.
Extrinsic Reward. Let $L$ be the maximum number of turns of a dialogue, and $K$ the number of subgoals. At the end of a dialogue, the agent receives a positive extrinsic reward of $2L$ for a successful dialogue, or $-L$ for a failed dialogue; for each turn, the agent receives an extrinsic reward of $-1$ to encourage shorter dialogues.
Intrinsic Reward. When a subgoal terminates, the agent receives a positive intrinsic reward of $2L/K$ if the subgoal is completed successfully, or a negative intrinsic reward of $-1$ otherwise; for each turn, the agent receives an intrinsic reward of $-1$ to encourage shorter dialogues.
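The reward design above can be written as two small functions. This is a sketch under one stated assumption: on the turn where the dialogue or subgoal ends, the terminal reward is returned instead of the per-turn penalty. The function names and the value K = 2 in the example are illustrative.

```python
def extrinsic_reward(L, dialogue_over, success):
    """-1 per turn; at the end of the dialogue, 2L on success or -L on failure."""
    if not dialogue_over:
        return -1
    return 2 * L if success else -L


def intrinsic_reward(L, K, subgoal_over, success):
    """-1 per turn; when a subgoal terminates, 2L/K on success or -1 otherwise."""
    if not subgoal_over:
        return -1
    return 2 * L / K if success else -1


# Example with a maximum of L = 60 turns and K = 2 subgoals.
print(extrinsic_reward(60, dialogue_over=True, success=True))       # 120
print(intrinsic_reward(60, 2, subgoal_over=True, success=True))     # 60.0
```

With these values, completing all K subgoals yields a total intrinsic bonus equal to the extrinsic success reward of 2L.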

Hierarchical Policy Learning with SDN
We use a trained SDN in HRL as follows. The agent starts from the initial state $s_0$ and keeps sampling the output from the distribution related to the top-level RNN (RNN1) until a termination symbol # is generated, which indicates that the agent has reached a subgoal. In this process, intrinsic rewards are generated as specified in the previous subsection. After # is generated, the agent selects a new option, and repeats this process.
This type of naive sampling may allow the option to terminate at some places with a low probability. To stabilize the HRL training, we introduce a threshold $p \in (0, 1)$, which directs the agent to terminate an option if and only if the probability of outputting # is at least $p$. We found that this modification leads to better behavior of the HRL agent than the naive sampling method, since it normally has a smaller variance.
In the HRL training, the agent only uses the probability of outputting # to decide subgoal termination. Algorithm 2 outlines the full procedure of one episode for hierarchical dialogue policies with a trained SDN in the composite task-completion dialogue system.
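The difference between naive sampling and the threshold rule can be sketched as follows. The per-turn termination probabilities are hypothetical values, and the function names are illustrative only.

```python
import random


def terminate_by_sampling(p_term, rng=random):
    """Naive rule: terminate with probability p_term (may fire at unlikely turns)."""
    return rng.random() < p_term


def terminate_by_threshold(p_term, threshold):
    """Threshold rule: terminate if and only if p_term is at least the threshold."""
    return p_term >= threshold


# Hypothetical probabilities of emitting the termination symbol # at each turn.
probs = [0.01, 0.05, 0.15, 0.60]
cuts = [t for t, q in enumerate(probs) if terminate_by_threshold(q, 0.2)]
print(cuts)  # [3]
```

The threshold rule is deterministic given the model's outputs, whereas the sampling rule can terminate an option at any turn with nonzero termination probability.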

Experiments and Results
We evaluate the proposed model on a travel planning scenario for composite task-oriented dialogues (Peng et al., 2017b). Over the exchange of a conversation, the agent gathers information about the user's intent before booking a trip. The environment then assesses a binary outcome (success or failure) at the end of the conversation, based on (1) whether a trip is booked, and (2) whether the trip satisfies the user's constraints.
Algorithm 2 HRL episode with a trained SDN
[...]
Select a new option o using the agent A.
10: Re-initialize R1 using the latest output from R2 and the embedding matrix M.
11: end if
12: end while
Dataset. The raw dataset in our experiments is from a publicly available multi-domain dialogue corpus (El Asri et al., 2017). Following Peng et al. (2017b), a few changes were made to introduce dependencies among subtasks. For example, the hotel check-in date should be the same as the arrival date of the departure flight. The data was mainly used to create simulated users, and to build the knowledge bases for the subtasks of booking flights and reserving hotels.
User Simulator. In order to learn good policies, RL algorithms typically need an environment to interact with. In the dialogue research community, it is common to use simulated users for this purpose (Schatzmann et al., 2007; Li et al., 2017; Liu and Lane, 2017). In this work, we adapted a publicly available user simulator (Li et al., 2016) to the composite task-completion dialogue setting with the dataset described above. During training, the simulator provides the agent with an (extrinsic) reward signal at the end of the dialogue. A dialogue is considered to be successful only when a travel plan is booked successfully, and the information provided by the agent satisfies the user's constraints.
Baseline Agents. We benchmarked the proposed agent (referred to as the m-HRL Agent) against three baseline agents:
• A Rule Agent uses a sophisticated, hand-crafted dialogue policy, which requests and informs a hand-picked subset of necessary slots, and then confirms with the user about the reserved trip before booking the flight and hotel.
• A flat RL Agent is trained with a standard deep reinforcement learning method, DQN (Mnih et al., 2015), which learns a flat dialogue policy using extrinsic rewards only.
• An h-HRL Agent is trained with hierarchical deep reinforcement learning (HDQN), which learns a hierarchical dialogue policy based on human-defined subgoals (Peng et al., 2017b).
Collecting State Trajectories. Recall that our subgoal discovery approach takes as input a set of state trajectories which lead to successful outcomes. In practice, one can collect a large set of successful state trajectories, either by asking human experts to demonstrate (e.g., in a call center), or by rolling out a reasonably good policy (e.g., a policy designed by human experts). In this paper, we obtain dialogue state trajectories from a rule-based agent which is handcrafted by a domain expert. This rule-based agent achieves a success rate of 32.2%, as shown in Figure 4 and Table 1. We only collect the successful dialogue sessions from the roll-outs of the rule-based agent, and try to learn the subgoals from these dialogue state trajectories.
Experiment Settings. To train SDN, we use RMSProp (Tieleman and Hinton, 2012) to optimize the model parameters. For both RNN1 and RNN2, we use LSTM (Hochreiter and Schmidhuber, 1997) as hidden units and set the hidden size to 50. We set the embedding matrix $M$ with $D = 4$ columns. As discussed in Section 3.1, during HRL training we use the learned SDN to propose subgoal-completion queries. In our experiment, we set the maximum turn $L = 60$. We collected $N = 1634$ successful, but imperfect, dialogue episodes from the rule-based agent in Table 1 and randomly chose 80% of these dialogue state trajectories for training SDN. The remaining 20% were used as a validation set. As illustrated in Section 3.3, SDN starts a new RNN1 instance and issues a subgoal-completion query when the probability of outputting the termination symbol # is above a certain threshold $p$ (as in Algorithm 2). In our experiment, $p$ is set to 0.2, which was manually picked according to the termination probability during SDN training.
In dialogue policy learning, for the baseline RL agent, we set the size of the hidden layer to 80. For the HRL agents, both top-level and low-level dialogue policies have a hidden layer size of 80. RMSProp was applied to optimize the parameters. We set the batch size to 16. During training, we used an ε-greedy strategy for exploration with annealing and set γ = 0.95. For each simulation epoch, we simulated 100 dialogues and stored the resulting state transition tuples in the experience replay buffers. At the end of each simulation epoch, the model was updated with all the transition tuples in the buffers in a batch manner.
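For readers unfamiliar with annealed ε-greedy exploration, a minimal sketch follows. The linear schedule and all of its constants (`eps_start`, `eps_end`, `anneal_steps`) are assumptions made for illustration; the annealing schedule is not specified in the text.

```python
import random


def epsilon_at(step, eps_start=1.0, eps_end=0.05, anneal_steps=1000):
    """Linearly anneal epsilon from eps_start to eps_end over anneal_steps."""
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action index, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


# With epsilon = 0 the choice is always greedy.
action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
print(action)  # 1
```

In a hierarchical agent, the same rule can be applied separately at the top level (over subgoals) and at the low level (over primitive actions).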

Simulated User Evaluation
In the composite task-completion dialogue scenario, we compared the proposed m-HRL agent with three baseline agents in terms of three metrics: success rate, average rewards and average turns per dialogue session.
Figure 4 shows the learning curves of all four agents trained against the simulated user. Each learning curve was averaged over 5 runs. Table 1 shows the test performance where each number was averaged over 5 runs and each run generated 2000 simulated dialogues. We find that the HRL agents generated higher success rates and needed fewer conversation turns to achieve the users' goals than the rule-based agent and the flat RL agent. The performance of the m-HRL agent is tied with that of the h-HRL agent, even though the latter requires high-quality subgoals designed by human experts.

Human Evaluation
We further evaluated the agents that were trained on simulated users against real users, who were recruited from the authors' organization. We conducted a study using one RL agent and two HRL agents {RL, h-HRL, m-HRL}, and compared two pairs: {RL, m-HRL} and {h-HRL, m-HRL}. In each dialogue session, one agent was randomly selected from the pool to interact with a user. The user was not told which agent was selected, to avoid systematic bias. The user was presented with a goal sampled from a user-goal corpus, and was then instructed to converse with the agent to complete the given task. At the end of each dialogue session, the user was asked to give a rating on a scale from 1 to 5 based on the naturalness and coherence of the dialogue; here, 1 is the worst rating and 5 the best. In total, we collected 196 dialogue sessions from 10 human users.
Figure 5 summarizes the performances of these agents against real users in terms of success rate. Figure 6 shows the distribution of user ratings for each agent. For these two metrics, both HRL agents were significantly better than the flat RL agent. Another interesting observation is that the m-HRL agent performs similarly to the h-HRL agent in terms of success rate in the real user study, as shown in Figure 5. Meanwhile, in Figure 6, the h-HRL agent is significantly better than the m-HRL agent in terms of real user ratings. This may be caused by the probabilistic termination of subgoals: we used a threshold strategy to decide whether to terminate a subgoal. This could introduce variance, so the agent might behave less reasonably than one with human-defined subgoals, which terminate deterministically.

Subgoal Visualization
Table 2 shows the subgoals discovered by SDN in a sample dialogue by a rule-based agent interacting with the simulated user. The rule-based agent is equipped with a human-defined subtask structure, which always solves subtask flight (turns 1-15) before hotel (turns 16-23), as shown in the first column. At turn 10, the user starts to talk about hotel while the rule-based agent is still working on the pre-defined, unfinished flight subtask until subtask flight is finished at turn 15. At turn 16, the user switches to hotel, and so does the rule-based agent until the end of the dialogue. For this rule-based agent, the human-defined subgoal (flight) terminates at turn 15. Meanwhile, our SDN model detected two subgoals (except for the final goal): one terminating at turn 9 (Subgoal 1), and another terminating at turn 15 (Subgoal 2). Subgoal 2 is consistent with the human-defined subgoal. Subgoal 1 is also reasonable since the user tries to switch to hotel at turn 10. In Appendix B, Table 3 shows a sample dialogue session by the m-HRL agent interacting with a real user.

Related Work
Task-completion dialogue systems have attracted numerous research efforts, and there is growing interest in leveraging reinforcement learning for policy learning. One line of research is on single-domain task-completion dialogues with flat deep reinforcement learning algorithms such as DQN (Zhao and Eskenazi, 2016; Li et al., 2017; Peng et al., 2018), actor-critic (Peng et al., 2017a; Liu and Lane, 2017) and policy gradients (Williams et al., 2017; Liu et al., 2017). Another line of research addresses multi-domain dialogues where each domain is handled by a separate agent (Gašić et al., 2015; Gašić et al., 2015; Cuayáhuitl et al., 2016). Recently, Peng et al. (2017b) presented a composite task-completion dialogue system. Unlike multi-domain dialogue systems, composite tasks introduce inter-subtask constraints. As a result, the completion of a set of individual subtasks does not guarantee the solution of the entire task. Cuayáhuitl et al. (2010) applied HRL to dialogue policy learning, although they focus on problems with a small state space.
Later, Budzianowski et al. (2017) used HRL in multi-domain dialogue systems. Peng et al. (2017b) first presented an HRL agent with a global state tracker to learn the dialogue policy in composite task-completion dialogue systems. All these works are built on subgoals that were pre-defined with human domain knowledge for the specific tasks. The only job of the policy learner is to learn a hierarchical dialogue policy, which leaves the subgoal discovery problem unsolved. In addition to the applications in dialogue systems, subgoals are also widely studied in the linguistics research community (Allwood, 2000; Linell, 2009).
In the literature, researchers have proposed algorithms to automatically discover subgoals for hierarchical RL. One large body of work is based on analyzing the spatial structure of the state transition graphs, by identifying bottleneck states or clusters, among others (Stolle and Precup, 2002; McGovern and Barto, 2001; Mannor et al., 2004; Şimşek et al., 2005; Entezari et al., 2011; Bacon, 2013). Another family of algorithms identifies commonalities of policies and extracts these partial policies as useful skills (Thrun and Schwartz, 1994; Pickett and Barto, 2002; Brunskill and Li, 2014). While similar in spirit to ours, these methods do not easily scale to continuous problems as in dialogue systems. More recently, researchers have proposed deep learning models to discover subgoals in continuous-state MDPs (Bacon et al., 2017; Machado et al., 2017; Vezhnevets et al., 2017). It would be interesting to see how effective they are for dialogue management.
Segmental structures are common in human languages. In the NLP community, some related research on segmentation includes word segmentation (Gao et al., 2005; Zhang et al., 2016) to divide the words into meaningful units. Alternatively, topic detection and tracking (Allan et al., 1998; Sun et al., 2007) segment a stream of data and identify stories or events in news or social text. In this work, we formulate subgoal discovery as a trajectory segmentation problem. Section 3.1 presents our approach to subgoal discovery, which is inspired by a probabilistic sequence segmentation model (Wang et al., 2017).

Conclusion
We have proposed the Subgoal Discovery Network to learn subgoals automatically in an unsupervised fashion without human domain knowledge. Based on the discovered subgoals, we learn the dialogue policy for complex task-completion dialogue agents using HRL. Our experiments with both simulated and real users on a composite task of travel planning show that an agent trained with automatically discovered subgoals performs competitively against an agent with human-defined subgoals, and significantly outperforms an agent without subgoals. Through visualization, we find that SDN discovers reasonable, comprehensible subgoals given only a small number of suboptimal but successful dialogue state trajectories.
These promising results suggest several directions for future research. First, we want to integrate subgoal discovery into dialogue policy learning rather than treat them as two separate processes. Second, we would like to extend SDN to identify multi-level hierarchical structures among subgoals so that we can handle more complex tasks than those studied in this paper. Third, we would like to generalize SDN to a wide range of complex goal-oriented tasks beyond dialogue, such as the particularly challenging Atari game of Montezuma's Revenge (Kulkarni et al., 2016).

Figure 3: Illustration of SDN for state trajectory $(s_0, \ldots, s_5)$ with $s_2$, $s_4$ and $s_5$ as subgoals. Symbol # is the termination. The top-level RNN (RNN1) models segments and the low-level RNN (RNN2) provides information about previous states from RNN1. The embedding matrix $M$ maps the outputs of RNN2 to low-dimensional representations so as to be consistent with the input dimensionality of RNN1. Note that state $s_5$ is associated with two termination symbols #; one is for the termination of the last segment and the other is for the termination of the entire trajectory.

Figure 5: Performance of three agents tested with real users: success rate, number of dialogues and p-value are indicated on each bar (difference in mean is significant with p < 0.05).

Figure 6: Distribution of user ratings for three agents in human evaluation.
Input: A trained SDN M, initial state s0 of an episode, threshold p, the HRL agent A.
1: Initialize an RNN2 instance R2 with parameters from M and s0 as the initial input.
2: Initialize an RNN1 instance R1 with parameters from M and M • softmax(o RNN2

Table 1: Performance of agents with simulated user.

Table 2: Discovered subgoals (except for the final goal) in a sample dialogue by a rule-based agent interacting with the user simulator. The left column (h-Task) shows the human-defined subtasks for the rule-based agent.