Sub-domain Modelling for Dialogue Management with Hierarchical Reinforcement Learning

Human conversation is inherently complex, often spanning many different topics/domains. This makes policy learning for dialogue systems very challenging. Standard flat reinforcement learning methods do not provide an efficient framework for modelling such dialogues. In this paper, we focus on the under-explored problem of multi-domain dialogue management. First, we propose a new method for hierarchical reinforcement learning using the option framework. Next, we show that the proposed architecture learns faster and arrives at a better policy than the existing flat ones do. Moreover, we show how pretrained policies can be adapted to more complex systems with an additional set of new actions. In doing so, we show that our approach has the potential to facilitate policy optimisation for more sophisticated multi-domain dialogue systems.


Introduction
The statistical approach to dialogue modelling has proven to be an effective way of building conversational agents capable of providing required information to the user (Williams and Young, 2007; Young et al., 2013). Spoken dialogue systems (SDS) usually consist of various statistical components, dialogue management being the central one. Optimising dialogue management can be seen as a planning problem and is normally tackled using reinforcement learning (RL). Many approaches to policy management over single domains, with the ability to learn from scratch, have been proposed in recent years (Fatemi et al., 2016; Gašić and Young, 2014; Su et al., 2016; Williams and Zweig, 2016).
The goal of this work is to propose a coherent framework for a system capable of managing conversations over multiple dialogue domains. Recently, a number of frameworks were proposed for handling multi-domain dialogue as multiple independent single-domain sub-dialogues (Lison, 2011; Wang et al., 2014). Cuayáhuitl et al. (2016) proposed a network of deep Q-networks with an SVM classifier for domain selection. However, such frameworks do not scale to modelling complex conversations over large state/action spaces, as they do not facilitate conditional training over multiple domains. This inhibits their performance, as domains often share sub-tasks where decisions in one domain influence learning in the other ones.
In this paper, we apply hierarchical reinforcement learning (HRL) (Barto and Mahadevan, 2003) to dialogue management over complex dialogue domains. Our system learns how to handle complex dialogues by learning a multi-domain policy over different domains that operate on independent time-scales with temporally-extended actions.
HRL gives a principled way for learning policies over complex problems. It overcomes the curse of dimensionality which plagues the majority of complex tasks by reducing them to a sequence of sub-tasks. It also provides a learning framework for managing those sub-tasks at the same time (Dietterich, 2000;Sutton et al., 1999b;Bacon et al., 2017).
Even though the first work on HRL dates back to the 1970s, its usefulness for dialogue management is relatively under-explored. A notable exception is the work of Cuayáhuitl (2009; 2010), whose method is based on the MAXQ algorithm (Dietterich, 2000) making use of hierarchical abstract machines (Parr and Russell, 1998). The main limitation of this work comes from the tabular approach, which prevents efficient approximation of the state space and the objective function. This is crucial for scalability of spoken dialogue systems to more complex scenarios. Parallel to our work, Peng et al. (2017) proposed another HRL approach, using deep Q-networks as an approximator. In separate work, we found deep Q-networks to be unstable; in this work, we focus on more robust estimators.
The contributions of this paper are threefold. First, we adapt and validate the option framework (Sutton et al., 1999b) for a multi-domain dialogue system. Second, we demonstrate that hierarchical learning for dialogue systems works well with function approximation using the GPSARSA algorithm. We chose the Gaussian process as the function approximator as it provides uncertainty estimates which can be used to speed up learning and achieve more robust performance. Third, we show that independently pre-trained domains can be easily integrated into the system and adapted to handle more complex conversations.

Hierarchical Reinforcement Learning
Dialogue management can be seen as a control problem: it estimates a distribution over possible user requests (belief states) and chooses what to say back to the user, i.e. which actions to take to maximise positive user feedback (the reward).
Reinforcement Learning The framework described above can be analysed from the perspective of the Markov Decision Process (MDP). We can apply RL to our problem, where we parametrise an optimal policy π : B × A → [0, 1]. The learning procedure can either directly look for the optimal policy (Sutton et al., 1999a) or model the Q-value function (Sutton and Barto, 1999):

Q^π(b, a) = E_π [ Σ_{k=0}^{T−t} γ^k r_{t+k} | b_t = b, a_t = a ],

where r_t is the reward at time t and 0 < γ ≤ 1 is the discount factor. Both approaches have proven to be an effective and robust way of training dialogue systems online in interaction with real users (Gašić et al., 2011; Williams and Zweig, 2016).
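As a small concrete check of this definition (a sketch; the reward trace is invented), the discounted return that Q estimates can be accumulated backwards over a sequence of per-turn rewards:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_k gamma^k * r_{t+k} for a reward trace starting at t."""
    g = 0.0
    for r in reversed(rewards):  # fold from the last reward backwards
        g = r + gamma * g
    return g

# e.g. two per-turn penalties followed by a success bonus, gamma = 1.0
print(discounted_return([-1.0, -1.0, 20.0], gamma=1.0))  # 18.0
```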
Gaussian Processes in RL Gaussian Process RL (GPRL) is one of the state-of-the-art RL algorithms for dialogue modelling (Gašić and Young, 2014), where the Q-value function is approximated using a Gaussian process with a zero mean and a chosen kernel function k(·, ·), i.e. Q(b, a) ∼ GP(0, k((b, a), (b, a))). Gaussian processes follow a pure Bayesian framework, which allows one to obtain the posterior given a new collected pair (b, a). The trade-off between exploration and exploitation is handled naturally: given belief state b at time t, we can sample from the posterior Q(b, a) over the set of available actions A and choose the action with the highest sampled Q-value.
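This posterior-sampling action selection can be sketched in a few lines. The posterior means and variances below are invented placeholders standing in for the GP's predictive moments at one belief state, not the GPSARSA implementation itself:

```python
import random

def sample_action(q_means, q_vars, rnd):
    """Thompson sampling over actions: draw one sample from each action's
    Gaussian Q-posterior and act greedily on the samples."""
    samples = [rnd.gauss(m, v ** 0.5) for m, v in zip(q_means, q_vars)]
    return max(range(len(samples)), key=samples.__getitem__)

# Hypothetical posterior moments for three actions in some belief state b:
# a confidently good action, a highly uncertain one, and a middling one.
q_means = [1.0, 0.5, 0.9]
q_vars = [0.01, 1.0, 0.5]
rnd = random.Random(0)
picks = [sample_action(q_means, q_vars, rnd) for _ in range(1000)]
```

The confidently good action dominates the picks, yet the uncertain actions are still tried occasionally, which is exactly the exploration behaviour described above.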
Hierarchical Policy Standard flat models, where a single Markov Decision Process is responsible for solving multi-task problems, have proven to be inefficient. These models have trouble overcoming the cold start problem and/or suffer from the curse of dimensionality (Barto and Mahadevan, 2003). This pattern was also observed with state-of-the-art models proposed recently (Mnih et al., 2013; Duan et al., 2016).
To overcome this issue, many frameworks have been proposed in the literature (Fikes et al., 1972;Laird et al., 1986;Parr and Russell, 1998). They make use of hierarchical control architectures and learning algorithms whereby specifying a hierarchy of tasks and reusing parts of the state space across many sub-tasks can greatly improve both learning speed and agent performance.
The key idea is the notion of temporal abstraction (Sutton et al., 1999b) where decisions at the given level are not required at each step but can call temporally-extended sub-tasks with their own policies.
The Option Framework One of the most natural generalisations of flat RL methods to complex tasks, and one easily interchangeable with primitive actions, is the option model (Sutton et al., 1999b). An option is a generalisation of a single-step action that might span more than one time-step and can be used as a standard action. From a mathematical perspective, an option is a tuple ⟨π, β, I⟩ that consists of a policy π : S × A → [0, 1] which conducts the option, a stochastic termination condition β : S → [0, 1], and an input set I ⊆ S which specifies when the option is available.
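The tuple ⟨π, β, I⟩ maps directly onto a small data structure. The following sketch uses invented state labels and a hypothetical booking option purely for illustration:

```python
from dataclasses import dataclass
from typing import Callable, Set

State = str  # hypothetical discrete state labels for illustration

@dataclass
class Option:
    policy: Callable[[State], str]   # pi: action to take while the option runs
    beta: Callable[[State], float]   # termination probability in state s
    init_set: Set[State]             # I: states where the option may start

    def available(self, s: State) -> bool:
        return s in self.init_set

# A hypothetical "book" option: always requests a date, and terminates
# with certainty once the booking sub-goal state is reached.
book = Option(
    policy=lambda s: "request_date",
    beta=lambda s: 1.0 if s == "booked" else 0.0,
    init_set={"entity_found"},
)
print(book.available("entity_found"))  # True
```

A primitive action is recovered as the special case where beta is 1 everywhere, so options and one-step actions can be mixed in a single action set.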
As we consider hierarchical architectures with temporally extended activities, we have to generalise the MDP to the semi-Markov Decision Process (SMDP) (Parr and Russell, 1998), where actions can take a variable amount of time to complete. This creates a division between primitive actions that span only one time-step (and can be seen as the classic reinforcement learning setting) and composite actions (options) that involve the execution of a sequence of primitive actions. This introduces a policy µ over options that selects option o in state s with probability µ(s, o); o's policy might in turn select other options until o terminates, and so on. The value function for option policies can be defined in terms of the value functions of the semi-Markov flat policies (Sutton et al., 1999b). Define the value function under a semi-Markov flat policy as:

V^π(s) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ⋯ | E(π, s, t) },

where E(π, s, t) is the event of π being initiated at time t in s. The value function for the policy over options µ can then be defined as the value function for the corresponding flat policy. This means we can apply off-the-shelf RL methods in HRL using different time-scales.

Hierarchical Policy Management
We propose a multi-domain dialogue system with a pre-imposed hierarchy that uses the option framework for learning an optimal policy. The user starts a conversation in one of the master domains and, having satisfied his/her goal there, switches to the other domains, which are seen by the model as sub-domains. To model the individual policies, we can use any RL algorithm. In separate work, we found deep RL models performing worse in noisy environments; thus, we employ the GPSARSA model from Section 2, which handles noise in the environment efficiently.

Algorithm 1 Hierarchical GPRL
1: Initialise dictionary sets DM, DS and policies πM, πS for master and sub-domains accordingly
2: for episode = 1:N do
3:   Start dialogue and obtain initial state b
4:   while b is not terminal do
5:     Choose action a according to πM
6:     if a is primitive then
7:       Execute a and obtain next state b
8:       Obtain extrinsic reward re
9:     else
10:      Switch to chosen sub-domain
11:      while b is not terminal and a has not terminated do
12:        Choose action a according to πS
13:        Obtain next state b
14:        Obtain intrinsic reward ri
15:        Store transition in DS
16:   Update parameters with DM, DS

The system is trained from scratch, learning an appropriate policy using both primitive and temporally-extended actions. We consider two task-oriented master domains providing restaurant and hotel information for the Cambridge (UK) area. Having found the desired entity, the user can then book it for a specified amount of time or pay for it. The two domains have a set of primitive actions (such as request, confirm or inform) and a set of composite actions (e.g., book, pay) which call sub-domains shared between them.
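A minimal executable sketch of this two-time-scale control loop follows. The toy environment, state labels, and hand-coded policies are illustrative assumptions rather than the PyDial interface, and the GP parameter updates are omitted:

```python
def run_episode(master_policy, sub_policies, env):
    """One dialogue under the hierarchical control loop (updates omitted).

    Primitive actions yield extrinsic rewards; composite actions hand
    control to a sub-domain policy that collects intrinsic rewards."""
    b = env.reset()
    master_transitions, sub_transitions = [], []
    while not env.terminal(b):
        a = master_policy(b)
        if a in sub_policies:  # composite action: switch to the sub-domain
            sub_pi = sub_policies[a]
            while not env.terminal(b) and not env.sub_done(b, a):
                a_s = sub_pi(b)
                b_next, r_i = env.sub_step(b, a_s)       # intrinsic reward
                sub_transitions.append((b, a_s, r_i, b_next))
                b = b_next
        else:                  # primitive action: stay in the master domain
            b_next, r_e = env.step(b, a)                 # extrinsic reward
            master_transitions.append((b, a, r_e, b_next))
            b = b_next
    return master_transitions, sub_transitions


class ToyDialogueEnv:
    """Hypothetical 4-state dialogue: find an entity, book it, say goodbye."""
    def reset(self): return "start"
    def terminal(self, b): return b == "done"
    def sub_done(self, b, a): return b == "booked"
    def step(self, b, a):
        if b == "start" and a == "inform": return "entity_found", -1.0
        if b == "booked" and a == "bye": return "done", 20.0
        return b, -1.0
    def sub_step(self, b, a):
        if b == "entity_found" and a == "request_date": return "booked", 1.0
        return b, -1.0


master_pi = lambda b: {"start": "inform", "entity_found": "book",
                       "booked": "bye"}[b]
sub_pis = {"book": lambda b: "request_date"}
m, s = run_episode(master_pi, sub_pis, ToyDialogueEnv())
print(sum(r for _, _, r, _ in m))  # 19.0 extrinsic return for this dialogue
```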
The Booking and Payment domains were created in a similar fashion: the user wants to reserve a table in a restaurant or a room in a hotel for a specific amount of money or duration of time. The system's role is to determine whether it is possible to make the requested booking. The sub-domains operate only on primitive actions and are learnt following the standard RL framework. Figure 1 shows the analysed architecture: the Booking and Payment tasks/sub-domains are shared between the two master domains. This means we can train general policies for those sub-tasks that adapt to the current dialogue given the information passed to them by the master domains.
Learning proceeds on two different time-scales. Following Dietterich (2000) and Kulkarni et al. (2016), we use pseudo-rewards to train the sub-domains, with an internal critic assessing whether the sub-goal has been reached.
The master domains are trained using the reward signal from the environment. If a one-step option (i.e., a primitive action) is chosen, we obtain the extrinsic reward directly from the environment.

Experiments
The PyDial dialogue modelling toolkit was used to evaluate the proposed architecture. The restaurant domain consists of approximately 100 venues with 3 search-constraint slots, while the hotel domain has 33 entities with 5 properties. There are 5 slots in the booking domain that the system can ask for, while the payment domain has 3 search-constraint slots. In the case of the flat approach, each master domain was combined with the sub-domains, resulting in 11 and 13 requestable slots for the restaurant and hotel domains, respectively.
The input for all models was the full belief state b, which expresses the distribution over the user intents and the requestable slots. The belief state has size 311, 156, 431 and 174 for the restaurant, hotel, booking and payment domains, respectively, in the hierarchical approach. The flat models have input spaces of sizes 490 and 333 for the restaurant and hotel domains, respectively.
The proposed models were evaluated with an agenda-based simulated user (Schatzmann et al., 2006) where the user intent was perfectly captured in the dialogue belief state. For both intrinsic and extrinsic evaluation, the total return of each dialogue was set to 1(D) * 20 − T, where T is the dialogue length and 1(D) is the success indicator for dialogue D, and a maximum dialogue length was imposed. At the beginning of each dialogue, the master domain is chosen randomly and the user is given a goal which consists of finding an entity and either booking it (for a specific date) or paying for it. The user was allowed to change the goal with a small probability and could not proceed with the sub-domains before achieving the master-domain goal.
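The return above is straightforward to compute; a minimal sketch:

```python
def dialogue_return(success: bool, num_turns: int) -> int:
    """Total return 1(D) * 20 - T: success bonus minus dialogue length."""
    return (20 if success else 0) - num_turns

print(dialogue_return(True, 7))    # 13: successful 7-turn dialogue
print(dialogue_return(False, 12))  # -12: failed 12-turn dialogue
```

The per-turn penalty encourages the policy to succeed in as few turns as possible rather than merely to succeed eventually.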

Hierarchical versus the Flat Approach
Following Dietterich (2000) and Kulkarni et al. (2016), we apply a more exploratory policy in the case of the master domains, allowing greater flexibility in managing primitive and composite actions during the initial learning stages. Figure 2 presents the results with 4000 training dialogues, where the policy was evaluated after every 200 dialogues.
The results validate the option framework: it learns faster and leads to a better final policy than the flat approach. The flat model did overcome the cold start problem but it could not match the performance of the hierarchical model. The policies learnt for sub-tasks with the flat approach perform only 10% worse (on average) than in the hierarchical case. However, providing the entity in both master domains has around 20% lower success rate compared to HRL.
Moreover, the flat model was not able to match the performance of the HRL approach even with more training dialogues. We let it run for another 6000 dialogues and did not observe any improvements in success rate (not reported here). This confirms the findings from other RL tasks: the flat approach is not able to remember successful strategies across different tasks (Peng et al., 2017; Duan et al., 2016). An example of two successful dialogues for both models is presented in Figure 4.

Adaptation of Pretrained Policies
Following the idea of curriculum learning (Bengio et al., 2009), we test the adaptation capabilities of pre-trained policies in more complex situations. Adaptation has proven to be an effective way of reusing existing dialogue policies in new domains. Since the kernel function is factored into a kernel over the belief state space and a kernel over the action space, we can consider them separately. Following prior work, the action kernel function is defined only on actions that appear in both the original and the extended action sets, and is defined to be 0 otherwise. The kernel over the belief state space is not changed, as we operate on the same belief space.
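A sketch of such a cross-domain action kernel follows, assuming a simple delta kernel on the shared action set; the specific kernel choice and action names here are illustrative assumptions, not the paper's exact formulation:

```python
def action_kernel(a1, a2, shared_actions):
    """k_A(a1, a2): delta kernel restricted to actions present in both the
    original and the extended action sets; 0 for any action outside it."""
    if a1 not in shared_actions or a2 not in shared_actions:
        return 0.0
    return 1.0 if a1 == a2 else 0.0

shared = {"request", "confirm", "inform"}         # actions common to both sets
print(action_kernel("inform", "inform", shared))  # 1.0
print(action_kernel("book", "book", shared))      # 0.0: 'book' is new
```

Zeroing the kernel on new actions means the pre-trained GP contributes no prior correlations for them, so their Q-values are learned afresh while knowledge about the shared actions transfers intact.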
We first train both master domains (without sub-goals) until robust policies are learned. Subsequently, both master domains are re-trained in a hierarchical manner for 4000 dialogues (evaluating after every 200). Figure 3 shows the results compared to the policy learnt from scratch. Both policies trained on independent domains were able to adapt very quickly to the more complicated tasks using the hierarchical framework with new options. This confirms that our approach can substantially reduce learning time by training a policy in a supervised way with the available data and then adapting it to more complex multi-task conversations.

Conclusion and Future Work
This paper introduced a hierarchical policy management model for learning dialogue policies which operate over composite tasks. The proposed model uses hierarchical reinforcement learning with a Gaussian process as the function approximator. Our evaluation showed that our model learns substantially faster and achieves better performance than standard (flat) RL models. The natural next step towards the generalisation of this approach is to deepen the hierarchy and apply it to more complex tasks.