Feudal Reinforcement Learning for Dialogue Management in Large Domains

Reinforcement learning (RL) is a promising approach to dialogue policy optimisation. Traditional RL algorithms, however, fail to scale to large domains due to the curse of dimensionality. We propose a novel Dialogue Management architecture, based on Feudal RL, which decomposes the decision into two steps: a first step where a master policy selects a subset of primitive actions, and a second step where a primitive action is chosen from the selected subset. The structural information included in the domain ontology is used to abstract the dialogue state space, taking the decisions at each step using different parts of the abstracted state. This, combined with an information sharing mechanism between slots, increases the scalability to large domains. We show that an implementation of this approach, based on Deep-Q Networks, significantly outperforms the previous state of the art in several dialogue domains and environments, without the need for any additional reward signal.


Introduction
Task-oriented Spoken Dialogue Systems (SDS), in the form of personal assistants, have recently gained much attention in both academia and industry. One of the most important modules of an SDS is the Dialogue Manager (DM) (or policy), the module in charge of deciding the next action in each dialogue turn. Reinforcement Learning (RL) (Sutton and Barto, 1999) has been studied for several years as a promising approach to model dialogue management (Levin et al., 1998; Henderson et al., 2008; Pietquin et al., 2011; Casanueva et al., 2015; Su et al., 2016). However, as the dialogue state space increases, the number of possible trajectories that need to be explored grows exponentially, making traditional RL methods not scalable to large domains.
* Currently at PolyAI, inigo@poly-ai.com
Hierarchical RL (HRL), in the form of temporal abstraction, has been proposed in order to mitigate this problem (Cuayáhuitl et al., 2010, 2016; Peng et al., 2017). However, proposed HRL methods require that the task is defined in a hierarchical structure, which is usually handcrafted. In addition, they usually require additional rewards for each subtask. Space abstraction, instead, has been successfully applied to dialogue tasks such as Dialogue State Tracking (DST) (Henderson et al., 2014b), and policy transfer between domains (Gašić et al., 2015; Wang et al., 2015). For DST, a set of binary classifiers can be defined for each slot, with shared parameters, learning a general way to track slots. The policy transfer method presented in (Wang et al., 2015), named Domain Independent Parametrisation (DIP), transforms the belief state into a slot-dependent fixed size representation using a handcrafted feature function. This idea could also be applied to large domains, since it can be used to learn a general way to act in any slot.
In slot-filling dialogues, an HRL method that relies on space abstraction, such as Feudal RL (FRL) (Dayan and Hinton, 1993), should allow RL to scale to domains with a large number of slots. FRL divides a task spatially rather than temporally, decomposing the decisions into several steps and using different abstraction levels in each sub-decision. This framework is especially useful in RL tasks with large discrete action spaces, making it very attractive for large domain dialogue management.
In this paper, we introduce a Feudal Dialogue Policy which decomposes the decision in each turn into two steps. In a first step, the policy decides if it takes a slot independent or slot dependent action. Then, the state of each slot sub-policy is abstracted to account for features related to that slot, and a primitive action is chosen from the previously selected subset. Our model does not require any modification of the reward function and the hierarchical architecture is fully specified by the structured database representation of the system (i.e. the ontology), requiring no additional design.

Background
Dialogue management can be cast as a continuous MDP composed of a continuous multivariate belief state space $B$, a finite set of actions $A$ and a reward function $R(b_t, a_t)$. At a given time $t$, the agent observes the belief state $b_t \in B$, executes an action $a_t \in A$ and receives a reward $r_t \in \mathbb{R}$ drawn from $R(b_t, a_t)$. The action taken, $a$, is decided by the policy, defined as the function $\pi(b) = a$. For any policy $\pi$ and $b \in B$, the Q-value function can be defined as the expected (discounted) return $R$, starting from state $b$, taking action $a$, and then following policy $\pi$ until the end of the dialogue at time step $T$:

$$Q^{\pi}(b, a) = \mathbb{E}\{R \mid b_t = b, a_t = a\}, \qquad R = \sum_{\tau=t}^{T-1} \gamma^{(\tau - t)} r_{\tau},$$

where $\gamma$ is a discount factor with $0 \le \gamma \le 1$. The objective of RL is to find an optimal policy $\pi^*$, i.e. a policy that maximises the expected return in each belief state. In value-based algorithms, the optimal policy can be found by greedily taking the action which maximises $Q^{\pi}(b, a)$.
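The discounted return and the greedy action choice described above can be illustrated with a minimal Python sketch. The function names, the default discount factor and the dictionary representation of Q-values are assumptions made for this illustration and are not taken from the paper or from PyDial.

```python
from typing import Dict, Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Return sum_k gamma**k * rewards[k], i.e. the return from the first listed turn.

    `rewards` holds r_t, r_{t+1}, ..., r_{T-1}; `gamma` is the discount factor
    (the default value is an arbitrary choice for this sketch).
    """
    total = 0.0
    # Accumulate backwards so each step applies one extra factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def greedy_action(q_values: Dict[str, float]) -> str:
    """Pick the action with the highest estimated Q-value in the current belief state."""
    return max(q_values, key=q_values.get)
```

For example, `discounted_return([1, 1, 1], gamma=0.5)` evaluates to 1.75, and `greedy_action({"request(food)": 0.2, "inform()": 0.7})` selects `"inform()"`.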
In slot-filling SDSs the belief state space $B$ is defined by the ontology, a structured representation of a database of entities that the user can retrieve by talking to the system. Each entity has a set of properties, referred to as slots $S$, where each of the slots can take a value from the set $V_s$. The belief state $b$ is then defined as the concatenation of the probability distribution of each slot, plus a set of general features (e.g. the communication function used by the user, the database search method...) (Henderson et al., 2014a). The set $A$ is defined as a set of summary actions, where the actions can be either slot dependent (e.g. request(food), confirm(area)...) or slot independent 1 (e.g. hello(), inform()...).
The belief space $B$ is defined by the ontology, therefore belief states of different domains will have different shapes. In order to transfer knowledge between domains, Domain Independent Parametrisation (DIP) (Wang et al., 2015) proposes to abstract the belief state $b$ into a fixed size representation. As each action is either slot independent or dependent on a slot $s$, a feature function $\phi_{dip}(b, s)$ can be defined, where $s \in S \cup \{s_i\}$ and $s_i$ stands for slot independent actions. Therefore, in order to compute the policy, $Q(b, a)$ can be approximated as $Q(\phi_{dip}(b, s), a)$, where $s$ is the slot associated to action $a$. Wang et al. (2015) present a handcrafted feature function $\phi_{dip}(b, s)$. It includes the slot independent features of the belief state, a summarised representation of the joint belief state, and a summarised representation of the belief state of the slot $s$. Section 4 gives a more detailed description of the $\phi_{dip}(b, s)$ function used in this work.
1 We include the summary actions dependent on all the slots, such as inform(), in this group.
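To make the idea of a fixed size, slot-dependent representation concrete, the following Python sketch builds a feature vector whose length does not depend on the number of slots or on the number of values per slot. It is only an illustration: the specific summary features (top two probabilities, entropy, product of top probabilities) and all names are assumptions chosen for this example, not the handcrafted features of Wang et al. (2015) or of this work.

```python
import math
from typing import Dict, List, Optional

SlotBelief = Dict[str, float]  # value -> probability for one slot


def summarise_slot(dist: SlotBelief) -> List[float]:
    """Fixed-size summary of one slot's distribution (hypothetical features)."""
    probs = sorted(dist.values(), reverse=True)
    top1 = probs[0] if probs else 0.0
    top2 = probs[1] if len(probs) > 1 else 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return [top1, top2, entropy]


def phi_dip(general: List[float], beliefs: Dict[str, SlotBelief],
            slot: Optional[str] = None) -> List[float]:
    """Concatenate general, joint and slot-specific summaries.

    slot=None plays the role of s_i (slot independent actions); the
    slot-specific part is then zero-padded so the length never changes.
    """
    # Hypothetical joint-belief summary: probability of the most likely joint value.
    joint_top = 1.0
    for dist in beliefs.values():
        joint_top *= max(dist.values(), default=0.0)
    slot_part = summarise_slot(beliefs[slot]) if slot is not None else [0.0] * 3
    return list(general) + [joint_top] + slot_part
```

Because the output length is the same for every slot, a single Q-function can take $\phi_{dip}(b, s)$ as input regardless of the domain, which is what makes transfer between domains possible.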

Feudal dialogue management
FRL decomposes the policy decision $\pi(b) = a$ in each turn into several sub-decisions, using different abstracted parts of the belief state in each sub-decision. The objective of a task-oriented SDS is to fulfill the user's goal, but as the goal is not observable for the SDS, the SDS needs to gather enough information to correctly fulfill it. Therefore, in each turn, the DM can decompose its decision into two steps: first, decide between taking an action in order to gather information about the user goal (information gathering actions) or taking an action to fulfill the user goal or a part of it (information providing actions) and second, select a (primitive) action to execute from the previously selected subset. In a slot-filling dialogue, the set of information gathering actions can be defined as the set of slot dependent actions, while the set of information providing actions can be defined as the remaining actions.
The architecture of the feudal policy proposed by this work is represented schematically in Figure 1. The (primitive) actions are divided between two subsets: slot independent actions $A_i$ (e.g. hello(), inform()) and slot dependent actions $A_d$ (e.g. request(), confirm()). In addition, a set of master actions $A_m = \{a^m_i, a^m_d\}$ is defined, where $a^m_i$ corresponds to taking an action from $A_i$ and $a^m_d$ to taking an action from $A_d$. Then, a feature function $\phi_s(b) = b_s$ is defined for each slot $s \in S$, as well as a slot independent feature function $\phi_i(b) = b_i$ and a master feature function $\phi_m(b) = b_m$. These feature functions can be handcrafted (e.g. the DIP feature function introduced in Section 2) or any function approximator can be used (e.g. neural networks trained jointly with the policy).
Finally, a master policy $\pi_m(b_m) = a^m$, a slot independent policy $\pi_i(b_i) = a^i$ and a set of slot specific policies $\pi_s(b_s) = a^d$, one for each $s \in S$, are defined, where $a^m \in A_m$, $a^i \in A_i$ and $a^d \in A_d$. Contrary to other feudal policies, the slot specific sub-policies have shared parameters, in order to generalise between slots (following the idea used by Henderson et al. (2014b) for DST). The differences between the slots (size, value distribution...) are accounted for by the feature function $\phi_s(b)$. Therefore $\pi_m(b_m)$ is defined as:

$$\pi_m(b_m) = \arg\max_{a^m \in A_m} Q^m(b_m, a^m)$$

If $\pi_m(b_m) = a^m_i$, the sub-policy run is $\pi_i$:

$$\pi_i(b_i) = \arg\max_{a^i \in A_i} Q^i(b_i, a^i)$$

Else, if $\pi_m(b_m) = a^m_d$, $\pi_d$ is selected. This policy runs each slot specific policy, $\pi_s$, for all $s \in S$, choosing the action-slot pair that maximises the Q function over all the slot sub-policies:

$$\pi_d(b_s\ \forall s \in S) = \arg\max_{a^d \in A_d,\ s \in S} Q^s(b_s, a^d)$$
Then, the summary action $a$ is constructed by joining $a^d$ and $s$ (e.g. if $a^d$ = request() and $s$ = food, then the summary action will be request(food)). Pseudo-code for the Feudal Dialogue Policy algorithm is given in Appendix A.
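The two-step decision described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Q-functions are passed in as plain callables returning a dictionary of Q-values, the master action labels `"a_m_i"` and `"a_m_d"` and all function names are hypothetical, and a single `q_slot` callable stands in for the slot specific sub-policies with shared parameters.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Features = Sequence[float]
QFunction = Callable[[Features], Dict[str, float]]  # features -> {action: Q-value}


def argmax_action(q_values: Dict[str, float]) -> str:
    """Greedy choice over a dictionary of Q-values."""
    return max(q_values, key=q_values.get)


def feudal_policy(b_m: Features, b_i: Features, b_slots: Dict[str, Features],
                  q_master: QFunction, q_indep: QFunction, q_slot: QFunction) -> str:
    """Select a summary action in two steps (master decision, then primitive action)."""
    # Step 1: the master policy chooses between the two action subsets.
    master_action = argmax_action(q_master(b_m))
    if master_action == "a_m_i":
        # Step 2a: slot independent sub-policy, e.g. "inform" -> "inform()".
        return f"{argmax_action(q_indep(b_i))}()"
    if not b_slots:
        raise ValueError("slot dependent actions need at least one slot")
    # Step 2b: run the shared slot sub-policy on every slot and keep the
    # action-slot pair with the highest Q-value.
    best: Optional[Tuple[float, str, str]] = None
    for slot, b_s in b_slots.items():
        for action, q in q_slot(b_s).items():
            if best is None or q > best[0]:
                best = (q, action, slot)
    _, action, slot = best
    # Join the primitive action and the slot, e.g. "request" + "food" -> "request(food)".
    return f"{action}({slot})"
```

Since `q_slot` is the same function for every slot, anything it learns about when to request or confirm is shared across slots; only the input features $b_s$ differ.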

Experimental setup
The models used in the experiments have been implemented using the PyDial toolkit and evaluated on the PyDial benchmarking environment. This environment presents a set of tasks which span different size domains, different Semantic Error Rates (SER), and different configurations of action masks and user model parameters (Standard (Std.) or Unfriendly (Unf.)). Table 1 shows a summarised description of the tasks. The models developed in this paper are compared to the state-of-the-art RL algorithms and to the handcrafted policy presented in the benchmarks.

DIP-DQN baseline
An implementation of DIP based on Deep-Q Networks (DQN) (Mnih et al., 2013) is used as an additional baseline (Papangelis and Stylianou, 2017). This policy, named DIP-DQN, uses the same hyperparameters as the DQN implementation released in the PyDial benchmarks. A DIP feature function based on the description in (Wang et al., 2015) is used.
The sub-policies of the feudal policy (FDQN) have the same hyperparameters as the baseline DQN implementation, except for the two hidden layer sizes, which are reduced to 130 and 50 respectively. As feature functions, subsets of the DIP features are used. The original set of summary actions of the benchmarking environment, $A$, has a size of $5 + 3|S|$, where $|S|$ is the number of slots. This set is divided into two subsets: $A_i$ of size 6 and $A_d$ of size 4. Each sub-policy (including $\pi_m$) is trained with the same sparse reward signal used in the baselines, getting a reward of 20 if the dialogue is successful or 0 otherwise, minus the dialogue length.
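The size of the flat summary action set and the sparse reward described above can be written out as a short Python sketch. The function names are hypothetical, and measuring the dialogue length in turns is an assumption of this illustration.

```python
def num_summary_actions(num_slots: int) -> int:
    """Size of the flat summary action set A: 5 + 3 * |S|."""
    return 5 + 3 * num_slots


def dialogue_reward(success: bool, num_turns: int) -> int:
    """Sparse reward: 20 for a successful dialogue, 0 otherwise, minus the dialogue length."""
    return (20 if success else 0) - num_turns
```

For a domain with 3 slots the flat action set has 14 actions, and a successful 6-turn dialogue receives a reward of 14, whereas a failed dialogue of the same length receives -6. The flat action set grows linearly with the number of slots, while the sizes of the two feudal subsets stay fixed.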

Results
The results in the 18 tasks of the benchmarking environment after 4000 training dialogues are presented in Table 2. The same evaluation procedure as in the benchmarks is used, presenting the mean over 10 different random seeds and testing every seed for 500 dialogues. The FDQN policy substantially outperforms every other policy in all the environments except Env. 1. The performance increase is larger in the two largest domains (SFR and LAP), with gains of up to 5 points in accumulated reward in the most challenging environments (e.g. Env. 4 LAP), compared to the best benchmarked RL policies (Bnch.). In addition, FDQN consistently outperforms the handcrafted policy (Hdc.) in environments 2 to 6, which traditional RL methods could not achieve.
In Env. 1, however, the results for FDQN and DIP-DQN are rather low, especially for DIP-DQN. Surprisingly, the results of these policies in Env. 2, which only differs from Env. 1 in the absence of action masks (and thus, in principle, is a more complex environment), are better than those of every other algorithm. Analysing the dialogues individually, we could observe that, in this environment, both policies are prone to "overfit" to an action. The performance of FDQN and DIP-DQN in Env. 4 is also better than in Env. 3, while the difference between these environments also lies in the masks. This suggests that a specific action mask design can be helpful for some algorithms, but can harm the performance of others. This is especially severe in the DIP-DQN case, which shows good performance in some challenging environments, but is more unstable and prone to overfit than FDQN.
However, the main purpose of action masks is to reduce the number of dialogues needed to train a policy. Observing the learning curves shown in Figure 2, the FDQN model can learn a near-optimal policy in large domains in about 1500 dialogues, even if no additional reward is used, making the action masks unnecessary.
Figure 2: Learning curves for Feudal-DQN and DIP-DQN in Env. 4, compared to the two best performing algorithms in the benchmarks (DQN and GP-Sarsa). The shaded area depicts the mean ± the standard deviation over ten random seeds.

Conclusions and future work
We have presented a novel dialogue management architecture, based on Feudal RL, which substantially outperforms the previous state of the art in several dialogue environments. By defining a set of slot dependent policies with shared parameters, the model is able to learn a general way to act in slots, increasing its scalability to large domains.
Unlike other HRL methods applied to dialogue, no additional reward signals are needed and the hierarchical structure can be derived from a flat ontology, substantially reducing the design effort.
A promising approach would be to substitute the handcrafted feature functions used in this work by neural feature extractors trained jointly with the policy. This would avoid the need to design the feature functions and could be potentially extended to other modules of the SDS, making text-to-action learning tractable. In addition, a single model can be potentially used in different domains (Papangelis and Stylianou, 2017), and different feudal architectures could make larger action spaces tractable (e.g. adding a third sub-policy to deal with actions dependent on 2 slots).

A Feudal Dialogue Policy algorithm

This section gives a detailed description of the DIP feature functions $\phi_{dip}(b, s) = \psi_0(b) \oplus \psi_j(b) \oplus \psi_d(b, s)$ used in this work. The differences with the features used in (Wang et al., 2015) and (Papangelis and Stylianou, 2017) are the following:
• No priority or importance features are used.
• No "potential contribution to DB search" features are used.
• The joint belief features $\psi_j(b)$ are extended to account for large-domain aspects.
Table 3: List of features composing the DIP features. The tag (bin) denotes that a binary encoding is used for this feature. Some of the joint features $\psi_j(b)$ are extracted from the joint belief $b_j$, computed as the Cartesian product of the beliefs of the individual slots. * denotes that these features exist in the original belief state $b$.