Feudal Dialogue Management with Jointly Learned Feature Extractors

Reinforcement learning (RL) is a promising approach to dialogue policy optimisation, but traditional RL algorithms fail to scale to large domains. Recently, Feudal Dialogue Management (FDM) has been shown to increase scalability to large domains by decomposing the dialogue management decision into two steps, using the domain ontology to abstract the dialogue state in each step. To abstract the state space, however, previous work on FDM relies on handcrafted feature functions. In this work, we show that these feature functions can be learned jointly with the policy model while obtaining similar performance, even outperforming the handcrafted features in several environments and domains.


Introduction
In task-oriented Spoken Dialogue Systems (SDS), the Dialogue Manager (DM) (or policy) is the module in charge of deciding the next action in each dialogue turn. One of the most popular approaches to model the DM is Reinforcement Learning (RL) (Sutton and Barto, 1999), which has been studied for several years (Levin et al., 1998; Williams and Young, 2007; Henderson et al., 2008; Pietquin et al., 2011). However, as the dialogue state space increases, the number of trajectories that need to be explored grows exponentially, making traditional RL methods unable to scale to large domains. Recently, Feudal Dialogue Management (FDM) (Casanueva et al., 2018) has been shown to increase scalability to large domains. This approach is based on Feudal RL (Dayan and Hinton, 1993), a hierarchical RL method that divides a task spatially rather than temporally, decomposing the decisions into several steps and using different levels of abstraction for each sub-decision. When applied to domains with large state and action spaces, FDM showed an impressive performance increase compared to traditional RL policies.

* Currently at PolyAI, inigo@poly-ai.com
However, the method presented in Casanueva et al. (2018), named FDQN, relied on handcrafted feature functions to abstract the state space. These functions, named Domain Independent Parametrisation (DIP) (Wang et al., 2015), transform the belief of each slot into a fixed-size representation using a large set of rules.
In this paper, we demonstrate that the feature functions needed to abstract the belief state in each sub-decision can be jointly learned with the policy. We introduce two methods to do so, based on feed-forward neural networks and recurrent neural networks respectively. We also introduce a modification of the original FDQN architecture which stabilises learning, avoiding overfitting of the policy to a single action. Policies with jointly learned feature functions achieve similar performance to those using handcrafted ones, with superior performance in several environments and domains.

Background
Dialogue management can be cast as a continuous MDP composed of a finite set of actions A, a continuous multivariate belief state space B and a reward function R(b_t, a_t). At a given time t, the agent observes the belief state b_t ∈ B, executes an action a_t ∈ A and receives a reward r_t drawn from R(b_t, a_t). The action taken, a, is decided by the policy, defined as the function π(b) = a. The objective of RL is to find the optimal policy π* that maximises the expected return R in each belief state, where R = Σ_{τ=t}^{T−1} γ^(τ−t) r_τ, γ is a discount factor, t is the current timestep and T is the terminal timestep.
There are two major approaches to model the policy: Policy-based and Value-based algorithms. In the former, the policy is directly parametrised by a function π(b; θ) = a, where θ are the parameters learned in order to maximise R. In the latter, the optimal policy can be found by greedily taking the action which maximises the Q-value, Q^π(b, a), defined as the expected return when starting from state b, taking action a, and then following policy π until the end of the dialogue at time step T:

Q^π(b, a) = E[ Σ_{τ=t}^{T−1} γ^(τ−t) r_τ | b_t = b, a_t = a, π ]    (1)
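As a toy illustration of these definitions (a sketch, not the paper's code), the discounted return and the greedy Value-based action choice can be written as:

```python
# Sketch: the discounted return R and greedy action selection over
# Q-values, with illustrative numbers.

def discounted_return(rewards, gamma=0.99):
    """R = sum over tau of gamma^(tau-t) * r_tau, computed from t = 0."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))

def greedy_action(q_values):
    """Value-based policy: pick the action maximising Q(b, a)."""
    return max(q_values, key=q_values.get)

# A failed 5-turn dialogue: per-turn penalty -1, no success bonus.
print(discounted_return([-1] * 5, gamma=1.0))           # -5.0
print(greedy_action({"request": 1.2, "confirm": 0.7}))  # request
```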

Feudal Dialogue Management
In FDM (Casanueva et al., 2018) (Fig. 1), the (summary) actions are divided into two subsets: slot independent actions A_i (e.g. hello(), inform()) and slot dependent actions A_d (e.g. request(), confirm()). In addition, a set of master actions A_m = (a_m^i, a_m^d) is defined, where a_m^i corresponds to taking an action from A_i and a_m^d to taking an action from A_d. The feudal dialogue policy, π(b) = a, decomposes the decision in each turn into two steps. In the first step, the policy decides to take either a slot independent or a slot dependent action. In the second step, the state of each sub-policy is abstracted to account for features related to that slot, and a primitive action is chosen from the previously selected subset. In order to abstract the dialogue state for each sub-policy, a feature function φ_s(b) = b_s is defined for each slot s ∈ S, as well as a slot independent feature function φ_i(b) = b_i and a master feature function φ_m(b) = b_m. Finally, a master policy π_m(b_m) = a_m, a slot independent policy π_i(b_i) = a_i and a slot dependent policy π_d(b_s ∀s ∈ S) = a_d are defined, where a_m ∈ A_m, a_i ∈ A_i and a_d ∈ A_d. In FDQN, π_m and π_i are modelled as Value-based policies. However, Policy-based models can also be used for π_m and π_i, as introduced in section 3.1. In order to generalise between slots, π_d is defined as a set of slot specific policies π_s(b_s) = a_d, one for each s ∈ S. The slot specific policies have shared parameters, and the differences between slots are accounted for by the abstracted dialogue state b_s. π_d runs each slot specific policy, π_s, for all s ∈ S, choosing the action-slot pair that maximises the Q-value over all the slot sub-policies. Then, the summary action a is constructed by joining a_d and s (e.g. if a_d = request() and s = food, the summary action will be request(food)). Pseudo-code of the feudal dialogue policy algorithm is given in Appendix B.
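The two-step decision described above can be sketched as follows (a minimal illustration, not the FDQN implementation; pi_master, pi_i and q_s are hypothetical stand-ins for the trained sub-policies):

```python
# Sketch of the feudal decision: step 1 picks a master action,
# step 2 picks a primitive action from the selected subset.

def feudal_decision(belief, slots, pi_master, pi_i, q_s):
    # Step 1: the master policy chooses between the slot independent
    # master action (a_m^i) and the slot dependent one (a_m^d).
    if pi_master(belief) == "a_m_i":
        # Step 2a: pick a primitive slot independent action from A_i.
        return pi_i(belief)
    # Step 2b: run the (parameter-shared) slot policy on every slot's
    # abstracted state and take the argmax over slot-action pairs.
    s, a, _ = max(((s, a, q) for s in slots
                   for a, q in q_s(belief, s).items()),
                  key=lambda x: x[2])
    return f"{a}({s})"  # e.g. request(food)

# Toy run with fixed stand-in sub-policies:
out = feudal_decision(
    belief={}, slots=["food", "area"],
    pi_master=lambda b: "a_m_d",
    pi_i=lambda b: "hello()",
    q_s=lambda b, s: {"request": 1.0 if s == "food" else 0.2,
                      "confirm": 0.1},
)
print(out)  # request(food)
```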
[Figure 1: The feudal dialogue policy. The master policy outputs a probability distribution over master actions, the slot independent policy a probability distribution over slot independent primitives, and the slot dependent policies take an argmax over slots and slot dependent primitives.]

In order to abstract the state space, FDQN uses handcrafted feature functions φ_i, φ_m and φ_s based on the Domain Independent Parametrisation (DIP) features introduced in Wang et al. (2015). These features include the slot independent parts of the belief state, a summarised representation of the joint belief state, and a summarised representation of the belief state of the slot s.

FDM with jointly learned feature extractors
In order to avoid the need to handcraft the feature functions φ_i, φ_m and φ_s, we propose two methods which jointly train the feature extractors and the policy model. FDQN, however, was shown to be prone to getting stuck in local optima. When the feature functions are jointly learned, this problem is exacerbated by the need to learn extra parameters. In section 3.1, two methods to avoid getting stuck in local optima are presented.
Improved training stability

FDQN was shown to be prone to getting stuck in local optima, overfitting to an incorrect action and continuously repeating it until the user runs out of patience. Appendix A shows an example of this problem. We propose two methods that, combined, help to reduce the overfitting, allowing the feature extractors to be learned jointly.
The belief state used in FDQN only contains information about the last system action. Therefore, if the system gets into a loop, repeating the same action every turn, the belief state cannot reflect it. We propose to append to the input of each sub-policy a vector containing the frequencies of the actions taken in the current dialogue. This additional information can be used by the policy to detect these "overfitting loops" and select a different action.
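A minimal sketch of this feature (the action set here is illustrative, not the benchmark's actual summary action set):

```python
# Sketch: normalised counts of the summary actions taken so far,
# appended to each sub-policy's input vector.
from collections import Counter

ACTIONS = ["hello", "inform", "request", "confirm"]  # illustrative set

def action_frequencies(history):
    counts = Counter(history)
    total = max(len(history), 1)
    return [counts[a] / total for a in ACTIONS]

def policy_input(belief_vector, history):
    # The frequency vector is concatenated to the abstracted belief.
    return belief_vector + action_frequencies(history)

# After three turns of repeating confirm, the loop becomes visible:
print(action_frequencies(["confirm", "confirm", "confirm"]))
# [0.0, 0.0, 0.0, 1.0]
```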
Furthermore, Policy-based Actor Critic methods such as ACER (Wang et al., 2016; Weisz et al., 2018) have been shown to be more stable during learning than Value-based methods. Since π_d has to compare Q-values, the slot specific policies π_s need to be Value-based. The master and slot independent policies, however, can be replaced by Actor Critic policies, as shown in Figure 1. Section 5 shows that this replacement enables the dialogue manager to learn better policies.

Jointly learned feature extractors
In order to abstract the state space into a slot-dependent fixed-length representation, FDQN uses DIP feature functions (Wang et al., 2015). These features, however, need to be hand-engineered by the system designer. To reduce the amount of hand-engineering, we propose two feature extraction models that can be learned jointly with the policy. Figure 2 shows the two proposed models. The first (a), named FFN in section 5, pads the belief state of each slot to the length of the largest slot and encodes it into a vector e_s through a feed-forward neural network. The second (b), named RNN, uses a recurrent neural network to encode the values of each slot into a fixed-length representation e_s. Each b_s, ∀s ∈ S, is then constructed by concatenating the slot independent parts of the belief to the slot encoding e_s. For the feature functions φ_i and φ_m, the slot independent parts of the belief are used directly as inputs to their respective policy models. During training, the errors of the policies are backpropagated through the feature extractors, training them by gradient descent.
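A pure-Python sketch of the FFN variant (a), with illustrative sizes and random untrained weights (the paper uses an encoding of size 25; everything here is a stand-in for the jointly trained extractor):

```python
# Sketch: pad each slot's belief to the longest slot's length, then a
# single hidden layer maps it to a fixed-size encoding e_s;
# b_s = slot independent features concatenated with e_s.
import math
import random

random.seed(0)
MAX_SLOT_LEN, ENC_SIZE = 6, 4  # illustrative sizes
W = [[random.uniform(-0.1, 0.1) for _ in range(MAX_SLOT_LEN)]
     for _ in range(ENC_SIZE)]

def ffn_encode(slot_belief):
    padded = slot_belief + [0.0] * (MAX_SLOT_LEN - len(slot_belief))
    # tanh hidden layer; during training these weights would be updated
    # by the policy errors backpropagated through the extractor.
    return [math.tanh(sum(w * x for w, x in zip(row, padded)))
            for row in W]

def abstract_state(slot_independent_feats, slot_belief):
    return slot_independent_feats + ffn_encode(slot_belief)

b_s = abstract_state([0.9, 0.1], [0.7, 0.2, 0.1])  # short slot, padded
print(len(b_s))  # 6  (2 slot independent features + ENC_SIZE)
```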

Baselines
The feudal dialogue policy presented in (Casanueva et al., 2018) is used as a baseline, named FDQN in section 5. An implementation of FDQN using the action frequency features introduced in section 3.1 is also presented, named FDQN+AF. In addition, the results of a handcrafted policy, named HDC, are also shown.

Feudal ACER policy
The feudal policy proposed in section 3.1, named FACER, is also implemented. This policy uses an ACER policy (Wang et al., 2016) for the slot independent and master policies, and a DQN policy (Mnih et al., 2013) for the slot specific policies. The hyperparameters of the ACER sub-policies are the same as in (Weisz et al., 2018), except for the two hidden layer sizes, which are reduced to 100 and 50 respectively. The hyperparameters of the DQN sub-policies are the same as in FDQN.

Jointly learned feature extractors
The FDQN+AF and FACER policies are trained using the FFN and RNN feature extractors proposed in section 3.2, as well as with the DIP features used in (Casanueva et al., 2018). For each slot s ∈ S, b_s is constructed by concatenating the general and the joint belief state (sorted and truncated to size 20) to the encoding of the slot e_s generated by the feature extractor. The size of e_s is 25. As input for the π_m and π_i policies, the general and joint belief state is used.

[Table 2: Average reward after 4000 training dialogues in the 18 tasks of the PyDial benchmarks (Env. 1-6; CR, SFR and LAP domains), for each policy and feature extractor combination.]

Table 2 shows the average reward after 4000 training dialogues in the 18 tasks of the PyDial benchmarks (because of space constraints the success rate is not included; it is strongly correlated with the reward results). The reward for each dialogue is defined as (suc * 20) − n, where n is the dialogue length and suc = 1 if the dialogue was successful and 0 otherwise. The results are the mean over 10 different random seeds, where every seed is tested for 500 dialogues. Comparing FDQN and FDQN+AF when using DIP features shows the importance of including the action frequencies: these features improve the reward in most of the tasks by between 0.5 and 2 points. When training the policies with the joint feature extractors, the action frequencies were found to be a key feature in preventing the policies from getting stuck in local optima.
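The reward definition above can be made concrete (a trivial sketch of the benchmark's scoring rule, not PyDial code):

```python
# Sketch: the per-dialogue reward (suc * 20) - n, where n is the
# dialogue length and suc indicates success.

def dialogue_reward(success, n_turns):
    return (20 if success else 0) - n_turns

print(dialogue_reward(True, 7))    # 13  (successful 7-turn dialogue)
print(dialogue_reward(False, 12))  # -12 (failed 12-turn dialogue)
```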

Results
FACER shows the best performance with the jointly learned feature extractors, outperforming every other policy (including the ones using DIP features) in 8 out of 18 tasks, and obtaining very similar performance in the rest. This shows the improved training stability provided by the Policy-based models. In task 1, however (where FDQN already showed overfitting problems), FDQN+AF is able to learn better feature extractors than FACER, but the performance is still worse than HDC. Figure 3 shows the learning curves for FACER in two domains of Env. 3 using the two learned feature extractors (FFN and RNN) compared to the DIP features. The learned features take longer to converge, but the difference is smaller than might be expected, especially in a large domain such as Laptops.

Conclusions and future work
This paper has shown that the feature functions needed to abstract the dialogue state space in feudal dialogue management can be jointly learned with the policy, thus reducing the need to handcraft them. In order to make it possible to learn the features jointly, two methods to increase the robustness of the model against overfitting were introduced: extending the input features with action frequencies, and substituting the master and slot independent policies with ACER policies. In combination, these modifications were shown to improve the results in most of the PyDial benchmarking tasks by an average of 1 point in reward, while reducing the handcrafting effort.
However, as the original FDQN architecture needs to model the slot specific policies as Value-based models, ACER policies could only be used for the master and slot independent policies. Future work will investigate new FDM architectures which allow the use of Policy-based models as slot specific policies, while maintaining the parameter sharing mechanism between slots.

A Dialogues getting stuck in local optima
In this section we present an example of a policy model getting stuck in a sub-optimal policy. The two dialogues below show a dialogue observed in the initial training steps of the policy and a dialogue observed once the policy has overfitted.
Initial dialogue goal: food=british, area=centre. In the initial dialogue, the policy interacts with a collaborative user (the user parameters are sampled at the beginning of each dialogue), who in line 3 provides more information than requested by the policy. The dialogue ends successfully and, therefore, the policy learns that by confirming the slot food in that dialogue state it will get enough information to end the dialogue successfully. In the second dialogue, however, the system interacts with a less collaborative user. Therefore, when confirming the slot food in line 3, it does not get the extra information obtained in the previous dialogue. The policy keeps insisting with this action until the user runs out of patience and ends the dialogue. Even with ε-greedy exploration, as a fraction of the sampled users will be collaborative enough to make this policy successful, the policy can get stuck in this local optimum and never learn a better policy, i.e. requesting the value of the slot area. Other examples of overfitting include policies informing entities at random from the first turn (since some users will correct the policy by informing the correct values) or policies that do not learn to inform about the requested slots (since the sampled user goal sometimes does not include requesting any extra information, just the entity name).