Hyper-parameter Optimisation of Gaussian Process Reinforcement Learning for Statistical Dialogue Management

Gaussian process reinforcement learning provides an appealing framework for training the dialogue policy as it takes into account correlations of the objective function given different dialogue belief states, which can significantly speed up the learning. These correlations are modelled by the kernel function, which may depend on hyper-parameters. So far, for real-world dialogue systems the hyper-parameters have been hand-tuned, relying on the designer to adjust the correlations, or simple non-parametrised kernel functions have been used instead. Here, we examine different kernel structures and show that it is possible to optimise the hyper-parameters from data, yielding improved performance of the resulting dialogue policy. We confirm this in a real user trial.


Introduction
Spoken dialogue systems enable human-computer interaction via speech. The dialogue management component has two aims: to maintain the dialogue state based on the current spoken language understanding input and the conversation history, and to choose a response according to its dialogue policy. To provide robustness to input errors, a number of statistical approaches have been proposed that track a distribution over all dialogue states at every dialogue turn, called the belief state (Young et al., 2013; Henderson et al., 2014; Sun et al., 2014). The system response is then based on the belief state, rather than an inaccurate estimate of the most likely dialogue state.
The state-of-the-art statistical methods for policy learning are based on reinforcement learning (RL) (Young et al., 2013), which makes it possible to learn from interaction with users. However, most RL methods require too many dialogues for policy training. In Gaussian process reinforcement learning (GPRL), the kernel function defines prior correlations of the objective function given different belief states, which can significantly speed up policy optimisation (Gašić and Young, 2014). Alternative methods include Kalman temporal difference (KTD) reinforcement learning (Pietquin et al., 2011). Typically, statistical approaches to dialogue management rely on compressing the belief state space into a summary space, where policy learning can be performed tractably (Williams and Young, 2007; Pinault et al., 2009; Crook and Lemon, 2011). GPRL allows learning to be performed directly on the full belief state. However, only non-parametrised kernel functions have been considered for this purpose (Gašić and Young, 2014).
Here we address the important problem of how to define the structure of the kernel function for a real-world dialogue task and how to learn the hyper-parameters from data for a policy that operates on the full belief state. Using only a small dataset for hyper-parameter optimisation, we show that the policy with the optimised kernel function outperforms both the policy with hand-specified kernel parameters and the one with a standard non-parametrised kernel function. This is particularly beneficial for policy training with real users. This paper is organised as follows. In section 2, we briefly review GP-Sarsa and hyper-parameter optimisation. Section 3 introduces the kernel functions examined here. The experimental results are presented in section 4, followed by conclusions and future work directions in section 5.

GP-Sarsa and hyper-parameter optimisation
The expected cumulative reward given belief state b and action a is defined by the Q-function:

Q(b, a) = E[ Σ_{τ=t+1}^{T} γ^{τ−t−1} r_τ | b_t = b, a_t = a ],

where r_τ is the immediate reward at the τ-th dialogue turn, T is the number of dialogue turns and γ ∈ [0, 1] is a discount factor. GP-Sarsa is an online RL algorithm that models the Q-function as a Gaussian process (Engel et al., 2005). It makes the learning tractable by utilising the kernel span sparsification algorithm, which constructs a set of representative belief state-action pairs called the dictionary. The computational complexity of GP-Sarsa is O(Tm^2), where m is the size of the dictionary and T is the total number of turns over all interactions.
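The kernel span sparsification test that builds the dictionary can be sketched as follows. This is a minimal illustration of the approximate-linear-dependence test of Engel et al., not the authors' implementation; the threshold nu and the unit-hyper-parameter Gaussian kernel are our illustrative choices.

```python
import numpy as np

# Gaussian kernel with unit hyper-parameters (illustrative choice).
k = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)

def update_dictionary(dictionary, x, nu=0.1):
    """Kernel span sparsification: add the new point x only if its
    projection onto the span of the current dictionary points leaves
    a residual larger than the threshold nu."""
    if not dictionary:
        return dictionary + [x]
    K = np.array([[k(d1, d2) for d2 in dictionary] for d1 in dictionary])
    k_x = np.array([k(d, x) for d in dictionary])
    a = np.linalg.solve(K, k_x)      # projection coefficients
    residual = k(x, x) - k_x @ a     # how much of x is left unexplained
    if residual > nu:                # sufficiently novel -> grow dictionary
        return dictionary + [x]
    return dictionary
```

Because the dictionary only grows when a belief state is sufficiently novel, m stays much smaller than T, which is what keeps the O(Tm^2) cost manageable.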
In the case of Gaussian process regression, the kernel function parameters can be estimated by evidence maximisation in such a way that they capture the correlations that occur in the data (Rasmussen and Williams, 2005). This approach has been extended to GP-Sarsa; however, its benefits have so far only been shown on a toy dialogue problem (Gašić et al., 2010).
Using a data corpus of belief state-action pairs and rewards, the hyper-parameters can be found by minimising the negative log marginal likelihood via a conjugate gradient method (Rasmussen and Nickisch, 2010). The computational complexity of the gradient calculation is O(nT^3), where n is the number of hyper-parameters and T is the total number of dialogue turns in the corpus (see appendix A).
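The optimisation can be sketched for plain GP regression with a Gaussian kernel as follows. This is a simplified stand-in (the GP-Sarsa-specific covariance and the analytic gradients used in the paper are omitted; the toy data and noise level are our assumptions), using SciPy's conjugate-gradient minimiser.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_hyp, X, y, noise=0.1):
    """Negative log marginal likelihood of GP regression with a Gaussian
    kernel; hyper-parameters are optimised in log space to stay positive."""
    p, l = np.exp(log_hyp)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = p**2 * np.exp(-sq / (2 * l**2)) + noise**2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (0.5 * y @ alpha + np.sum(np.log(np.diag(L)))
            + 0.5 * len(X) * np.log(2 * np.pi))

# Toy data standing in for belief-state features and observed returns.
rng = np.random.default_rng(0)
X = rng.random((20, 3))
y = np.sin(X.sum(axis=1))

res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0]),
               args=(X, y), method="CG")
p_opt, l_opt = np.exp(res.x)
```

In the paper's setting the same minimisation is applied to the GP-Sarsa likelihood, where each gradient step costs O(nT^3).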

Kernel functions
In Gaussian process regression, the kernel function k(·, ·) must be positive definite (Rasmussen and Williams, 2005). Kernel functions have some useful closure properties (Duvenaud et al., 2013). If k_1 and k_2 are kernels, then:
• k_1 + k_2 is a kernel. Adding two kernels can be thought of as an OR-like operation: the resulting kernel has a high value if either of the two base kernels has a high value.
• k_1 · k_2 is a kernel. Multiplying two kernels can be thought of as an AND-like operation: the function values are only expected to be similar if both base kernels have a high value.
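These closure properties can be checked numerically: the Gram matrices of the sum and the elementwise (Schur) product of two valid kernels have no negative eigenvalues. A small sketch with illustrative random inputs:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((10, 2))

def gram(kernel, X):
    """Gram matrix of a kernel over a set of points."""
    return np.array([[kernel(a, b) for b in X] for a in X])

k1 = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)  # Gaussian kernel
k2 = lambda a, b: float(a @ b)                         # linear kernel

K1, K2 = gram(k1, X), gram(k2, X)

# Sum and Schur product of Gram matrices are again valid Gram
# matrices: all eigenvalues are non-negative (up to rounding).
for K in (K1 + K2, K1 * K2):
    assert np.linalg.eigvalsh(K).min() > -1e-10
```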

Standard kernel functions
There are many valid kernel functions (Rasmussen and Williams, 2005). The following three are the basic kernels used in our experiments.
• Gaussian kernel: k(x_i, x_j) = p^2 exp(−‖x_i − x_j‖^2 / (2l^2)), where p and l are the hyper-parameters. If the distance between x_i and x_j is much larger than the lengthscale l, the outputs are effectively uncorrelated. The output variance p^2 determines the average distance of the function from its mean.
• Linear kernel: k(x_i, x_j) = ⟨x_i, x_j⟩, which has no hyper-parameters.
• δ-kernel: k(x_i, x_j) = δ(x_i, x_j), which is 1 when x_i = x_j and 0 otherwise, and has no hyper-parameters.
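The three basic kernels can be written compactly as follows (a sketch; the function names are ours):

```python
import numpy as np

def gaussian_kernel(x, y, p=1.0, l=1.0):
    """k(x, y) = p^2 exp(-||x - y||^2 / (2 l^2)); outputs decorrelate
    once ||x - y|| exceeds the lengthscale l."""
    return p**2 * np.exp(-np.sum((x - y) ** 2) / (2 * l**2))

def linear_kernel(x, y):
    """k(x, y) = <x, y>; no hyper-parameters."""
    return float(x @ y)

def delta_kernel(a, b):
    """k(a, b) = 1 if a == b else 0 (suited to discrete summary actions)."""
    return 1.0 if a == b else 0.0
```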

Kernels for dialogue management
The kernel for two belief-action pairs is normally decomposed as the product of separate kernels over belief states and actions, k B (·, ·) and k A (·, ·).
The elements of the dialogue state are concepts that occur in the dialogue and are described by the underlying ontology. For instance, in a restaurant information domain, these usually include goal-food, goal-area, history-food, history-area and so on. The belief tracker maintains a probability distribution for each of them. We first define a kernel function on each concept, k_{B_i}(·, ·), then combine them to form the kernel function for the whole belief state, k_B(·, ·), as illustrated in Figure 1.
The Gaussian kernel and the linear kernel were used as basic kernel functions for each concept in the belief state. In the case of Gaussian kernels, if they share the same hyper-parameters across different concepts, we refer to them as concept independent; otherwise, they are concept dependent. While the linear kernel has been used for problems involving distributions (Jebara et al., 2004), the Gaussian kernel is a more natural choice here, as the Q-function over belief states is a smooth non-linear function. Additionally, we investigated two combination methods: summing the kernels over all concepts (the sum kernel) and multiplying the kernels over all concepts (the product kernel).
The action kernel is defined on summary actions. Given that the total number of summary actions is usually small (e.g. 20), the δ-kernel is chosen as the action kernel.
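Putting the pieces together, the belief-action kernel is the per-concept combination (sum or product) multiplied by the δ-kernel over summary actions. A minimal sketch, in which the two-concept belief state, the concept names and the hyper-parameter values are all illustrative:

```python
import numpy as np

def gauss(x, y, p=1.0, l=1.0):
    """Gaussian kernel on a single concept's belief distribution."""
    return p**2 * np.exp(-np.sum((x - y) ** 2) / (2 * l**2))

def belief_kernel(b1, b2, hyp, mode="sum"):
    """Combine per-concept Gaussian kernels into a kernel over the full
    belief state. b1, b2 map concept -> probability vector; hyp maps
    concept -> (p, l), i.e. concept dependent hyper-parameters."""
    vals = [gauss(b1[c], b2[c], *hyp[c]) for c in b1]
    return sum(vals) if mode == "sum" else np.prod(vals)

def ba_kernel(b1, a1, b2, a2, hyp, mode="sum"):
    """Belief-action kernel: belief kernel times the delta action kernel."""
    return belief_kernel(b1, b2, hyp, mode) * (1.0 if a1 == a2 else 0.0)

# Illustrative two-concept belief state (names are made up).
b1 = {"goal-food": np.array([0.7, 0.3]), "goal-area": np.array([0.5, 0.5])}
b2 = {"goal-food": np.array([0.7, 0.3]), "goal-area": np.array([0.5, 0.5])}
hyp = {"goal-food": (1.0, 0.5), "goal-area": (1.0, 0.5)}
```

With identical belief states and actions, the sum kernel returns the number of concepts times the per-concept maximum, while the product kernel returns their product; with differing actions the δ-kernel zeroes the whole expression.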

Experiments and results
The training data for hyper-parameter optimisation were generated using an agenda-based user simulator (Schatzmann et al., 2007). Dialogue policy training and evaluation were performed both with the user simulator and with human users. The system operates in the TopTable dialogue domain, which consists of about 150 restaurants in Cambridge, UK (TopTable, 2012). Each restaurant has 9 slots, e.g. food, area, phone, address and so on. The state decomposes into 21 concepts, each taking from 4 to 150 values. Each value is given a belief in [0, 1] by the BUDS state tracker. The summary action space consists of 20 summary actions.

Experimental procedure
The hyper-parameters are optimised as follows.
1. Training data generation: We used a random policy to generate simulated dialogues, out of which a small number of successful dialogues were used as training data.¹

2. Interval estimation: We estimated appropriate intervals for the concept independent hyper-parameters according to the properties of the Gaussian kernel described in section 3.1. In our experiments, we only restricted the range of the lengthscale l. For the sum Gaussian kernel, the belief state for each concept is a probability distribution, so the lengthscale l lies in the interval (0, √2]. For the product Gaussian kernel, the product of Gaussian kernels is still a Gaussian kernel, so the lengthscale l should be less than the maximum distance between two whole belief states (5.29 in the TopTable domain).

¹ We used 73 dialogues for concept independent and 147 dialogues for concept dependent hyper-parameter optimisation, with 505 and 1004 dialogue turns respectively. We found that the smaller data set was not sufficient to capture the correlations for the concept dependent kernels.

3. Concept independent hyper-parameter optimisation: We sampled initial concept independent hyper-parameters from the estimated intervals and then minimised the negative log likelihood. We repeated this N times, and the hyper-parameters with the overall smallest negative log likelihood were chosen as the final concept independent hyper-parameters.
4. Concept dependent hyper-parameter optimisation: We initialised the hyper-parameters with the concept independent values, then minimised the negative log likelihood to obtain the concept dependent optimised hyper-parameters.
After the hyper-parameters were obtained, we trained and evaluated the policies with the optimised kernels. For comparison, policies with hand-tuned Gaussian kernel hyper-parameters and with the linear kernel were also trained and evaluated.
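Steps 3 and 4 above amount to a multi-restart local optimisation. A sketch of that loop, where the quadratic objective is only a stand-in for the actual negative log likelihood and the restart count is an assumption:

```python
import numpy as np
from scipy.optimize import minimize

def optimise_with_restarts(neg_log_lik, interval, n_restarts=10, seed=0):
    """Sample initial hyper-parameters uniformly from the estimated
    interval, run a local minimiser from each, and keep the solution
    with the smallest objective value."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        x0 = rng.uniform(interval[0], interval[1], size=2)
        res = minimize(neg_log_lik, x0, method="CG")
        if best is None or res.fun < best.fun:
            best = res
    return best.x, best.fun

# Stand-in objective with minimum at (1, 2); the real objective is the
# negative log marginal likelihood of section 2.
obj = lambda x: (x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2
x_best, f_best = optimise_with_restarts(obj, (0.0, 3.0))
```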

The results on the user simulator
During training, intermediate policies were recorded every 1000 dialogues. Each policy was then evaluated on 1000 test dialogues. The reward was 20 for a successful dialogue, minus the number of dialogue turns.
We compared four different sum and product kernels (Table 1) and the results are given in Figure 2. The results in Figure 2(a) show that the policies with optimised sum Gaussian kernels perform significantly better than the policy using hand-tuned hyper-parameters (GHSI) and the one with the linear kernel (LS). Also, in the later learning stages, the policy with the concept dependent kernel (GLSD) appears to reach a better performance than the one with the concept independent kernel (GLSI).
The policies using the product kernels follow similar trends, except that the concept dependent product kernel (GLPD) performs significantly better than the other kernels at the initial stages of training (Figure 2(b)).² The best performing product (GLPD) and sum (GLSD) kernels converge to similar performance, with the product kernel performing better in the early stages of training, at the expense of a larger dictionary.

Human experiments
In order to further evaluate the effect of the optimised kernels, policies were trained using crowdsourcing via the Amazon Mechanical Turk service in a set-up similar to (Jurčíček et al., 2011; Su et al., 2015). At the end of each dialogue, a recurrent neural network (RNN) model was used to predict dialogue success, which served as the reinforcement feedback (Su et al., 2015).
The GLSD kernel and the GLPD kernel were selected for on-line policy training and compared to the LS kernel. Figure 3 shows the learning curve of the reward during training, demonstrating the advantage of the Gaussian kernels over the simple linear kernel.

² The result for the product linear kernel (LP) is not reported due to poor performance. In the TopTable domain, two belief states are typically very different in the history concepts, so the linear kernel for these concepts is often close to 0. This results in a very small overall kernel value, which leads to slow convergence.
To confirm this result, all optimised policies were evaluated after 500 training dialogues. The results are given in Table 2. It can be seen that the policies with optimised kernels, especially the GLPD kernel, perform much better than the policy with the linear kernel.

Conclusions and Future work
This paper has investigated the problem of kernel structure and hyper-parameter optimisation in Gaussian process reinforcement learning for statistical dialogue management. We have demonstrated that the optimised kernels yield significant improvements in policy performance both when training with a simulated user and with real users. The work in this paper has focused on optimising the kernel function for the belief state space off-line. Future work will consider joint optimisation of the hyper-parameters and the policy. This will rely on finding a less computationally expensive method for hyper-parameter optimisation, also allowing more complex action kernels to be investigated.