Knowledge transfer between speakers for personalised dialogue management

Model-free reinforcement learning has been shown to be a promising data-driven approach for automatic dialogue policy optimization, but a relatively large number of dialogue interactions is needed before the system reaches reasonable performance. Recently, Gaussian process based reinforcement learning methods have been shown to reduce the number of dialogues needed to reach optimal performance, and pre-training the policy with data gathered from different dialogue systems has further reduced this amount. Following this idea, a dialogue system designed for a single speaker can be initialised with data from other speakers, but if the dynamics of the speakers are very different the model will perform poorly. When data gathered from different speakers is available, selecting the data from the most similar ones might improve the performance. We propose a method which automatically selects the data to transfer by defining a similarity measure between speakers, and uses this measure to weight the influence of the data from each speaker in the policy model. The methods are tested by simulating users with different severities of dysarthria interacting with a voice enabled environmental control system.


Introduction
Partially observable Markov decision processes (POMDPs) are a popular framework to model dialogue management as a reinforcement learning (RL) problem. In a POMDP, a state tracker (Thomson and Young, 2010; Williams, 2014) maintains a distribution over possible user goals (states), called the belief state, and RL methods (Sutton and Barto, 1998) are used to optimize a metric called cumulative reward, a score that combines dialogue success rate and dialogue length. However, existing model-based RL approaches become intractable for real world sized dialogue systems (Williams and Young, 2007), and model-free approaches often need a large number of dialogues to converge to the optimal policy (Jurčíček et al., 2012).
Recently, Gaussian process (GP) based RL (Engel et al., 2005) has been proposed for dialogue policy optimization, reducing the number of interactions needed to converge to the optimal policy by an order of magnitude with respect to other POMDP models, and allowing the policy to be learned directly from real user interactions. In addition, using transfer learning methods (Taylor and Stone, 2009) to initialise the policy with data gathered from dialogue systems in different domains has further increased the learning speed of the policy, and provided an acceptable system performance when there is no domain specific data available. In the case of dialogue managers personalised for a single speaker, data gathered from other "source" speakers can be used to pre-train the policy, but if the dynamics of the other speakers are very different, this data will have a different distribution than the data of the current "target" speaker, and therefore using it to train the policy model provides no benefit. In the context of speaker specific acoustic models for users with dysarthria (a speech impairment), it has been demonstrated that using a speaker similarity metric to select the data to train the acoustic models improves ASR performance. Bringing this idea into dialogue management, if a similarity metric is defined between different speakers, this metric can be used to select which data from the source speakers is used to train the model, and even to weight the influence of the data from each speaker in the model. As GP-RL is a non-parametric method, a straightforward way to transfer knowledge is to directly initialise the GP model for the target speaker using data from source speakers, and update the GP with data from the target speaker as it is gathered through interaction. But GP-RL soon becomes intractable as the amount of data increases, limiting the amount of data that can be transferred.
One proposed approach transfers knowledge between domains by using the source data to train a prior GP, whose posterior is used as the prior mean in the new GP. Another option is to use a GP approximation method which permits data selection (Quiñonero and Rasmussen, 2005): the speaker similarity metric is used to select the source data to initialise the policy, and source data points are then discarded as data points from the target speaker become available, keeping the number of data points below a maximum.
This paper investigates knowledge transfer between speakers in the context of a spoken environmental control system personalised for speakers with dysarthria, where the ASR is adapted as speaker specific data is gathered (Christensen et al., 2012), thus improving the ASR performance with usage. The paper is organised as follows: Section 2 gives the background of GP-RL and defines the methods to select and weight the transferred data. Section 3 presents the experimental setup of the environmental control system and the different dysarthric simulated users, as well as the different features used to define the speaker similarities. In Section 4 the results of the experiments are presented and explained, and Section 5 concludes the paper.

GPs for reinforcement learning
The objective of a POMDP based dialogue manager is to find the policy \pi(b) = a that maximizes the expected cumulative reward c_i, defined as the sum of immediate rewards from time step i until the dialogue is finished, where a \in A is the action taken by the manager, and the belief state b is a probability distribution over a discrete set of states S. The Q-function defines the expected cumulative reward when the dialogue is in belief state b_i and action a_i is taken, following policy \pi:

Q(b_i, a_i) = E\left[\sum_{k=i}^{N} \gamma^{k-i} r_k\right]    (1)

where N is the time step at which the terminal action is taken (end of the dialogue), r_i is the immediate reward given by the reward function, and 0 \le \gamma \le 1 is the discount factor, which weights future rewards. If c_i is considered to be a random variable, it can be modelled as a mean plus a residual, c_i = Q(b_i, a_i) + \Delta Q(b_i, a_i). Then the immediate reward r_i can be written recursively as the temporal difference (TD) between Q at time steps i and i+1:

r_i = Q(b_i, a_i) - \gamma_i Q(b_{i+1}, a_{i+1}) + \Delta Q(b_i, a_i) - \gamma_i \Delta Q(b_{i+1}, a_{i+1})    (2)

where \gamma_i = 0 if a_i is a terminal action, and the discount factor \gamma otherwise. Given a set of observed belief-action points (b_i, a_i), with their respective r_i values, this set of linear equations can be represented in matrix form as:

r_{t-1} = H_t q_t + H_t \Delta q_t    (3)

where q_t = [Q(b_1, a_1), ..., Q(b_t, a_t)]^T, r_{t-1} = [r_1, r_2, ..., r_{t-1}]^T, and H_t is the (t-1) \times t band matrix whose i-th row is [0, ..., 0, 1, -\gamma_i, 0, ..., 0]. If the random variables q_t are assumed to have a joint Gaussian distribution with zero mean and \Delta Q(b_i, a_i) \sim N(0, \sigma^2), the system can be modelled as a GP (Rasmussen and Williams, 2005), with the covariance matrix determined by a kernel function defined independently over the belief and the action spaces (Engel et al., 2005):

k((b_i, a_i), (b_j, a_j)) = k_B(b_i, b_j) k_A(a_i, a_j)    (4)

To simplify the notation, from now on x_i = (b_i, a_i) will denote each belief-action point, and K_{Y,Y'} the matrix of size |Y| \times |Y'| whose elements are computed by the kernel function (eq. 4) between any sets of points Y and Y'.
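The TD system above can be sketched numerically. The following minimal example (with assumed shapes and hypothetical Q values; `build_H` is not a name from the paper) builds the band matrix H_t and recovers the immediate rewards it implies:

```python
import numpy as np

def build_H(gammas):
    """Build the (t-1) x t band matrix H_t; gammas[i] is gamma_i for
    time step i (0 when a_i is a terminal action)."""
    t_minus_1 = len(gammas)
    H = np.zeros((t_minus_1, t_minus_1 + 1))
    for i, g in enumerate(gammas):
        H[i, i] = 1.0       # coefficient of Q(b_i, a_i)
        H[i, i + 1] = -g    # coefficient of Q(b_{i+1}, a_{i+1})
    return H

gamma = 0.95
# Three time steps, the third action being terminal (gamma_3 = 0):
H = build_H([gamma, gamma, 0.0])
q = np.array([10.0, 11.0, 12.0, 0.0])   # hypothetical Q values at each step
r = H @ q                                # rewards implied by the TD relation
```

For the terminal step the reward simply equals the Q value, since the successor term is multiplied by zero.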
For a new belief-action point x_* = (b_*, a_*), the posterior of the expected cumulative reward can be computed as:

\bar{Q}(x_*) = K_{*,X_t} H_t^T (H_t K_{X_t,X_t} H_t^T + \Sigma_t)^{-1} r_{t-1}
\hat{Q}(x_*) = K_{*,*} - K_{*,X_t} H_t^T (H_t K_{X_t,X_t} H_t^T + \Sigma_t)^{-1} H_t K_{X_t,*}    (5)

where X_t is the set of size t of all the previously visited (b_i, a_i) points, * denotes the set of size 1 composed of the new belief-action point to be evaluated, and \Sigma_t = \sigma^2 H_t H_t^T. \bar{Q} and \hat{Q} represent the mean and the variance of Q respectively. To further simplify the notation, it is possible to redefine eq. 5 by defining a kernel in the temporal difference space instead of in the belief-action space. If the set of belief-action points X_t is redefined as the set Z_t, where z_i = (b_i, a_i, b_{i+1}, a_{i+1}), with b_{i+1} and a_{i+1} set to any default values if a_i is a terminal action, a kernel function between two temporal difference points can be defined as:

k^{td}(z_i, z_j) = k_{i,j} - \gamma_j k_{i,j+1} - \gamma_i k_{i+1,j} + \gamma_i \gamma_j k_{i+1,j+1}    (6)

where k_{i,j} is the kernel function in the belief-action space (eq. 4), and \gamma_i = 0 and \gamma_j = 0 if a_i and a_j are terminal actions respectively, or the discount factor \gamma otherwise (as in eq. 2). When a_i is a terminal action, the values of a_{i+1} and b_{i+1} in z_i are irrelevant, as they will be multiplied by \gamma_i = 0. In the same way, when this kernel is used to compute the covariance vector between a new test point and the set Z_t, as the new point x_* = (b_*, a_*) lies in the belief-action space, it is redefined as z_* = (b_*, a_*, b_{*+1}, a_{*+1}) with b_{*+1} and a_{*+1} set to default values. Then, a_* is considered a terminal action, so b_{*+1} and a_{*+1} will not affect the value of k^{td}_{i,*} because \gamma_* = 0. A more detailed derivation of the temporal difference kernel is given in Appendix A. Using the temporal difference kernel defined in eq. 6, eq. 5 can be rewritten as:

\bar{Q}(x_*) = K^{td}_{*,Z_t} (K^{td}_{Z_t,Z_t} + \Sigma_t)^{-1} r_{t-1}
\hat{Q}(x_*) = K^{td}_{*,*} - K^{td}_{*,Z_t} (K^{td}_{Z_t,Z_t} + \Sigma_t)^{-1} K^{td}_{Z_t,*}    (7)

where K^{td}_{Y,Y'} is the covariance matrix computed with the temporal difference kernel between any sets of TD points Y and Y'. With this notation, the equation for the posterior of Q has the same shape as in classic GP regression models. Thus, it is straightforward to apply a wide range of well studied GP techniques, such as sparse methods.
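Eq. 6 can be sketched directly in code. The RBF base kernel (with the belief-space hyperparameters of section 3.3) and the helper names below are assumptions, not the paper's implementation:

```python
import numpy as np

def rbf(x, y, var=25.0, ls2=0.5):
    """Assumed belief-space RBF kernel, hyperparameters from section 3.3."""
    return var * np.exp(-np.sum((x - y) ** 2) / (2.0 * ls2))

def k_td(z_i, z_j, g_i, g_j, k=rbf):
    """Eq. 6: z = (x_i, x_{i+1}) pairs of consecutive belief-action points;
    g_i, g_j stand for gamma_i, gamma_j (0 for terminal actions)."""
    x_i, x_i1 = z_i
    x_j, x_j1 = z_j
    return (k(x_i, x_j) - g_j * k(x_i, x_j1)
            - g_i * k(x_i1, x_j) + g_i * g_j * k(x_i1, x_j1))

a = np.array([0.2, 0.8])
b = np.array([0.5, 0.5])
z1, z2 = (a, b), (b, a)
# When gamma = 0 (terminal action) the successor points drop out entirely:
terminal = k_td(z1, z1, 0.0, 0.0)
```

As the text notes, for terminal points the kernel value reduces to the plain belief-action kernel, and the kernel is symmetric in its two TD arguments.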
Redefining the belief-action set of points X_t as the set of temporal difference points Z_t (note that |Z_t| = |X_t| - 1) also simplifies the selection of data points (e.g. the selection of inducing points in sparse models), because the dependency between consecutive points is well defined.
The GP literature proposes various sparse methods which select a subset of inducing points U of size m < t from the set of training points Z (Quiñonero and Rasmussen, 2005). In this paper the deterministic training conditional (DTC) method is used. Once the subset of points has been selected, and under the assumptions of (Engel et al., 2003), the GP posterior can be approximated in O(t \cdot m^2) with the DTC method as:

\bar{Q}(x_*) = K^{td}_{*,U} A^{-1} K^{td}_{U,Z_t} \Sigma_t^{-1} r_{t-1}
\hat{Q}(x_*) = K^{td}_{*,*} - K^{td}_{*,U} ((K^{td}_{U,U})^{-1} - A^{-1}) K^{td}_{U,*}    (8)

where A = K^{td}_{U,U} + K^{td}_{U,Z_t} \Sigma_t^{-1} K^{td}_{Z_t,U}. Once the posterior for any new belief-action point can be computed with eq. 7 or eq. 8, the policy \pi(b) = a can be computed as the action a that maximizes the Q-function from the current belief state b_*, but in order to avoid getting stuck in a local optimum, an exploration-exploitation approach should be taken. One of the advantages of GPs is that they compute the uncertainty of the expected cumulative reward in the form of a variance, which can be used as a metric for active exploration (Geist and Pietquin, 2011) to speed up the learning of the policy with an \epsilon-greedy approach:

\pi(b_*) = argmax_a \bar{Q}(b_*, a) with probability 1 - \epsilon
\pi(b_*) = argmax_a \hat{Q}(b_*, a) with probability \epsilon    (9)

where \epsilon controls the exploration rate. The policy optimization loop is performed following the Episodic GP-Sarsa algorithm defined by Gašić and Young (2014).
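The DTC posterior and the ε-greedy rule of eq. 9 can be sketched numerically. The kernel, noise model and data below are illustrative assumptions; the sanity check relies on the standard property that DTC with all training points used as inducing points reduces to the exact GP posterior:

```python
import numpy as np

def rbf_gram(A, B, var=1.0, ls2=0.25):
    """Assumed RBF Gram matrix between two point sets."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-d2 / (2 * ls2))

def dtc_posterior(K_uu, K_uz, Sigma_t, r, K_su, k_ss):
    """DTC mean and variance at test points, computable in O(t * m^2)."""
    Si = np.linalg.inv(Sigma_t)
    A = K_uu + K_uz @ Si @ K_uz.T
    Ai = np.linalg.inv(A)
    mean = K_su @ Ai @ K_uz @ Si @ r
    # diag of K_su (K_uu^{-1} - A^{-1}) K_su^T via einsum:
    var = k_ss - np.einsum('ij,jk,ik->i', K_su, np.linalg.inv(K_uu) - Ai, K_su)
    return mean, var

def epsilon_greedy(q_mean, q_var, eps, rng):
    """Eq. 9: exploit the posterior mean, or explore the most uncertain action."""
    if rng.random() < eps:
        return int(np.argmax(q_var))
    return int(np.argmax(q_mean))

rng = np.random.default_rng(0)
Z = rng.random((6, 2))                    # stand-in TD points
Xs = rng.random((3, 2))                   # test belief-action points
K = rbf_gram(Z, Z) + 1e-8 * np.eye(6)     # jitter for numerical stability
Sigma = 0.5 * np.eye(6)                   # stand-in for Sigma_t
r = rng.standard_normal(6)
# Sanity check: with U = Z (every point inducing), DTC equals the exact GP.
mean, var = dtc_posterior(K, K, Sigma, r, rbf_gram(Xs, Z), np.ones(3))
exact = rbf_gram(Xs, Z) @ np.linalg.inv(K + Sigma) @ r
```

In practice m ≪ t, which is what makes the O(t·m²) cost attractive compared with the cubic cost of the full GP.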

Transfer learning with GP-RL
The scenario where a statistical model for a specific "target" task must be trained, but only data from different but related "source" tasks is available, is known as transfer learning (Pan and Yang, 2010). In the context of this paper the different tasks are dialogues with different speakers, and three aspects of transfer learning will be addressed:
• How to transfer the knowledge,
• In the case of multiple source speakers, which data to transfer, and
• How to weight the data from different sources.
In the context of reinforcement learning (Taylor and Stone, 2009) and dialogue policy optimization, transfer learning has been shown to increase the performance of the system in the initial stages of use and to speed up the policy learning, requiring a smaller amount of target data to reach the optimal policy.

Knowledge transfer
The most straightforward way to transfer the data in GP-RL is to initialise the set of temporal difference points Z_t of the GP with the source points, and then continue updating it with target data points as they are gathered through interaction. However, this approach has a few shortcomings. First, as GP-RL's complexity increases with the number of data points, the model might quickly become intractable if it is initialised with too many source points. Also, once data points from the target speaker are gathered through interaction, the source points may no longer improve the performance of the system, while still increasing the model complexity. Second, as the computation of the variance for a new point depends on the number of close points already visited, the variance of new belief-action points will be reduced by the effect of source points close in the belief-action space. If the distribution of the source data points is unbalanced, the effectiveness of the policy of eq. 9 will be affected. A second approach proposes to use the source points to train a prior GP, and use its posterior as the mean function for a GP trained with the target points. With this approach, the mean of the posterior in eq. 7 is modified as:

\bar{Q}(x_*) = m(z_*) + K^{td}_{*,Z_t} (K^{td}_{Z_t,Z_t} + \Sigma_t)^{-1} (r_{t-1} - m_t)    (10)

where m(z_*) is the mean of the posterior of the Q-function given by the prior GP and m_t = [m(z_1), ..., m(z_{t-1})]^T. If the DTC approach (eq. 8) is taken, the posterior Q-function mean becomes:

\bar{Q}(x_*) = m(z_*) + K^{td}_{*,U} A^{-1} K^{td}_{U,Z_t} \Sigma_t^{-1} (r_{t-1} - m_t)    (11)

This approach has the advantage of being computationally cheaper than the former method while modelling the uncertainty for new target points more accurately, but at the cost of not taking into account the correlation between source and target points, which might reduce the performance when there is only a small amount of target data.
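A minimal numerical sketch of eq. 10 (the matrix values are made up): when the target rewards agree exactly with the prior mean, the residual term vanishes and the posterior mean at any test point reduces to the prior mean, which is the intended behaviour before target data contradicts the source-trained prior.

```python
import numpy as np

def posterior_mean_with_prior(m_star, K_sz, K_zz, Sigma_t, r, m_t):
    """Eq. 10: GP regression on the residual r - m_t, prior mean added back."""
    return m_star + K_sz @ np.linalg.inv(K_zz + Sigma_t) @ (r - m_t)

K_zz = np.array([[1.0, 0.3], [0.3, 1.0]])   # made-up TD Gram matrix
K_sz = np.array([[0.5, 0.2]])               # made-up test-point cross covariances
Sigma = 0.1 * np.eye(2)
m_t = np.array([1.0, 2.0])                  # prior GP mean at the target TD points
# Target rewards that agree with the prior leave the prior mean untouched:
q = posterior_mean_with_prior(3.0, K_sz, K_zz, Sigma, m_t.copy(), m_t)
```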
A third approach combines the two previous methods, using a portion of the transferred points to train a GP for the prior mean function, while the rest is used to initialise the set Z_t of the GP that will be updated with target points. This method is computationally cheaper than the first one, while improving on the performance of the second method when only a small amount of target data is available.

Transfer data selection
As GPs are non-parametric models, their complexity increases with the number of data points, limiting the amount of source data that can be transferred. Additionally, if the points come from multiple sources, it is possible that the data distribution of some sources is more similar to the target speaker's than that of others; hence, transferring data from these sources will increase performance. We propose to extract a speaker feature vector s from each speaker and define a similarity function f(s, s') between speakers (see sec. 3.4). The data can then be selected by choosing the points from the source speakers most similar to the target.
With the DTC approach (eq. 8), a subset of inducing points U_m must be selected. The most straightforward way is to select, from the transferred points, those belonging to the speakers most similar to the target. As the user interacts with the system and target data points are gathered, these points may be used as inducing points. This acts like another layer of data selection; the reduced complexity will allow for the transfer of more source points, while using the target points as inducing points means that only the source points that lie in the same part of the belief-action space as the target points have influence on the model.
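The inducing-set bookkeeping described above can be sketched with a bounded queue: the set starts with the source points most similar to the target (least similar at the front), and each incoming target point evicts the front point. The function names and the similarity-as-score convention are assumptions:

```python
from collections import deque

def init_inducing(source_points, similarity, m):
    """Keep the m source points most similar to the target, least similar first."""
    ranked = sorted(source_points, key=similarity)
    return deque(ranked[-m:], maxlen=m)

def add_target_point(inducing, point):
    """Append a target point; maxlen evicts the front point (the least
    similar remaining source point, or the oldest target point)."""
    inducing.append(point)
    return inducing

# Toy example: five source points with made-up similarity scores, m = 3.
sims = {'p1': 0.1, 'p2': 0.9, 'p3': 0.5, 'p4': 0.7, 'p5': 0.3}
inducing = init_inducing(list(sims), sims.get, m=3)
snapshot = list(inducing)            # ['p3', 'p4', 'p2']
add_target_point(inducing, 't1')     # evicts 'p3', the least similar
```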

Transfer data weighting
When transferring data from multiple sources, the similarity between each source and the target speaker might be different. Thus, the data from a source more similar to the target should have more influence on the model than data from less similar ones. As a GP is defined by computing covariances between data points through a kernel function, one way to weight the data from different sources is to extend the belief-action vector used to compute the covariance with the speaker feature vector s explained in the previous section, as x_i = (b_i, a_i, s_i), and then extend the kernel (eq. 4) by multiplying it by a new kernel k_s in the speaker space:

k((b_i, a_i, s_i), (b_j, a_j, s_j)) = k_B(b_i, b_j) k_A(a_i, a_j) k_s(s_i, s_j)    (12)

By adding this extra space to the data points, the covariance between points will not only depend on the similarity between points in the belief-action space, but also on their similarity in the speaker space, reducing the covariance between two points that lie in different parts of the speaker space. This approach also helps to partially deal with the variance computation problem of the first model in sec. 2.1.1, as the source points will lie in a different part of the speaker space than the new target points, thus having less influence on the variance computation.
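A sketch of the extended kernel of eq. 12, composing the belief, action and speaker kernels. The hyperparameter values follow sections 3.3 and 3.4 (ASR-accuracy speaker features); the helper names and the example vectors are assumptions:

```python
import numpy as np

def rbf(x, y, var, ls2):
    return var * np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * ls2))

def delta(a, b):
    return 1.0 if a == b else 0.0

def k_extended(x_i, x_j):
    """Eq. 12: product of belief (eq. 13), action (eq. 14) and speaker kernels."""
    b_i, a_i, s_i = x_i
    b_j, a_j, s_j = x_j
    return (rbf(b_i, b_j, var=25.0, ls2=0.5)    # belief space
            * delta(a_i, a_j)                    # action space
            * rbf(s_i, s_j, var=1.0, ls2=4.0))   # speaker space

b = np.array([0.9, 0.1])
# Identical belief-action points covary less across dissimilar speakers:
same = k_extended((b, 'ask', np.zeros(3)), (b, 'ask', np.zeros(3)))
cross = k_extended((b, 'ask', np.zeros(3)), (b, 'ask', np.full(3, 2.0)))
diff_action = k_extended((b, 'ask', np.zeros(3)), (b, 'one', np.zeros(3)))
```

The last line illustrates the delta kernel: points with different actions have zero covariance regardless of speaker similarity.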

Experimental setup
To test the system in a scenario with high variability between the dynamics of the speakers, the experiments are performed within the context of a voice-enabled control system designed to help speakers with dysarthria to interact with their home devices (TV, radio, lamps...), where the speakers have different severities of dysarthria (this is an instance of the homeService application). The system has a vocabulary of 36 commands and is organised in a tree setup where each node in the tree represents either a device (e.g. "TV"), a property of that device (e.g. "channel"), or an action that triggers some change in one of the devices (e.g. "one", child of "channel", will change the TV to channel one). When the system transitions to one of the terminal nodes that trigger an action, the action associated with this node is performed, and subsequently the system returns to the root node. In the following experiments a dialogue will be considered finished when one of the terminal node actions is carried out. In the non-terminal nodes, the user may either speak one of the commands available in that node (defined by its children nodes) to transition to them, or say the meta-command "back" to return to its parent node. The ASR is configured to recognise single words, so there is no need for a language understanding system, as the concepts are just a direct mapping from the ASR output. A more detailed explanation of the system is given in previous work, and two example dialogues are presented in Appendix B.
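The tree navigation described above can be sketched with a toy device tree; the nodes shown are a small illustrative subset of the 36-command vocabulary, not the actual homeService tree:

```python
# Toy subset of the command tree: devices, properties, and terminal actions.
TREE = {
    'root':    {'children': ['TV', 'radio'], 'terminal': False},
    'TV':      {'children': ['channel'],     'terminal': False},
    'radio':   {'children': ['on'],          'terminal': False},
    'channel': {'children': ['one', 'two'],  'terminal': False},
    'one':     {'children': [],              'terminal': True},
    'two':     {'children': [],              'terminal': True},
    'on':      {'children': [],              'terminal': True},
}
PARENT = {c: p for p, n in TREE.items() for c in n['children']}

def step(node, command):
    """Transition on a recognised command; terminal nodes fire their action
    and return to the root, which ends the dialogue."""
    if command == 'back':
        return PARENT.get(node, 'root'), False
    if command in TREE[node]['children']:
        if TREE[command]['terminal']:
            return 'root', True      # action fired, dialogue finished
        return command, False
    return node, False               # unrecognised command: stay put

# "Change the TV to channel one":
node, done = 'root', False
for cmd in ['TV', 'channel', 'one']:
    node, done = step(node, cmd)
```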

Simulated dysarthric users
In the homeService application, each system is personalised for a single speaker by adapting the ASR system's acoustic model as more data is gathered through interaction, thus increasing the accuracy of the ASR over time. In the following experiments, the system is tested by interacting with a set of simulated users with dysarthria, where each user interacts with a set of different ASR simulators, arising from the different amounts of data used to adapt the ASR. To train the ASR simulator for these users, data from a dysarthric speech database (the UASpeech database (Kim et al., 2008)) has been used. Table 1 shows the characteristics of the 15 speakers of the database, and the ASR accuracy for each speaker on the 36 word vocabulary of the system, both without adaptation and adapted with 500 words from that speaker. Additionally, an intelligibility assessment is presented for each speaker as the percentage of words spoken by that speaker which are understood by unfamiliar listeners; these are shown in the second column of table 1.
The system is tested with 6 different simulated users trained with data from low and medium intelligibility speakers. Each user interacts with 4 different ASRs, adapted with 0, 150, 300 and 500 words respectively. For a more detailed explanation of the simulated user configuration, the reader may refer to previous work.

POMDP setup
Each non-terminal node in the tree is modelled as an independent POMDP, where the state set S is the set of possible goals of the node and the action set A is the set of actions associated with each goal plus an "ask" action, which requests the user to repeat their last command. The reward function for all the POMDPs is -1 for the "ask" action, and +10 for any other action if it corresponds to the user goal, or -10 otherwise, with γ = 0.95. The state tracker is a logistic regression classifier (Pedregosa et al., 2011), where the classes are the set of states S. The belief state b is computed as the posterior over the states given the last 5 observations (N-best lists with normalised confidence scores). For each speaker, the state tracker has been trained with data from the other 14 speakers.
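The per-node reward function can be written down directly. The belief computation below is a deliberately simplified stand-in (a plain average of normalised confidence scores over the observation window) for the trained logistic-regression tracker the paper actually uses; it is included only to illustrate the interface:

```python
import numpy as np

GAMMA = 0.95

def reward(action, goal):
    """-1 for asking the user to repeat; +10/-10 for device actions."""
    if action == 'ask':
        return -1
    return 10 if action == goal else -10

def belief(nbest_window, states):
    """Simplified belief: nbest_window holds up to 5 dicts mapping
    state -> normalised confidence score; the real system uses a
    logistic-regression tracker over these observations."""
    scores = np.array([[obs.get(s, 0.0) for s in states] for obs in nbest_window])
    b = scores.mean(axis=0)
    return b / b.sum()

b = belief([{'one': 0.8, 'two': 0.2}, {'one': 0.6, 'two': 0.4}], ['one', 'two'])
```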

Policy models
The DTC approach (eq. 8) is used to compute the Q-function for the policy (eq. 9), with Gaussian noise variance \sigma^2 = 5. The kernel over the belief space is a radial basis function (RBF) kernel:

k_B(b, b') = \sigma_k^2 \exp(-||b - b'||^2 / (2 l_k^2))    (13)

with variance \sigma_k^2 = 25 and lengthscale l_k^2 = 0.5. The delta kernel is used over the action space:

k_A(a, a') = \delta(a, a')    (14)

and the kernels over the speaker space are defined in section 3.4. The size of the inducing set U_m is 500 and the maximum size of the TD point set Z_t is 2000. Whenever a new data point is observed from the target speaker, it is added to the set of inducing points U_m, and the first point of the set U_m (which, due to the ordering done by data selection, corresponds to the least similar source point or to the oldest target point) is discarded from the inducing set. Whenever a new data point is observed and the size of the set of temporal difference points |Z_t| = 2000, the first point of this set is discarded. Three variations of the DTC approach are used:
• DTC: Equation 8 is used to compute the Q posterior for the policy (eq. 9) and the set of temporal difference points Z_t is initialised with the source points.
• Prior: Equation 11 is used to compute the Q posterior for the policy (eq. 9) and the prior GP is trained with the source points.
• Hybrid: Equation 11 is used to compute the Q posterior for the policy (eq. 9), the prior GP is trained with half of the source points and the set of temporal difference points Z t is initialised with the other half.

Speaker similarities
To compute the similarities between speakers, a vector of speaker features s must be extracted. Different kinds of features may be used, such as meta-data based features, acoustic features, features related to the ASR performance, etc. In this paper, we explore three different methods to extract s:
• Intelligibility assessment: The intelligibility assessment for each speaker in the UASpeech database (table 1) can be used as a single dimensional feature.
• I-vectors: Martínez et al. (2013) showed that i-vectors (Dehak et al., 2011) can be used to predict the intelligibility of a dysarthric speaker. For each speaker, s is defined as a 400 dimensional vector corresponding to the mean i-vector extracted from each utterance from that speaker. For more information on the i-vector extraction and characteristics, refer to (Martínez et al., 2014).
• ASR accuracy: The performance statistics of the ASR (e.g. accuracy) can be used as speaker features. In this paper we use the accuracy per word (command), defining s as a 36 dimensional vector where each element is the ASR accuracy for one of the 36 commands.
The kernel over the speaker space, k_s (eq. 12), is defined as an RBF kernel (eq. 13). This kernel is used both to compute the similarity between speakers in order to select data (section 2.1.2), and to weight the data from each source speaker (section 2.1.3). k_s has variance \sigma_k^2 = 1 and a lengthscale l_k^2 that varies depending on the features: for intelligibility features l_k^2 = 0.5, for i-vectors l_k^2 = 8.0 and for ASR accuracy features l_k^2 = 4.0.
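The speaker-space kernel k_s with its feature-dependent lengthscales can be sketched as follows; the feature vectors in the example are made up:

```python
import numpy as np

# Lengthscales l_k^2 per feature type, from section 3.4.
LENGTHSCALE2 = {'intelligibility': 0.5, 'ivector': 8.0, 'asr_accuracy': 4.0}

def speaker_similarity(s1, s2, feature_type):
    """RBF kernel k_s with unit variance and a feature-dependent lengthscale."""
    ls2 = LENGTHSCALE2[feature_type]
    d2 = np.sum((np.asarray(s1, float) - np.asarray(s2, float)) ** 2)
    return float(np.exp(-d2 / (2 * ls2)))

# Identical speakers have similarity 1; similarity decays with feature distance:
sim_same = speaker_similarity([0.6], [0.6], 'intelligibility')
sim_far = speaker_similarity([0.6], [0.1], 'intelligibility')
```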

Results
In the following experiments the reward is computed as -1 for each dialogue turn, plus +20 if the dialogue was successful. The system has been tested with the 24 speaker-ASR pairs explained in section 3.1, and in the following figures, each plotted line is the average result for these 24 speaker-ASR pairs. As the behaviour of the simulated user and some data selection methods partially depend on random variables, each experiment has been initialised with four different seeds, and all the results presented are the average of the four seeds tested over 500 dialogues. In all the experiments the data to initialise each POMDP is transferred from a pool of 4200 points, corresponding to 300 points from each speaker in table 1 except the speaker being tested, where each data pool is different for each seed. Figure 1 compares the different policy models presented in section 3.3, using the intelligibility measure based similarity to select and weight the data. The dotted line named DTC-conv shows the performance of the DTC policy when trained until convergence with the target speaker by simulating 1200 sub-dialogues in each node. DTC-1000 and DTC-2000 show the performance of the basic DTC approach when 1000 and 2000 source points are transferred respectively. It can be observed that transferring more points boosts the performance, but at the cost of increased complexity. pri-1000 and pri-2000 show the performance of the prior policy with 1000 and 2000 transfer points respectively. The success rate is above that of the DTC policy, but the learning rate for the reward is slower. This might be because the small amount of target data points makes the predictions of the Q-function given by the GP unreliable. hyb-1000 and hyb-2000 show the performance of the hybrid model, which exhibits the best success rate after 100 dialogues, with hyb-2000 even outperforming DTC-2000 in reward after 400 dialogues.
In figure 2 the different approaches to compute the speaker similarities for data selection and weighting presented in section 3.4 are compared, using the DTC model with 1000 transfer points (named DTC-1000 in the previous figure). DTC-int uses the intelligibility measure based features, DTC-iv the i-vector features and DTC-acc the ASR accuracy based features. DTC-iv outperforms the other two, followed closely by DTC-acc. The performance of DTC-int is well below the other two metrics, suggesting that the information given by intelligibility assessments is a weak feature for source speaker selection (as it is assessed by humans, it might be very noisy). As DTC-acc uses information about the ASR statistics (which are the input to the dialogue manager), it might be expected to outperform the rest, but in this case a purely acoustic measure such as DTC-iv works better. The reason might be that i-vector features are not directly correlated with ASR performance, so they capture hidden variables that better organise the data. To investigate the usefulness of similarity based data selection, two different data selection methods which do not weight the transferred data have been tried. DTC-randspk chooses the ordering of the speakers from whom the data is transferred at random, and performs much worse than the similarity based method. DTC-allspk, in contrast, selects the 1000 source points at random from the pool of 4200 points gathered from all speakers; as can be seen, the reward obtained by this method is slightly better than with DTC-iv, even if the success rate is lower. This suggests that transferring points from more speakers, rather than from just the closest ones, is a better strategy, probably because points selected by this method are distributed more uniformly over the belief-action space. A method that trades off covering the belief-action space against selecting the most similar points could be a better option.
To further investigate the effect of selection and weighting of the data, figure 3 plots the results for the DTC policy model using the i-vector based similarity to weight the data but different data selection methods. iv-clo selects the closest speakers with respect to the i-vector metric, iv-randspk orders the speakers at random, and iv-allspk selects the 1000 transfer points from all the speakers but the tested one. As in the previous figure, selecting speakers by similarity works better than selecting speakers at random, but selecting the points from all the speakers and weighting them with the i-vector metric outperforms all the previous methods. This might be because weighting the data performs a kind of data selection, as the data points from source speakers closer to the target will have more influence than the further ones, while transferring points from all the speakers covers a bigger part of the belief-action space. acc-allspk and allspk-uw show the results of weighting the data with the ASR accuracy metric and not weighting the data respectively, when selecting the data from all speakers. The accuracy metric performs worse than the i-vector metric once again, but it still outperforms not weighting the data, suggesting that data weighting works for different metrics. Finally, iv-allspk-hyb plots the performance of the hybrid model when selecting the data from all the speakers and weighting it with the i-vector based similarity. Even though it is computationally cheaper, it outperforms iv-allspk after 100 dialogues, suggesting that with a good similarity metric and data selection method, the hybrid model in section 3.3 is the best option.

Conclusions
When transferring knowledge between speakers in a GP-RL based policy, weighting the data by using a similarity metric between speakers, and to a lesser extent, selecting the data using this similarity, improves the performance of the dialogue manager. By defining a kernel between temporal difference points and interpreting the Q-function as a GP regression problem where the data points lie in the TD space, sparse methods such as DTC, which allow the selection of the subset of inducing points, can be applied. In a transfer learning scenario, DTC permits a larger number of data points to be transferred, and the selection of points collected from the target speaker as inducing points.
[Figure 3: comparison of different transfer data selection methods]

We showed that using part of the transferred data to train a prior GP for the mean function, and the rest to initialise the set of points of the GP, improves the performance over either approach alone. Transferring data points from a larger number of speakers outperformed selecting the data points only from the most similar ones, probably because the belief-action space is covered better. This suggests that more complex data selection algorithms, which trade off selecting data points by similarity against covering the belief-action space more uniformly, should be used. Also, increasing the amount of data transferred increased the performance, but the complexity increase of GP-RL limits the amount of data that can be transferred. More computationally efficient ways to transfer the data could be studied.
Of the three metrics based on speaker features tested (speaker intelligibility, i-vectors and ASR accuracy), i-vectors outperformed the rest. This suggests that i-vectors are a potentially good feature for speaker specific dialogue management and could be used in other tasks such as state tracking. ASR accuracy based metrics also outperformed the intelligibility based one, and as ASR accuracy and i-vectors are uncorrelated features, a combination of them could give further improvement.
Finally, as the models were tested with simulated users in a hierarchically structured dialogue system (following the structure of the homeService application), future work directions include evaluating the policy models in a mixed initiative dialogue system and testing them with real users.