Learning Domain-Independent Dialogue Policies via Ontology Parameterisation

This paper introduces a novel approach to eliminate the domain dependence of dialogue state and action representations, such that dialogue policies trained on the proposed representation can be transferred across different domains. The experimental results show that a policy optimised in a restaurant search domain using our domain-independent representations can be deployed to a laptop sale domain, achieving a task success rate very close (96.4% relative) to that of the policy optimised on in-domain dialogues.


Introduction
Statistical approaches to Spoken Dialogue Systems (SDS), particularly Partially Observable Markov Decision Processes (POMDPs) (Young et al., 2013), have demonstrated great success in improving the robustness of dialogue policies to error-prone Automatic Speech Recognition (ASR). However, building statistical SDS (SSDS) for different application domains is time consuming. Traditionally, each component of such an SSDS needs to be trained on domain-specific data, which are not always easy to obtain. Moreover, in many cases, a basic (e.g. rule-based) working SDS must be built before the data collection procedure can start, and developing this initial system for a new domain requires a significant amount of human expertise.
In this paper, we introduce a simple but effective approach to eliminate domain dependence of dialogue policies, by exploring the nature and commonalities of the underlying tasks of SDS in different domains, and parameterising different slots defined in the domain ontologies into a common feature space according to their relations and potential contributions to the underlying tasks. After the parameterisation, the resulting policy can be applied to different domains that realise the same abstract task (see §3.3 for examples).
* ZW's present address is Baidu Inc., Beijing, China.
Existing works on domain extension/transfer for SDS include domain-independent intermediate semantic extractors for Spoken Language Understanding (SLU) (Li et al., 2014), domain-general rules (Wang and Lemon, 2013; Sun et al., 2014) and delexicalised deep classifiers (Henderson et al., 2014; Mrkšić et al., 2015) for dialogue state tracking, and domain-extensible/transferable statistical dialogue policies (Lemon et al., 2006; Gašić et al., 2013). Compared to the closely related methods of Gašić et al. and Lemon et al., which manually tie slots in different domains, our approach provides a more flexible way to parametrically measure the similarity between different domain ontologies and directly addresses the nature of the underlying tasks. To ease access to the proposed technique (§3), we start with a brief review of POMDP-SDS in §2. Promising experimental results are achieved with both simulated users and human subjects, as shown in §4, followed by conclusions (§5).

POMDP-SDS: A Brief Overview
A POMDP is a powerful tool for modelling sequential decision making problems under uncertainty, by optimising the policy to maximise long-term cumulative rewards. Concretely, at each turn of a dialogue, a typical POMDP-SDS parses an observed ASR n-best list with confidence scores into semantic representations (again with associated confidence scores), and estimates a distribution over (unobservable) user goals, called a belief state. After this, the dialogue policy selects a semantic-level system action, which will be realised by Natural Language Generation (NLG) before synthesising the speech response to the user.
The semantic representations in SDS normally consist of two parts, a communication function (e.g. inform, deny, confirm, etc.) and (optionally) a list of slot-value pairs (e.g. food=pizza, area=centre, etc.). The prior knowledge defining the slot-values in a particular domain is called the domain ontology.
Dialogue policy optimisation can be solved via Reinforcement Learning (RL), where the aim is to estimate a quantity $Q(b, a)$, for each $b$ and $a$, reflecting the expected cumulative rewards of the system executing action $a$ at belief state $b$, such that the optimal action $a^*$ can be determined for a given $b$ according to $a^* = \arg\max_a Q(b, a)$. Due to the exponentially large state-action space an SDS can incur, function approximation is necessary, where it is assumed that $Q(b, a) \approx f_\theta(\phi(b, a))$. Here $\theta$ denotes the model parameter to be learnt, and $\phi(\cdot)$ is a feature function that maps $(b, a)$ to a feature vector. To compute $Q(b, a)$, one can either use a low-dimensional summary belief (Williams and Young, 2005) or the full belief itself if kernel methods are applied (Gašić et al., 2012). But in both cases, the action $a$ will be a summary action (see §3 for more details) to achieve tractable computations.
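The greedy action selection described above can be sketched as follows. This is only an illustration: the linear form of the approximator, the toy feature function `toy_phi` and the weights `theta` are assumptions made for the example, not the model used in this paper (which relies on GP-SARSA).

```python
from typing import Callable, Dict, List, Sequence

Belief = Dict[str, float]  # value -> probability for a single slot (toy belief)


def q_value(theta: Sequence[float], features: Sequence[float]) -> float:
    """Assumed linear form of f_theta: the dot product theta . phi(b, a)."""
    return sum(t * x for t, x in zip(theta, features))


def greedy_action(
    theta: Sequence[float],
    belief: Belief,
    actions: Sequence[str],
    phi: Callable[[Belief, str], List[float]],
) -> str:
    """Return a* = argmax_a Q(b, a) over the candidate summary actions."""
    return max(actions, key=lambda a: q_value(theta, phi(belief, a)))


def toy_phi(belief: Belief, action: str) -> List[float]:
    """Hypothetical features: 'confirm' scores high when the top hypothesis is strong."""
    top = max(belief.values())
    return [top, 0.0] if action == "confirm" else [0.0, 1.0 - top]


theta = [1.0, 1.0]  # hypothetical learnt weights
confident = {"pizza": 0.9, "noodle": 0.1}
uncertain = {"pizza": 0.4, "noodle": 0.3, "thai": 0.3}
best_when_confident = greedy_action(theta, confident, ["confirm", "select"], toy_phi)
best_when_uncertain = greedy_action(theta, uncertain, ["confirm", "select"], toy_phi)
```

With these toy weights the policy confirms when the belief is peaked and falls back to `select` when it is flat; any other approximator can be substituted for `q_value` without changing `greedy_action`.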

Domain-Independent Featurisation
For the convenience of further discussion, we first take a closer look at how summary actions can be derived from their corresponding master actions. Assume that, according to its communication function, a system action $a$ can take one of the following forms: $a()$ (e.g. reqmore()), $a(s)$, $a(s = v)$, $a(s = v_1, s = v_2)$ (e.g. select(food=noodle, food=pizza)), and $a(s_1 = v_1, \ldots, s_n = v_n)$ (e.g. offer(name="Chop Chop", food=Chinese)), where $a$ stands for the communication function, and $s_*$ and $v_*$ denote slots and values respectively. Usually it is unnecessary for the system to address a hypothesis less believable than the top hypothesis in the belief (or the top two hypotheses in the 'select' case). Therefore, by abstracting the actual values, the system actions can be represented as $a(s = b_s^{\text{top}})$, $a(s = b_s^{\text{top}}, s = b_s^{\text{second}})$ and $a(b_{joint}^{\text{top}})$, where $b_s$ denotes the marginal belief with respect to slot $s$, $b_{joint}$ stands for the joint belief, and $b_*^{\text{top}}$ and $b_*^{\text{second}}$ denote the top and second hypotheses of a given $b_*$, respectively. After this, summary actions can be defined as $a_s$ (for actions depending on $s$) and $a$ (for those having no operands or only taking joint hypotheses as operands, i.e. independent of any particular slot). Furthermore, one can uniquely map such summary actions back to their master actions, by substituting the respective top (and second if necessary) hypotheses in the belief into the corresponding slots.
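The mapping from a summary action back to its master action can be illustrated with a small sketch. The function name `to_master`, the string rendering of actions and the nested-dictionary belief are assumptions made for this example; actions over the joint belief (such as offer) are left out for brevity.

```python
from typing import Dict, List, Optional

MarginalBeliefs = Dict[str, Dict[str, float]]  # slot -> value -> probability


def ranked_values(marginal: Dict[str, float]) -> List[str]:
    """Values of one slot, most believable first."""
    return sorted(marginal, key=marginal.get, reverse=True)


def to_master(function: str, slot: Optional[str], beliefs: MarginalBeliefs) -> str:
    """Fill a summary action with the top (and, for 'select', second) hypothesis."""
    if slot is None:  # slot-independent action without operands, e.g. reqmore()
        return f"{function}()"
    values = ranked_values(beliefs[slot])
    if function == "select":  # needs the top two hypotheses of the same slot
        return f"{function}({slot}={values[0]}, {slot}={values[1]})"
    return f"{function}({slot}={values[0]})"


beliefs = {"food": {"pizza": 0.6, "noodle": 0.3, "thai": 0.1}}
confirm_action = to_master("confirm", "food", beliefs)
select_action = to_master("select", "food", beliefs)
reqmore_action = to_master("reqmore", None, beliefs)
```

Because the values are read from the belief at execution time, the policy itself only ever chooses among value-free summary actions.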
Based on the above definition, we can re-write the master action $a$ as $a_s$, where $s$ denotes the slot that $a$ depends on when summarised. Here, $s$ is fully derived from $a$ and can be null (when the summary action of $a$ is just $a$). Recalling the RL problem, conventionally, $\phi$ can be expressed as $\phi(b, a_s) = \delta(a_s) \otimes \psi(b)$, where $\delta$ is the Kronecker delta, $\otimes$ is the tensor product, and generally speaking, $\psi(\cdot)$ featurises the belief state, which can yield a summary belief in particular cases.
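To see why the conventional featurisation is domain-dependent, the tensor product of a Kronecker delta over summary actions with the belief features can be written out as a block vector: the belief features are copied into the block of the chosen action and every other block is zero. The sketch below is illustrative only; the action names and the function `conventional_phi` are hypothetical.

```python
from typing import List, Sequence


def conventional_phi(
    psi_b: Sequence[float], action: str, all_actions: Sequence[str]
) -> List[float]:
    """delta(a_s) (x) psi(b): one block of belief features per summary action."""
    features: List[float] = []
    for candidate in all_actions:
        scale = 1.0 if candidate == action else 0.0  # Kronecker delta
        features.extend(scale * x for x in psi_b)
    return features


# Slot-specific summary actions of a hypothetical two-slot domain.
restaurant_actions = ["confirm_food", "confirm_area", "reqmore"]
phi_vector = conventional_phi([0.9, 0.2], "confirm_area", restaurant_actions)
```

The length of the vector grows with the number of slot-specific summary actions, so a parameter vector learnt for one ontology has no natural counterpart in another.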

"Focus-aware" belief summarisation
Without loss of generality, one can assume that the communication functions $a$ are domain-independent. However, since the slots $s$ are domain-specific (defined by the ontology), both $a_s$ and $b$ will be domain-dependent.
Making $\psi(b)$ domain-independent can be trivial. A commonly used representation of $b$ consists of a set of individual belief vectors, denoted as $\{b_{joint}, b_\bullet\} \cup \{b_s\}_{s \in S}$, where $b_\bullet$ stands for the section of $b$ independent of any slots (e.g. the belief over communication methods, such as "by constraint", "by name", etc. (Thomson and Young, 2010)) and $S$ stands for the set of informable (see Appendix A) slots defined in the domain ontology. One can construct a feature function $\psi(b, s) = \psi_{joint}(b_{joint}) \oplus \psi_\bullet(b_\bullet) \oplus \psi_s(b_s)$, where $\oplus$ stands for the operator to concatenate two vectors. (In other words, the belief summarisation here only focuses on the slot being addressed by the proposed action, regardless of the beliefs for the other slots.) As the mechanism in each $\psi_*$ to featurise its operand $b_*$ can be domain-independent (see §3.3 for an example), the resulting overall feature vector will be domain-general.
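A minimal sketch of this focus-aware summarisation is given below. It assumes that every belief vector is summarised by the same two statistics (top probability and entropy) and that the belief is stored as a dictionary with the hypothetical keys `joint`, `method` and `slots`; the actual features used in this work are listed in §3.3.

```python
import math
from typing import Dict, List

Distribution = Dict[str, float]


def summarise(dist: Distribution) -> List[float]:
    """Domain-independent statistics of one belief vector (an assumed choice)."""
    entropy = -sum(p * math.log(p) for p in dist.values() if p > 0)
    return [max(dist.values()), entropy]


def focus_aware_psi(belief: Dict, slot: str) -> List[float]:
    """Concatenate joint, slot-independent and focused-slot features."""
    return (
        summarise(belief["joint"])
        + summarise(belief["method"])
        + summarise(belief["slots"][slot])
    )


belief = {
    "joint": {"food=pizza,area=centre": 0.7, "food=noodle,area=centre": 0.3},
    "method": {"by constraint": 0.8, "by name": 0.2},
    "slots": {
        "food": {"pizza": 0.7, "noodle": 0.3},
        "area": {"centre": 0.5, "north": 0.25, "south": 0.25},
    },
}
psi_food = focus_aware_psi(belief, "food")
psi_area = focus_aware_psi(belief, "area")
```

Both vectors have the same length even though the two slots have different numbers of values, which is what allows a single parameter vector to be shared across slots and domains.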

Ontology (slot) parameterisation
If we could further parameterise each slot $s$ in a domain-general way (as $\varphi(s)$), and define $\phi(b, a_s) = \delta(a) \otimes [\varphi_a(s) \oplus \psi_a(b, s)]$, the domain dependence of the overall feature function $\phi$ will be eliminated. Note here, to make the definition more general, we assume that the feature functions $\varphi_a$ and $\psi_a$ depend on $a$, such that a different featurisation can be applied for each $a$.
To achieve a meaningful parameterisation $\varphi_a(s)$, we need to investigate how each slot $s$ is related to the completion of the underlying task. More concretely, if the underlying task is to obtain the user's constraint on each slot so that the system can conduct a database (DB) search to find suitable entities (e.g. venues, products, etc.), then the slot features should describe the potential of the slot to refine the search results (reduce the number of matching entities) if that slot is filled. As another example, if the task is to gather necessary (plus optional) information to execute a system command (e.g. setting a reminder or planning a route), where the number of values of each slot can be unbounded, then the slot features should indicate whether the slot is required or optional. In addition, the slots may have specific characteristics that cause people to address them differently in a dialogue. For example, when buying a laptop, one would more likely talk about the price first than the battery rating. Therefore, features describing the priority of each slot are also necessary to yield natural dialogues. We give a complete list of features in §3.3 for a working example, to demonstrate how two unrelated domains can share a common ontology parameterisation.
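A minimal sketch of the resulting feature function, assuming it takes the form $\delta(a) \otimes [\varphi_a(s) \oplus \psi_a(b, s)]$: the delta now ranges over communication functions only, and the slot enters through its feature vector rather than its identity. The function `dip_phi` and all feature values are hypothetical.

```python
from typing import List, Sequence


def dip_phi(
    function: str,
    functions: Sequence[str],
    slot_features: Sequence[float],
    belief_features: Sequence[float],
) -> List[float]:
    """delta(a) (x) [slot parameterisation (+) focus-aware belief features]."""
    block = list(slot_features) + list(belief_features)  # concatenation
    features: List[float] = []
    for candidate in functions:
        scale = 1.0 if candidate == function else 0.0  # Kronecker delta over a
        features.extend(scale * x for x in block)
    return features


functions = ["confirm", "select"]  # domain-independent communication functions
# Hypothetical slot features (e.g. importance) and belief features (e.g. top probability).
restaurant_food = dip_phi("confirm", functions, [1.0, 0.0], [0.8])
laptop_price = dip_phi("select", functions, [1.0, 0.0], [0.4])
```

Slots from different ontologies land in the same feature space, so the vector length depends only on the number of communication functions and the size of the per-slot block.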

A working example
We use restaurant search and laptop sale as two example domains to explain the above idea. The underlying tasks of both problems can be regarded as DB search. Appendix A gives the detailed ontology definitions of the two domains.
Firstly, the following notations are introduced for the convenience of discussion. Let $V_s$ denote the set of values that a slot $s$ can take, and $|\cdot|$ be the size of a set. Assume that $h = (s_1 = v_1 \wedge \ldots \wedge s_n = v_n)$ is a user goal hypothesis consisting of a set of slot-value pairs. We use $DB(h)$ to denote the set of entities in the DB satisfying $h$. In addition, we define $\lfloor x \rfloor$ to be the largest integer less than or equal to $x$. After this, for each informable slot $s$ defined in Table A.1, the following quantities are used for its parameterisation.
• Importance: two features describing, respectively, how likely a slot will and will not occur in a dialogue.
• Priority: three features denoting, respectively, how likely a slot will be the first, the second, and a later attribute to address in a dialogue.
• Value distribution in the DB: the entropy of the normalised histogram $(|DB(s = v)|/|DB|)_{v \in V_s}$.
• Potential contribution to DB search: given the current top user goal hypothesis $h^*$ and a pre-defined threshold $\tau$ (= 12)
  - how likely filling $s$ will reduce the number of matching DB records to below $\tau$, i.e. $|\{v : v \in V_s, |DB(h^* \wedge s = v)| \le \tau\}| / |V_s|$;
  - how likely filling $s$ will not reduce the number of matching DB records to below $\tau$, i.e. $|\{v : v \in V_s, |DB(h^* \wedge s = v)| > \tau\}| / |V_s|$;
  - how likely filling $s$ will result in no matching records found in the DB, i.e. $|\{v : v \in V_s, DB(h^* \wedge s = v) = \emptyset\}| / |V_s|$.
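The two DB-derived quantities in the list above can be computed as sketched below. The toy database, the exact-match semantics of `matches` and the small threshold used in the usage lines are assumptions made for illustration (the paper uses a threshold of 12).

```python
import math
from typing import Dict, List, Sequence, Tuple

Entity = Dict[str, str]


def matches(db: Sequence[Entity], hypothesis: Dict[str, str]) -> List[Entity]:
    """DB(h): entities agreeing with every slot-value pair in the hypothesis."""
    return [e for e in db if all(e.get(s) == v for s, v in hypothesis.items())]


def value_entropy(db: Sequence[Entity], slot: str, values: Sequence[str]) -> float:
    """Entropy of the normalised histogram (|DB(s=v)| / |DB|) over v in V_s."""
    hist = [len(matches(db, {slot: v})) / len(db) for v in values]
    return -sum(p * math.log(p) for p in hist if p > 0)


def db_contribution(
    db: Sequence[Entity],
    top_hypothesis: Dict[str, str],
    slot: str,
    values: Sequence[str],
    tau: int = 12,
) -> Tuple[float, float, float]:
    """Fractions of values giving <= tau matches, > tau matches, and no matches."""
    counts = [len(matches(db, {**top_hypothesis, slot: v})) for v in values]
    n = len(values)
    return (
        sum(c <= tau for c in counts) / n,
        sum(c > tau for c in counts) / n,
        sum(c == 0 for c in counts) / n,
    )


toy_db = [
    {"food": "pizza", "area": "centre"},
    {"food": "pizza", "area": "north"},
    {"food": "noodle", "area": "centre"},
    {"food": "pizza", "area": "centre"},
]
food_values = ["pizza", "noodle", "thai"]
food_entropy = value_entropy(toy_db, "food", food_values)
food_contribution = db_contribution(toy_db, {"area": "centre"}, "food", food_values, tau=1)
```

Following the definitions literally, a value that yields no matching record is counted both in the first fraction and in the third one.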
The importance and priority features used in this work are manually assigned binary values, but ideally, if one has some in-domain human dialogue examples (e.g. from Wizard-of-Oz experiments), such feature values can be derived from simple statistics on the corpus. In addition, we make the last set of features only applicable to those slots not observed in the top joint hypothesis.
The summary belief features used in this work are sketched as follows. For each informable slot $s$ and each of its applicable action types $a$, $\psi_a(b, s)$ extracts the probability of $b_s^{\text{top}}$, the entropy of $b_s$, the probability difference between the top two marginal hypotheses (discretised into 5 bins with interval size 0.2) and the non-zero rate ($|\{v : v \in V_s, b_s(v) > 0\}| / |V_s|$). In addition, if the slot is requestable, the probability of it being requested by the user (Thomson and Young, 2010) is used as an extra feature. A similar featurisation procedure (except the "requested" probability) is applied to the joint belief as well, from which the obtained features are used for all communication functions.
To capture the nature of the underlying task (DB search), we define two additional features for the joint belief: an indicator $[[\,|DB(b_{joint}^{\text{top}})| \le \tau\,]]$ and, if the former is false, a real-valued feature $|DB(b_{joint}^{\text{top}})| / \tau$, where $\tau$ is the same pre-defined threshold used for slot parameterisation as introduced above. There are also a number of slot-independent features applied to all action types, including the belief over communication methods (Thomson and Young, 2010) and the marginal confidence scores of user dialogue act types in the current turn.
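The per-slot summary belief features can be sketched as follows. The one-hot encoding of the discretised probability difference and the function name `slot_belief_features` are assumptions; the text only states that the difference is discretised into 5 bins of width 0.2. The marginal is expected to list every value in the slot's value set, including those with zero probability.

```python
import math
from typing import Dict, List


def slot_belief_features(marginal: Dict[str, float], num_bins: int = 5) -> List[float]:
    """Top probability, entropy, binned top-two gap (one-hot), non-zero rate."""
    probs = sorted(marginal.values(), reverse=True)
    top = probs[0]
    second = probs[1] if len(probs) > 1 else 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Bins of width 1/num_bins; a gap of exactly 1.0 falls into the last bin.
    gap_bin = min(int((top - second) * num_bins), num_bins - 1)
    gap_one_hot = [1.0 if i == gap_bin else 0.0 for i in range(num_bins)]
    non_zero_rate = sum(p > 0 for p in probs) / len(probs)
    return [top, entropy] + gap_one_hot + [non_zero_rate]


food_marginal = {"pizza": 0.7, "noodle": 0.2, "thai": 0.1, "indian": 0.0}
food_features = slot_belief_features(food_marginal)
```

The output length is fixed at `num_bins + 3` regardless of how many values the slot has, so the same featuriser applies to any slot in any ontology.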
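The two DB-search features on the joint belief can be written compactly. In this sketch the real-valued feature is set to 0 when the indicator is true, which is an assumption: the text only defines it for the case where the indicator is false.

```python
from typing import List


def joint_db_features(num_matches: int, tau: int = 12) -> List[float]:
    """Indicator [[|DB| <= tau]] and, if it is false, the ratio |DB| / tau."""
    within_threshold = num_matches <= tau
    indicator = 1.0 if within_threshold else 0.0
    ratio = 0.0 if within_threshold else num_matches / tau  # 0.0 is an assumed default
    return [indicator, ratio]


few_matches = joint_db_features(5)
many_matches = joint_db_features(30)
```

Here `num_matches` stands for the number of DB entities matching the top joint hypothesis.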

Experimental Results
In the following experiments, the proposed domain-independent parameterisation (DIP) method was integrated with a generic dialogue state tracker (Wang and Lemon, 2013) to yield an overall domain-independent dialogue manager. Firstly, we trained DIP dialogue policies in the restaurant search domain using GP-SARSA with a state-of-the-art agenda-based user simulator (Schatzmann et al., 2007), in comparison with the GP-SARSA learning process for the well-known BUDS system (Thomson and Young, 2010) (where full beliefs are used (Gašić and Young, 2014)), as shown in Figure 1. The proposed method converges faster and can even achieve slightly better performance than the conventional approach.
[Figure 1: success rate (0.8 to 1) against the number of training dialogues (0 to 10,000) for BUDS GP-SARSA and DIP GP-SARSA.]
After this, we directly deployed the DIP policies trained in the restaurant search domain to the laptop sale domain. Table 1 shows that the performance of the transferred policy is almost identical to that of the in-domain policy. Finally, we chose the best in-domain and transferred DIP policies and deployed them into end-to-end laptop sale SDSs, for human subject experiments on MTurk. After each dialogue, the user was also asked to provide a subjective score for the naturalness of the interaction, ranging from 1 (very unnatural) to 6 (very natural). The results are summarised in Table 2, where the success rate difference (3%) between the in-domain policy and the transferred policy is statistically insignificant and, surprisingly, the users on average regard the transferred policy as slightly more natural than the in-domain policy.

Conclusion
This paper proposed a domain-independent ontology parameterisation framework to enable domain transfer of optimised dialogue policies. Experimental results show that, when transferred to a new domain, dialogue policies trained on the DIP representations can achieve performance very close to that of policies optimised using in-domain dialogues. Bridging the (very small) performance gap here should also be simple, if one takes the transferred policy as the prior and conducts domain-adaptation similar to . This will be addressed in our future work.