Addressing Objects and Their Relations: The Conversational Entity Dialogue Model

Statistical spoken dialogue systems usually rely on a single- or multi-domain dialogue model that is restricted in its capability to model complex dialogue structures, e.g., relations. In this work, we propose a novel dialogue model that is centred around entities and is able to model relations as well as multiple entities of the same type. In a prototype implementation, we demonstrate the benefits of relation modelling on the dialogue level and show that a trained policy using these relations outperforms the multi-domain baseline. Furthermore, we show that by modelling the relations on the dialogue level, the system is capable of processing relations present in the user input and even learns to address them in the system response.


Introduction
Data-driven statistical spoken dialogue systems (SDS) (Lemon and Pietquin, 2012; Young et al., 2013) are a promising approach for realizing spoken dialogue interaction between humans and machines. Up until now, these systems have successfully been applied to single- or multi-domain task-oriented dialogues (Lison, 2011; Wang et al., 2014; Papangelis and Stylianou, 2017; Peng et al., 2017) where each dialogue is modelled as multiple independent single-domain sub-dialogues. However, this multi-domain dialogue model (MDDM) does not offer an intuitive way of representing multiple objects of the same type (e.g., multiple restaurants) or dynamic relations between these objects. To the best of our knowledge, neither problem has yet been addressed in statistical SDS research.
The goal of this paper is to propose a new dialogue model, the conversational entity dialogue model (CEDM), which offers an intuitive way of modelling dialogues and complex dialogue structures inside the dialogue system. Inspired by Grosz (1978), the CEDM is centred around objects and relations instead of domains, thus offering a fundamental change in how we think about statistical dialogue modelling. The CEDM allows
• dynamic relations to be modelled directly, independently and persistently so that the relations may be addressed by the user and the system,
• the system to talk about multiple objects of the same type, e.g., multiple restaurants, while still allowing feasible policy learning.
The remainder of the paper is organized as follows: after presenting a brief motivation and related work in Section 2, Section 3 presents background information on statistical SDSs. Section 4 contains the main contribution and describes the conversational entity dialogue model in detail. Looking at one aspect of the CEDM, the modelling of relations, Section 5 describes a prototype implementation and shows the benefits of the CEDM in experiments with a simulated user. Section 6 concludes the paper with a list of open questions which need to be addressed in future work.

Motivation and Related Work
To introduce the terminology that will be used in this work and to illustrate the necessity of adequate modelling of relations, Figure 1 shows an example dialogue about hotels and restaurants in Cambridge with the relation "in the same area". Instead of talking about a sequence of domains, the system and the user talk about different objects and relations. Each part of the dialogue thus may be mapped to an object or a relation in the conversational world or may be mapped to the world itself (grey). In the example, the first part (blue) is about Object 1 of type hotel. When the focus shifts towards Object 2 of type restaurant (green) at U3, the user also addresses the relation (red) "in the same area" between Object 1 and Object 2.

Figure 1: A dialogue between the system (S) and a user (U) about a restaurant and a hotel in the same area along with the mapping of fractions of the dialogue to the respective objects (of predefined types) and the relation. All objects and relations reside inside a conversational world.

Addressing a relation in this way could still be captured by the semantic interpretation of the user input as the information area=centre may be derived from the context. However, if the user said "I need a hotel and a restaurant in Cambridge in the same area" right at the beginning of the dialogue (U1), no context information would be available. To capture these dialogue structures, the dialogue model and the corresponding dialogue state must be able to represent them adequately.
The proposed CEDM achieves this by modelling state information about conversational entities instead of domains. More precisely, it models separate states about the objects (e.g., the hotel or restaurant) and the relations. Previous work on dialogue modelling already incorporated the idea of objects or entities as the principal component of the dialogue state (Grosz, 1977; Bilange, 1991; Montoro et al., 2004; Xu and Seneff, 2010; Heinroth and Minker, 2013). However, these dialogue models are not based on statistical dialogue processing where a probability distribution over all dialogue states needs to be modelled and maintained. This additional complexity cannot be incorporated into those earlier models in a straightforward way. In contrast, the CEDM offers a comprehensive and consistent way of modelling these probabilities by defining and maintaining entity-based states. Work on statistical dialogue state modelling (Lee and Stent, 2016; Schulz et al., 2017) also contains a variant of objects but is still based on the MDDM, thus not offering any mechanism to model multiple entities or relations between objects. Ramachandran and Ratnaparkhi (2015) proposed a belief tracking approach using relational trees. However, they only consider static relations present in the ontology and are not able to handle dynamic relations.

Statistical Spoken Dialogue Systems
Statistical SDS are model-based approaches¹ and usually assume a modular architecture (see Fig. 2). The problem of learning the next system action is framed as a partially-observable Markov decision process (POMDP) that accounts for the uncertainty inherent in spoken communication. This uncertainty is modelled in the belief state b(s) representing a probability distribution over all states s.
Reinforcement learning (RL) is used in such a sequential decision-making process where the decision-model (the policy π) is trained based on sample data and a potentially delayed objective signal (the reward r) (Sutton and Barto, 1998). The policy selects the next action a ∈ A based on the current system belief state b to optimise the accumulated future reward R_t at time t:

R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} .    (1)

Here, k denotes the number of future steps, γ a discount factor and r_τ the reward at time τ. The Q-function models the expected accumulated future reward R_t when taking action a in belief state b and then following policy π:

Q^\pi(b, a) = E_\pi [ R_t \mid b_t = b, a_t = a ] .    (2)

For most real-world problems, finding the exact optimal Q-values is not feasible. Instead, RL algorithms have been proposed for dialogue policy learning based on approximating the Q-function directly or employing the policy gradient theorem (Williams and Young, 2006; Daubigney et al., 2012; Gašić and Young, 2014; Williams et al., 2017; Papangelis and Stylianou, 2017).

¹ Model-free approaches like end-to-end generative networks (Serban et al., 2016; Li et al., 2016) have interesting properties (e.g., they only need text data for training) but they still seem to be limited in terms of dialogue structure complexity (not linguistic complexity) in cases where content from a structured knowledge base needs to be incorporated. Approaches where incorporating this information is learned along with the system responses based on dialogue data (Eric and Manning, 2017) seem hard to scale.

Figure 2: The modular statistical dialogue system architecture. The dialogue manager takes the semantic interpretation as input to track the belief state. The updated state is then used by the dialogue policy to decide on the next system action.

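As a small illustration of the accumulated future reward, the following Python sketch computes a discounted sum over a finite list of future rewards. The function name `discounted_return` and the finite horizon are assumptions made for this example only; they are not taken from the paper or from PyDial.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of a finite list of future rewards.

    rewards[k] is the reward received k steps into the future and
    gamma is the discount factor (both hypothetical example inputs).
    """
    total = 0.0
    # Accumulate from the last reward backwards: R = r_k + gamma * R_next.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # Three future rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
    print(discounted_return([1, 1, 1], 0.5))
```

With a discount factor of 1, the sum reduces to the plain total of all future rewards.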
Aside from the policy model, the dialogue model plays an important role: it defines the structure and internal links of the dialogue state as well as the system and user acts (i.e., the semantics the system can understand). Thus, the policy model is only able to learn system behaviour based on what is defined by the dialogue model. By defining the dialogue state, the dialogue model further represents an abstraction over the task ontology or knowledge base, restricting the view to the information that is relevant for the system to be able to converse². Most current dialogue models are built around domains which encapsulate all relevant information as a section of the dialogue state that belongs to a given topic, e.g., finding a restaurant or hotel. However, the resulting flat state that is widely used (e.g., Williams et al., 2005; Lee and Stent, 2016; Schulz et al., 2017) does not offer an intuitive way to model complex dialogue structures like relations.
To overcome this limitation, we propose the conversational entity dialogue model which will be described in detail in the following section.

Conversational Entity Dialogue Model
The conversational entity dialogue model (CEDM) is proposed as an alternative way of statistical dialogue modelling that has the concept of entities at its core. Entities, which are either objects or relations, offer an intuitive way of modelling complex task-oriented dialogues.

Objects and Relations
Objects are entities of a certain object type (e.g., Restaurant or Hotel) where each type defines a set of attributes (see Fig. 1). This type definition matches the contents of the back-end knowledge base and thus the internal representation of real-world objects. This is similar to the definition of domains. In contrast to domains, though, this notion allows the modelling of multiple objects of the same type within a dialogue as well as the modelling of a type hierarchy which may be exploited during policy learning.
Relations are also entities that connect objects or attributes of objects. An example is shown in Figure 3: the two objects obj1 and obj2 of types Hotel and Restaurant respectively are connected through the attribute area with the equals relation.
Possible relations may directly be derived from the object type definitions, e.g., by allowing only connections for attributes that represent the same concepts like area. Note that these relations are dynamic relations that may be drawn between objects in a conversation. This is different to static relations which are often used in knowledge bases to describe how concepts relate to each other.
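The derivation of possible relations from the object type definitions can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the class `ObjectType`, the function `derive_relation_attributes` and the attributes `stars` and `food` are hypothetical names chosen for the example, and "same concept" is approximated by identical attribute names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectType:
    """Hypothetical object type: a name plus the attributes it defines."""
    name: str
    attributes: tuple


def derive_relation_attributes(type_a, type_b):
    """List relation attributes for attributes shared by both types.

    Assumption for this sketch: two attributes represent the same
    concept exactly when they carry the same name.
    """
    shared = [a for a in type_a.attributes if a in type_b.attributes]
    return [f"{a}2{a}" for a in shared]


hotel = ObjectType("Hotel", ("area", "pricerange", "stars"))
restaurant = ObjectType("Restaurant", ("area", "pricerange", "food"))

if __name__ == "__main__":
    print(derive_relation_attributes(hotel, restaurant))
```

Because the relation attributes are computed from the type definitions alone, no relation has to be listed by hand when a new object type is added.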

Conversational Entities in a Conversational World
A conversational entity is a virtual entity that exists in the context of the current conversation and is either a conversational object or a conversational relation. A conversational object may match a real-world entity but does not need to. In fact, the task of a goal-oriented dialogue is often to find a matching real-world entity based on the information acquired by the system during the dialogue.

Figure 3: Example mapping of a user utterance to two objects and one relation.

In the example dialogue (Fig. 1), matching entities have already been found for both objects. However, a conversational object exists independently of whether a matching real-world entity has been found yet or even exists. Derived from the object type definition, a conversational object comprises an internal state that consists of the user goal belief s_u and the context state s_c as shown in the example in Figure 4. There, s_u is depicted using marginal probabilities for each slot (which is common in recent work on statistical SDS). While the user goal belief models the system's belief of what the user wants based on the user input, the context state models information that the system has shared with the user. In the example of Figure 4, the system has already offered a matching real-world object based on the user goal belief of the conversational object. If no offer has been made yet, the context state is empty.
The context state plays an important role as addressed relations usually refer to the object offered by the system instead of the search constraints represented by the user goal belief. The context state further makes it possible to relate to attributes that have not been mentioned in the dialogue.
One key aspect of the CEDM is that relations are also modelled as a conversational entity. Thus, these conversational relations also define a user goal belief and a context state as shown in Figure 5. The attributes of the relation are created out of the attributes of the objects they connect. In the given example, the attributes area and pricerange of the two objects are connected resulting in the relation attributes area2area and pricerange2pricerange. The values of these attributes are the actual relations, e.g., equals or greater/less than. Similar to the slot belief of conversational objects, each attribute is modelled with a marginal probability over all possible relations. Assigning part of the belief state to the relations enables the system to specifically react to these relations and even to address them in a system utterance. Furthermore, if the context state of one of the related objects changes (e.g., because the user changed their mind), the relation may still persist.

Figure 5: Example of the conversational entity Relation1 between obj1 and obj2. The user goal belief models the search constraints the user has provided to the system and the context state represents the relations based on the most recent real-world matches for both objects offered by the system.

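The two-part state of conversational objects and relations can be sketched as plain Python data structures. The class names, field names and the relation value `not_equals` are assumptions introduced for this example; the paper itself names only `equals` and greater/less than as possible relation values and does not prescribe an implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ConversationalObject:
    """Hypothetical container for the state of one conversational object."""
    type_name: str
    # User goal belief: slot -> {value: marginal probability}
    user_goal_belief: dict = field(default_factory=dict)
    # Context state: slot -> value of the entity offered by the system
    context_state: dict = field(default_factory=dict)


@dataclass
class ConversationalRelation:
    """Hypothetical relation entity connecting attributes of two objects."""
    source: ConversationalObject
    target: ConversationalObject
    attributes: tuple
    # User goal belief: relation attribute -> {relation value: probability}
    user_goal_belief: dict = field(default_factory=dict)

    def context_state(self):
        """Relations that hold between the most recent offers of both objects."""
        state = {}
        for attr in self.attributes:
            a = self.source.context_state.get(attr)
            b = self.target.context_state.get(attr)
            if a is not None and b is not None:
                # "not_equals" is an illustrative placeholder value.
                state[f"{attr}2{attr}"] = "equals" if a == b else "not_equals"
        return state


obj1 = ConversationalObject("Hotel", context_state={"area": "centre", "pricerange": "expensive"})
obj2 = ConversationalObject("Restaurant", context_state={"area": "centre", "pricerange": "cheap"})
relation1 = ConversationalRelation(
    obj1, obj2, ("area", "pricerange"),
    user_goal_belief={"area2area": {"equals": 0.9}},
)

if __name__ == "__main__":
    print(relation1.context_state())
```

In this sketch the context state of the relation is recomputed from the offered entities, whereas the user goal belief of the relation is stored separately, so a constraint such as `area2area = equals` persists when one of the offers changes.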
Each conversational entity resides within a conversational world w (see Fig. 1) that defines the number of objects and the type of each object (relations may be derived from this) as well as general state information. This world may either be predefined or be derived from the user input. In the latter case, the user input is usually noisy and the resulting uncertainty needs to be modelled within the dialogue state. As this work focuses on relation modelling, a predefined conversational world is used, leaving the uncertainty modelling of conversational worlds for future work.

Belief Tracking and Focus of Attention
The task of belief tracking is to update the probability distribution b'(s) over the states s based on the system action a, the observation o of the user input and the previous probability distribution b:

b'(s) = P(s \mid o, a, b) .    (3)

With the additional complexity of the CEDM having an unknown number of entities in a conversational world, we propose to decompose the state s in the spirit of work by Williams et al. (2005). The belief update for each entity e is then defined as

b'_e(s_e) = P(s_u, s_c, h_e, u \mid o, a, b_e) ,    (4)

where s_u is the user goal state of entity e, s_c the context state of e, h_e the dialogue history of e and u the last user action³. The belief update for the world belief b_w is

b'_w(s_w) = P(s_w, h_w, u \mid o, a, b_w) ,    (5)

where s_w is the world state of world w, h_w the dialogue history and u the last user action. This multi-part belief allows hierarchical dialogue processing on the world level and the entity level as depicted in Figure 6. Each level produces its own belief and based on that, the system is able to act on each level. On the world level, the system might produce general dialogue behaviour like greetings or engage in a dialogue to adequately identify the entity which is addressed by the user input. On the entity level, the system talks to the user to acquire information about the concrete entity the user is talking about, e.g., to find a matching entity in the knowledge base.
In addition to belief tracking, we would like to introduce another concept called focus of attention. Based on work by Grosz (1978), we define the current focus of attention F for each conversational world as a subset of conversational entities in this world F ⊆ W. Hence, the task of focus tracking is to find the new set of conversational entities which is in the current focus of attention based on the user input and the updated belief state. Even though the concept of focus is not mandatory, it may be helpful when framing the reinforcement learning problem as it allows limiting the size of the input to the reinforcement learning algorithm as well as the number of actions available to the learning algorithm at a given time. Using F may also prevent the system from acting in parts of the belief state that are completely irrelevant to the current part of the conversation.

³ In case of an unknown number of entities represented by a probability over worlds, the probability in Equation 4 needs to be extended to depend on the conversational world and needs to be multiplied by a probability over all worlds.

[Figure 6: Hierarchical dialogue processing. The world level produces the world belief b_w and general behaviour; the entity level produces the entity belief b_e and entity-specific behaviour.]

The Conversational Entity vs. the Multi-Domain Dialogue Model
The functionality and the modelling possibilities of the proposed CEDM go beyond (and thus include) the possibilities of the multi-domain dialogue model (MDDM). To demonstrate this, we will outline how a dialogue using the MDDM may be modelled using the CEDM. The core concept domain of the MDDM may be mapped to one conversational object of a specific type where the slots of the domain are the attributes of the type. Since the number of domains is predefined, there is only one conversational world with a set number of conversational objects. Relations may not be modelled using the MDDM. Belief update is reduced to finding the right entity for the user input and updating its state. In the CEDM, the semantic decoding of user input includes the entity (or entity type) it refers to, which is similar to the topic tracker of the MDDM where the topic tracker also defines the domain the system acts in. Hence, the focus of attention will always contain only the entity that has been addressed by the user. By that, a policy for each conversational object (and thus object type) may be trained which is the same as the domain policies of the MDDM.

Relation Modelling Evaluation
To demonstrate the capabilities and benefits of the conversational entity dialogue model (CEDM), the aspect of relation modelling has been selected as it is a core concept of the CEDM. For this, we build upon the mapping to the multi-domain dialogue model (MDDM) as described in Section 4.4 and extend it with conversational relations. After a brief description of the model implementation, the experiments and their results are presented using two conversational objects of different types. Note that only the equals relation is considered here due to limitations of the marginal belief state model.

Model Implementation
To implement all relevant aspects of the CEDM, the publicly available open-source statistical dialogue system toolkit PyDial is used, which originally follows the MDDM. The main challenge for policy implementation is to integrate both the state of the object in F and the states of all corresponding relations into the dialogue decision. To achieve this, a hierarchical policy model based on feudal reinforcement learning (Dayan and Hinton, 1993) has been implemented following the approach of Casanueva et al. (2018). For each object type, a master policy decides whether the next system action addresses a conversational relation or the conversational object. A respective sub-policy is then invoked in a second step where each object type and each relation type are modelled by an individual policy. Thus, the model decomposes the action selection problem to account for the specifics of the object policy and the relation policies, respectively, and is able to handle a variable number of relations and a large state space. During training, all policies (master and sub-policies) receive the same reward signal.
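The two-step decision of the feudal policy model can be sketched as follows. This is only an illustration of the control flow under stated assumptions: `feudal_select`, the toy policies, the `conflict` key and the action strings are hypothetical and do not correspond to the PyDial API or to the trained GP-SARSA policies.

```python
def feudal_select(master_policy, sub_policies, focus_belief):
    """Two-step action selection.

    Step 1: the master policy picks which kind of entity to address.
    Step 2: the sub-policy of that entity picks the concrete system act.
    """
    target = master_policy(focus_belief)         # "object" or "relation"
    action = sub_policies[target](focus_belief)  # act chosen by that sub-policy
    return target, action


def toy_master(belief):
    """Hand-written stand-in for a learned master policy."""
    return "relation" if belief.get("conflict") else "object"


# Hand-written stand-ins for learned sub-policies (illustrative acts only).
toy_sub_policies = {
    "object": lambda belief: "request(food)",
    "relation": lambda belief: "confirm(area2area=equals)",
}

if __name__ == "__main__":
    print(feudal_select(toy_master, toy_sub_policies, {"conflict": True}))
```

In a learned setting, each of the three callables would be replaced by a trained policy, and the same reward would be passed to the master policy and to the sub-policy that acted.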
Aside from the feudal RL architecture, which seems to be intuitive for the proposed CEDM, the main problem is the handling of back-end database access. In the MDDM, each domain models all information which is necessary to do the database lookup. However, this is not possible in the CEDM as information from different conversational objects and relations needs to be taken into account. One way of doing this is to apply a rule-based merging of the state of the conversational object in F with the states of all other conversational objects that are related through a conversational relation to form the focus state b, which is defined for each slot s and value v in terms of the beliefs b_i, where b_i is the belief of the i-th conversational entity involved in the merging process. The merging also makes conflicts visible at this level which may exist between the state of the conversational object and the state defined by the relation. To help the policy to learn in this situation, an additional conflict bit is added to the focus belief state as input to the master policy.
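To make the idea of merging and the conflict bit more concrete, the following Python sketch merges several slot beliefs into one focus state. The merging rule used here (taking the maximum probability per slot value) and the conflict criterion (different top values for the same slot) are assumptions chosen for illustration; they are not the rule used in the paper, and `merge_focus_state` is a hypothetical name.

```python
def merge_focus_state(beliefs):
    """Merge slot beliefs of several entities into one focus state.

    beliefs: list of dicts, each mapping slot -> {value: probability}.
    Returns (merged, conflict) where conflict is the additional bit.
    Assumed rule for this sketch: keep the maximum probability per value.
    """
    merged = {}
    for belief in beliefs:
        for slot, distribution in belief.items():
            slot_belief = merged.setdefault(slot, {})
            for value, prob in distribution.items():
                slot_belief[value] = max(slot_belief.get(value, 0.0), prob)

    # Assumed conflict criterion: entities disagree on the top value of a slot.
    conflict = False
    for slot in merged:
        top_values = {
            max(belief[slot], key=belief[slot].get)
            for belief in beliefs
            if belief.get(slot)
        }
        if len(top_values) > 1:
            conflict = True
    return merged, conflict


if __name__ == "__main__":
    object_belief = {"area": {"north": 0.8}}
    relation_implied = {"area": {"centre": 0.9}}  # e.g. from area2area = equals
    print(merge_focus_state([object_belief, relation_implied]))
```

The returned flag plays the role of the conflict bit: it would be appended to the focus belief state before it is passed to the master policy.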
The source code of the CEDM implementation is available at http://pydial.org/cedm.

Experimental Setup
To evaluate the relation modelling capabilities of the CEDM, the task of finding a hotel and a restaurant in Cambridge has been selected (corresponding to the CamRestaurants and CamHotels domains of PyDial). The corresponding conversational world consists of two conversational objects of types hotel and restaurant and one conversational relation. Based on the object type definitions, the conversational relation connects the slots area and pricerange of both objects. Using a simulated environment, the goals of the simulated user were generated so that at least one of these two slots is related (i.e., contains the same value).
To test the influence of the user addressing the relation instead of the correct value (e.g., "restaurant in the same area as the hotel" vs. "restaurant in the centre"), we have extended the simulated agenda-based user (Schatzmann and Young, 2009) with a probability r of the user addressing the relation instead of the value. The higher r, the more often the user addresses the relation. The user simulator is equipped with an additional error model to simulate the semantic error rate (SER) caused in a real system by the noisy speech channel.
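The effect of the relation probability r on the simulated user can be sketched as a single sampling step. This is a simplified illustration, not the agenda-based simulator itself: `realise_constraint`, its arguments and the returned dictionaries are hypothetical.

```python
import random


def realise_constraint(slot, value, related_value, r, rng):
    """Decide how the simulated user expresses one goal constraint.

    slot, value:    the constraint from the user goal
    related_value:  value of the same slot for the previously discussed
                    object, or None if there is no such object
    r:              probability of addressing the relation instead of the value
    rng:            a random.Random instance (for reproducibility)
    """
    is_related = related_value is not None and related_value == value
    if is_related and rng.random() < r:
        # e.g. "restaurant in the same area as the hotel"
        return {"slot": slot, "relation": "equals"}
    # e.g. "restaurant in the centre"
    return {"slot": slot, "value": value}


if __name__ == "__main__":
    rng = random.Random(0)
    print(realise_constraint("area", "centre", "centre", 0.5, rng))
```

For r = 0 the user always states the value, and for r = 1 the user always addresses the relation whenever the two objects share the value of the slot.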
For belief tracking, an extended version of the focus tracker (Henderson et al., 2014), an effective rule-based tracker, was used for the conversational entities and the conversational world. The extension also discounts probabilities if the respective value has been rejected by the user. As a simulated interaction is on the semantic level, no semantic decoder for the relations is necessary. For training and evaluation of the proposed framework, both the master policy and all sub-policies are modelled with the GP-SARSA algorithm (Gašić and Young, 2014). This is a value-based method that uses a Gaussian process to approximate the Q-function (Eq. 2). As it takes into account the uncertainty of the estimate, it is sample-efficient.
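A rule-based update in the style of the focus tracker can be sketched for a single slot as follows. The sketch assumes the commonly described form of the rule, in which the previous belief is scaled by the probability mass not claimed by the current semantic hypotheses; the extension mentioned above (discounting rejected values) is not shown, and `focus_update` is a hypothetical name rather than a PyDial function.

```python
def focus_update(prior, slu):
    """One focus-style belief update for a single slot.

    prior: {value: probability} belief from the previous turn
    slu:   {value: probability} semantic hypotheses of the current turn
    Assumed rule: b'(v) = q * b(v) + slu(v), with q = 1 - sum(slu).
    """
    q = max(0.0, 1.0 - sum(slu.values()))
    values = set(prior) | set(slu)
    return {v: q * prior.get(v, 0.0) + slu.get(v, 0.0) for v in values}


if __name__ == "__main__":
    print(focus_update({"centre": 0.6}, {"north": 0.5}))
```

New evidence therefore shifts belief towards the most recently mentioned value, while values mentioned earlier keep a reduced share of the probability mass.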
To compare the dialogue performance of the CEDM with the MDDM baseline, two experiments have been conducted. All dialogues follow the same structure: the user and the system first talk about one conversational object before moving on to the second object. As the user only addresses a relation to an object that has previously been part of the dialogue, relations are only relevant when talking about the second object. However, there are times where a relation has been addressed by the user before the goal of the first object changed which resulted in the addressed relation being wrong. This could only be resolved by the system by addressing the relation itself.
Experiment 1. In the first experiment, the influence of r on the dialogue performance is investigated in a controlled environment. Having a fixed order, only the feudal policy of the second object (where relations may occur), the restaurant, is learned. To avoid interfering effects of jointly learning both policies at the same time, the first object hotel uses a handcrafted policy.
Experiment 2. The second experiment focuses on the joint learning effects. Thus, the order of objects is alternated, all objects use the feudal policy model and are trained simultaneously.

Results
The experiments have been conducted based on the PyDial simulation environments Env. 1 and Env. 3, where Env. 1 operates on a clean communication channel with an SER of 0% and Env. 3 simulates an SER of 15%. For each experiment, a policy for the respective object types was trained with 4,000 and tested with 1,000 dialogues. The reward was set to +30/+0 for success/failure and -1 for each turn with a maximum of 25 turns per object. The results were averaged over 5 different random seeds.
Experiment 1. As can be seen in Table 1 and Figure 7 on the left, the proposed CEDM with a feudal policy model is easily able to deal with relations addressed by the user for any relation probability r in both environments. Success rate and reward achieve similar results for all r. Only for very high r, a small reduction in performance is visible. This can be explained by the added complexity of the dialogue itself as well as the system actions that address the relations. A high relation probability for a slot requires the system to address either the relation or the slot value directly. Both actions may have similar or contradicting impact on the dialogue which makes it harder to learn a good policy. In Env. 3, the added noise results in minor fluctuations, which may be expected.
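The reward scheme stated above (+30 for success, +0 for failure, -1 per turn) can be written as a one-line function. This is a sketch that assumes the final reward is the plain sum of these components; `dialogue_reward` and its parameter names are hypothetical.

```python
def dialogue_reward(success, num_turns, success_reward=30, turn_penalty=1):
    """Total reward for one object: success bonus minus a penalty per turn."""
    return (success_reward if success else 0) - turn_penalty * num_turns


if __name__ == "__main__":
    print(dialogue_reward(True, 6))    # successful dialogue after 6 turns
    print(dialogue_reward(False, 25))  # failure at the 25-turn limit
```

Under this assumption, a reported average reward of about 23 with a success rate close to 100% corresponds to successful dialogues of roughly seven turns per object.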
In contrast, the baseline (the MDDM) is not able to handle the user addressing relations adequately for higher r: while for low r, the policy is able to compensate by requesting the respective information again, the performance drops at around r = 0.5. The reason why the performance of the baseline does not drop as much in Env. 3 as it does in Env. 1 is the way the simulated error model of the simulated user operates. By producing a 3-best-list of user inputs, the chance that the actual correct value is introduced as noise if a relation has originally been uttered is relatively high. As the n-best-list of Env. 1 has the length of one, this does not happen there.
The performance of the hand-crafted hotel policy was similar for all r in Env. 1 with rew = 23.4, suc = 99.7% and in Env. 3 with rew = 20.1, suc = 94.5%.
Analysing the system actions of the dialogues of the CEDM shows that the system learns to address a relation in up to 28% of all dialogues for r = 1.0.
Example dialogues for Env. 1 are shown in Figures 8 and 9.
Experiment 2. The results shown in Table 1 and Figure 7 on the right show the performance of the conversational object policies when the respective object was the second one in the dialogue (where relations occur). Still, policies of both objects were trained in all dialogues. The effects of this added noise become visible in the results as they seem to be less stable. Furthermore, the overall performance of the restaurant policy drops slightly, but still shows the same characteristics as in Experiment 1. Learning a hotel policy results in worse performance both overall (which matches the literature) and in cases where a relation is involved.
The performance of the policy of the first object was similar for all r where the restaurant policy achieved rew = 21.5, suc = 95.4% and the hotel policy rew = 18.8, suc = 90.2%.
Analysing the system actions of the dialogues shows that the CEDM learns to address a relation in up to 24.5% of all dialogues for r = 1.0.

Conclusion and Future Work
In this paper, we have presented a novel dialogue model for statistical spoken dialogue systems that is centred around objects and relations (instead of domains) thus offering a new way of modelling statistical dialogue. The two major advantages of the new model are the capability of including multiple objects of the same type and the capability of modelling and addressing relations between the objects. By assigning a part of the belief state not only to each object but to each relation as well, the system is able to address the relations in a system response.
We have demonstrated the importance of the aspect of relation modelling, a core functionality of our proposed model, in simulated experiments showing that by using a hierarchical feudal policy architecture, adequate policies may be learned that lead to successful dialogues in cases where relations are often mentioned by the user. Furthermore, the resulting policies also learned to address the relation itself in the system response.
However, only a small part of the proposed dialogue model has been evaluated in this paper. To explore its full potential, many questions need to be addressed in future work. To create a suitable semantic decoder that is able to semantically parse linguistic information about relations, the extensive prior work on named entity recognition and dependency parsing needs to be leveraged and applied so that real user experiments can be conducted. Moreover, relations other than equals need to be investigated. Finally, the challenges of identifying all conversational entities in the dialogue and assigning the correct one to each user action as well as finding suitable belief-tracking approaches for the proposed multi-layered architecture along with effective policy models need to be addressed.