Spoken Dialogue for Information Navigation

Aiming to expand the current research paradigm for training conversational AI agents that can address real-world challenges, we take a step away from traditional slot-filling goal-oriented spoken dialogue systems (SDS) and model the dialogue in a way that allows users to be more expressive in describing their needs. The goal is to help users make informed decisions rather than being fed matching items. To this end, we describe the Linked-Data SDS (LD-SDS), a system that exploits semantic knowledge bases that connect to linked data, and supports complex constraints and preferences. We describe the required changes in language understanding and state tracking, and the need for mined features, and we report the promising results (in terms of semantic errors, effort, etc.) of a preliminary evaluation after training two statistical dialogue managers in various conditions.


Introduction
A growing body of research addresses many aspects of Spoken Dialogue Systems (SDS), with applications ranging from well-defined goal-oriented tasks to open-ended dialogue, e.g., (Amazon, 2017). Deep learning and joint optimisation of SDS components are becoming the standard approach (e.g., Li et al., 2016; Williams et al., 2017; Liu et al., 2017; Cuayáhuitl et al., 2017; Yang et al., 2017), showing many benefits but also limitations and disadvantages. Due to the complexity of the problem, most of these approaches focus on limited applications, e.g., information retrieval on small domains or shallow-understanding chat-bots.
Moving towards conversational AI, we shift the paradigm to information navigation and present in this work a more realistic goal-oriented setup. The proposed paradigm is designed for complex interactions using semantic knowledge bases and linked data (Heath and Bizer, 2011), and allows users to be more expressive in describing their constraints and preferences. We aim to enable users to make informed decisions by understanding their needs and priorities through conversation with an intelligent agent.
In this work we extend the Linked Data Spoken Dialogue System (LD-SDS) proposed in (Papangelis et al., 2017) in the following directions: a) we propose features mined over the set and the order of objects in the current user focus, b) we modify the language understanding and belief state tracking modules to support the proposed complex interactions over rich information spaces, c) we apply an agenda-based user simulator to train two statistical dialogue manager models, and d) we conduct a preliminary evaluation with promising results.

Challenges and Requirements
As our paradigm moves towards information navigation, we assume that users have a vague idea of what they are looking for and that, through interaction with the system, they can understand their own needs better. The user's intents, therefore, do not always express hard restrictions (constraints) but often express preferences that users may or may not be willing to relax as the dialogue progresses. Such preferences may refer to the importance of attributes over other attributes (e.g., location is much more important than has-freewifi when searching for accommodation), or may refer to preferred values of a given attribute (e.g., prefer central over northern locations, although northern may still be acceptable under certain circumstances), etc. Moreover, it is worth highlighting aspects of items that may not have been mentioned but have high discriminative power within their cluster (e.g., 5 hotels match the user's preferences but one of them has a vegan menu).
Towards this objective, we propose the interaction of SDS with exploratory systems that offer the aforementioned functionality over semantic knowledge bases. This requires extensions to language understanding and state tracking, as well as mined features.

Background: Preference-Enriched Faceted Search and Hippalus
Faceted search is currently the de facto standard in e-commerce (e.g., eBay, booking.com), and its popularity and adoption are increasing. The enrichment of Faceted Search with preferences, hereafter Preference-enriched Faceted Search (PFS), was proposed in (Tzitzikas and Papadakos, 2013). It has proven useful for recall-oriented information needs, because such needs involve decision making that can benefit from the gradual interaction and expression of not only restrictions (hard constraints) but also preferences (soft constraints). Notably, PFS allows expressing preferences over attributes whose values can be hierarchically organized and/or multi-valued, supports preference inheritance, and offers scope-based rules for automatic conflict resolution. PFS offers various preference actions (e.g., relative, best, worst, around, etc.) that allow the user to order facets (i.e. slots), values, and objects. Furthermore, the user is able to compose object-related preference actions.
Essentially, a user $u$ can gradually express a set of qualitative (i.e. relative) preferences over the values of each facet (slot), denoted by $Pref_u$. These actions define a preference relation (a binary relation) over the values $V_{s_i}$ of each slot $s_i$, denoted by $\succ_i$, which are then composed to define a preference relation over the elements of the information space, i.e. over $V = V_{s_1} \times \dots \times V_{s_n}$ (in the case of multi-valued slots, $V = \mathcal{P}(V_{s_1}) \times \dots \times \mathcal{P}(V_{s_n})$). Since the descriptions of the objects in the current user focus $F_u$ are a subset of $V$, the actions in $Pref_u$ define a preference relation over $F_u$, denoted as $(F_u, \succ_{Pref_u})$, from which a bucket order of $F_u$, i.e. a linear order of subsets of $F_u$ ranked based on preference and denoted by $B(F_u, Pref_u) = \langle b_1, \dots, b_z \rangle$, is derived through topological sorting.
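To make the last step concrete, the following is a minimal sketch of how a bucket order could be derived from a preference relation by layered topological sorting. It is not the Hippalus implementation: the function name `bucket_order`, the encoding of the relation as a set of (better, worse) pairs, and the toy objects are all assumptions made for illustration.

```python
from collections import defaultdict


def bucket_order(objects, prefers):
    """Group objects into ranked buckets by layered topological sorting.

    `prefers` is a set of (better, worse) pairs over `objects` (an assumed
    encoding of the preference relation). The first bucket holds the objects
    that nothing is preferred to; each later bucket holds the objects whose
    dominators all sit in earlier buckets.
    """
    objects = list(objects)
    position = {o: i for i, o in enumerate(objects)}
    pending = defaultdict(int)      # dominators not yet placed in a bucket
    dominated = defaultdict(list)   # better -> objects it is preferred to
    for better, worse in prefers:
        pending[worse] += 1
        dominated[better].append(worse)

    buckets, placed = [], 0
    current = [o for o in objects if pending[o] == 0]
    while current:
        buckets.append(current)
        placed += len(current)
        released = []
        for better in current:
            for worse in dominated[better]:
                pending[worse] -= 1
                if pending[worse] == 0:
                    released.append(worse)
        # Keep the input order inside each bucket so results are deterministic.
        current = sorted(released, key=position.get)
    if placed != len(objects):
        raise ValueError("preference relation is cyclic")
    return buckets


# Toy focus: A and B are both preferred to C, and C is preferred to D.
print(bucket_order(["A", "B", "C", "D"], {("A", "C"), ("B", "C"), ("C", "D")}))
```

Objects that are incomparable under the relation end up tied in the same bucket, which is what distinguishes a bucket order from a strict ranking.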
Hippalus (Papadakos and Tzitzikas, 2014) is a publicly accessible exploratory search system that materializes PFS over semantic views gathered from different data sources through SPARQL queries. The information base that feeds Hippalus is represented in RDF/S, and objects can be described according to dimensions with hierarchically organized and set-valued attributes. Preference actions are validated using the preference language described in (Tzitzikas and Papadakos, 2013). If valid, the system computes the respective preference bucket order and returns the corresponding ranked list of objects.
In addition, Hippalus implements the scoring function defined in (Tzitzikas and Dimitrakis, 2016), which expresses the degree to which an object in $F_u$ fulfills the preferences in $Pref_u$ as a real number (in our case, in the interval [1, 100]). This scoring function exploits all composition modes available in Hippalus, enriching the bucket orders with scores that respect consistency with the qualitative-based bucket order, defined as follows: a scoring function $score$ is consistent with the qualitative-based bucket order if, for any two objects $o, o'$ and any set of user actions $Pref_u$

Motivation
In order to reduce the complexity of the dialogue system while at the same time improving its efficiency and effectiveness, we enriched the response of the Hippalus system with a number of features, which provide cues about interesting slots/values (as mentioned in §2.1) that can be exploited by the Belief Tracker, Dialogue Manager, Natural Language Generator, and other statistical components of the SDS. These features are extracted from: a) the set of objects of the current user focus (selectivity and entropy); and b) the imposed ordering of the objects according to the expressed user preferences (avg, min and max preference score per bucket, and pair-wise wins of objects per slot per bucket).

Features extracted from object focus
Assume a dataset $D$ that contains $|O_D|$ objects, where $F_u \subseteq O_D$ is the current focus of the user $u$ (i.e. the objects that satisfy the expressed hard constraints). Let $S_{|F_u} = \{s_1, \dots, s_n\}$ denote the set of available slots in $D$ under focus $F_u$, and $V_{s_i|F_u} = \{v_{s_i 1}, \dots, v_{s_i m}\}$ denote the set of values for slot $s_i \in S_{|F_u}$. We define the following metrics:
Definition 3.1. The selectivity of a slot $s_i$ under focus $F_u$ is defined as:
Definition 3.2. The entropy of a slot $s_i$ under focus $F_u$ is defined as:
Both selectivity and entropy provide insights about the discreteness and the amount of information contained in the values of a specific slot for the objects under focus $F_u$. Selectivity is an inexpensive but rough metric that takes values in [0, 1]. If the value of each object for a specific slot is unique, then selectivity is 1 (high selectivity), while it is near 0 in the opposite case (low selectivity). Entropy, on the other hand, is a more refined but more expensive metric, with larger values when the probabilities of the values in $V_{s_i|F_u}$ are equal. Hippalus returns the values of both metrics for each slot of the current user focus $F_u$ on the fly, along with the precomputed values for the whole dataset.
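As a sketch of how these two metrics could be computed, the code below assumes that selectivity is the ratio of distinct slot values to objects in focus and that entropy is the Shannon entropy of the empirical value distribution. Both formulas, the base-2 logarithm, the restriction to single-valued slots, and the toy data are assumptions chosen to match the properties stated above; they are not taken from Hippalus.

```python
import math
from collections import Counter


def selectivity(values):
    """Assumed form: number of distinct values / number of objects in focus."""
    if not values:
        return 0.0
    return len(set(values)) / len(values)


def entropy(values):
    """Assumed form: Shannon entropy (in bits) of the empirical distribution."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# One slot value per object in a hypothetical focus of four hotels.
area = ["central", "central", "north", "central"]
print(selectivity(area))  # two distinct values over four objects
print(entropy(area))      # below 1 bit because the values are unevenly spread
```

Under these assumptions selectivity needs only a distinct count, whereas entropy needs the full value histogram, which matches the cheap-but-rough versus refined-but-expensive contrast drawn above.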

Features extracted from object order
Other interesting features can be extracted from the imposed ordering of objects based on the user preferences, including the min, max, and average preference score of the objects in each bucket and, for each object of a bucket, the sum of pair-wise wins for each slot over which the user has expressed a preference. The last feature can be used as an indication of the number of wins of each object over all different preference criteria (slots), pinpointing criteria that affect only a small number of objects.
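The bucket-level features just listed could be sketched as follows. This is an illustrative reading only: the helper names, the encoding of a slot preference as a `beats(a, b)` predicate, and the preference scores are hypothetical, and the per-object win counts shown here are the raw ingredient of a pair-wise wins feature rather than a specific aggregate formula. The hotel names and prices are borrowed from the example dialogue in the appendix.

```python
from statistics import mean


def bucket_score_stats(scores):
    """Min, max and average preference score of the objects in one bucket."""
    return {"min": min(scores), "max": max(scores), "avg": mean(scores)}


def pairwise_wins(bucket, beats):
    """For one slot, count how many other bucket objects each object beats.

    `beats(a, b)` is an assumed predicate that is True when `a` is strictly
    preferred to `b` on that slot.
    """
    wins = {obj: 0 for obj in bucket}
    for a in bucket:
        for b in bucket:
            if a != b and beats(a, b):
                wins[a] += 1
    return wins


# Prices per night for three hotels sharing a bucket; cheaper is preferred.
price = {"Ryokan Kyoraku": 59, "Daiwa Roynet": 81, "Royal Park": 79}
wins = pairwise_wins(list(price), lambda a, b: price[a] < price[b])
print(wins)

# Hypothetical preference scores in [1, 100] for the same bucket.
print(bucket_score_stats([90, 70, 74]))
```

A slot on which one object collects most of the wins is a natural candidate for the system to ask about, since prioritising it would separate the objects in the bucket.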
Definition 3.3. The pair-wise wins (PWW) metric under focus $F_u$ of the objects contained in a bucket $b \in B(F_u, Pref_u)$, derived by the preference actions $Pref_u$ of user $u$ for slot $s$, is defined as:
Notice that large PWW values mean that a small number of objects, possibly a single object, win over the remaining objects of the bucket for the preference actions of a specific slot. As an example, consider a bucket that contains the cheapest hotel. This hotel wins over the remaining objects of the bucket for the slot price, and the dialogue system could use this to ask whether price is considered more important than the other slots (i.e. an expression of priority). On the other hand, lower values mean that there are a number of ties among the objects of a bucket, and that the dialogue system is not able to pinpoint specific slots that could further restrict the top-ranked objects.
Figure 1 shows the architecture of our system. Hippalus is responsible for feeding information regarding the current knowledge view to the SLU and DST components. In addition, it provides the previously mentioned features to the multi-domain policy, and the current ranked list of results to the Natural Language Generation (NLG) and Text to Speech (TTS) components. Spoken Language Understanding (SLU) and dialogue state / belief tracking (DST / BT) have been extended with operations that correspond to the actions supported by Hippalus. Since Hippalus supports hierarchical and multi-valued attributes, the notion of slot has been extended to allow the definition of relations between slot values.

Dialogue Management
The objective is to conduct dialogues with as few semantic errors as possible that result in successfully completed tasks and satisfied users. As baselines for dialogue management, we created a hand-crafted Dialogue Manager (DM) and trained two statistical DMs in simulation. To this end, we developed an agenda-based user simulator (Schatzmann et al., 2007) that was designed to handle the complexities and demands of our SDS, e.g., real values for slots, intervals, hierarchies, all of our operators, hard constraints and preferences, etc., as well as to be able to handle multiple items being suggested by the system (in the sense of an overview of current results) and tell if these items satisfy the user's constraints. In order to handle a wide range of domains, we use the method proposed in (Wang et al., 2015), which extracts features describing each slot and action plus some general features pertaining to the dialogue so far and the current state of the knowledge base. Thus, even if new slots are added to the knowledge base, our dialogue manager does not need to be retrained. Specifically, we use some of the features proposed in (Wang et al., 2015; Papangelis and Stylianou, 2016) and the features described in the previous section, which are necessary to handle the increased complexity of the interaction.

Understanding and State Tracking
Translating the identified user intentions from SLU into a belief state is not trivial, even for slot-filling models with one or two operators (e.g., =, ≠). Moreover, as we aim to connect our system to live knowledge bases, it is important for SLU to be able to adapt over time, as well as handle out-of-domain input gracefully. As an initial approach to belief tracking, we follow some simple principles (Papangelis et al., 2017) in conjunction with an existing belief tracker. While this is straightforward for regular slots, we need a different kind of belief update for hierarchically valued or multi-valued slots. Specifically, for hierarchical slots we need to perform the belief update recursively, while still following the basic principles. As the constraints become more complex, traversing the hierarchy of values becomes non-trivial. In our prototype, we traverse the hierarchy once for each constraint (relevant to a specific hierarchical slot) and then combine the updates into a single belief update as the average for each value. When updating multi-valued slots, we assign the probability mass to each value that was mentioned (and not negated); this can be seen as generating (or removing) a single-valued "sub-slot" for each value on the fly.
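The two update rules described above could be sketched as follows. This is a simplified illustration under several assumptions: the hierarchy is a parent-to-children dictionary, a constraint on a value is taken to cover that value and everything below it, each constraint carries a single confidence, and the function names are hypothetical. The basic belief-tracking principles of the prototype are not modelled.

```python
def descendants(hierarchy, value):
    """Return `value` plus every value below it (recursive traversal)."""
    found = [value]
    for child in hierarchy.get(value, []):
        found.extend(descendants(hierarchy, child))
    return found


def hierarchical_update(hierarchy, all_values, constraints):
    """One hierarchy traversal per constraint, then a per-value average.

    `constraints` is a list of (value, confidence) pairs for one slot.
    """
    updates = []
    for value, confidence in constraints:
        covered = set(descendants(hierarchy, value))
        updates.append({v: confidence if v in covered else 0.0 for v in all_values})
    if not updates:
        return {v: 0.0 for v in all_values}
    return {v: sum(u[v] for u in updates) / len(updates) for v in all_values}


def multi_valued_update(belief, mentioned, negated=()):
    """Treat each value as its own sub-slot: add mentioned, drop negated."""
    updated = dict(belief)
    updated.update(mentioned)          # value -> probability mass
    for value in negated:
        updated.pop(value, None)
    return updated


# Hypothetical location hierarchy and two constraints on the same slot.
hierarchy = {"Kyoto": ["Shimogyo", "Nakagyo"]}
values = ["Kyoto", "Shimogyo", "Nakagyo"]
print(hierarchical_update(hierarchy, values, [("Kyoto", 0.8), ("Shimogyo", 0.6)]))
print(multi_valued_update({"wifi": 0.9}, {"parking": 0.7}, negated=["wifi"]))
```

Averaging keeps each constraint's traversal independent, at the cost of diluting a confident constraint when a second, narrower one covers only part of the same subtree.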

Preliminary Evaluation
To assess how well current statistical DMs perform in this setting, we compare a hand-crafted dialogue policy (HDC) against a DM trained with GP-SARSA (GPS) (Gašić et al., 2010) and one trained with Deep Q Networks with eligibility traces (DQN-λ), an adapted version of (Harb and Precup, 2017). HDC, GPS, and DQN (without eligibility traces) have been the top performing algorithms in a recent benchmark evaluation. We test the DMs under various conditions, presented in Table 1. Semantic Error refers to simulated errors, where we change either the type of dialogue act, slot, value, or operator that the simulated user issues, based on some probability. This can happen multiple times, to generate multiple SLU hypotheses. SLU N-Best Size is the maximum size of the N-best list of SLU hypotheses, after the simulated error stage. Sim. User Patience is the maximum number of times the simulated user tolerates the same action being issued by the DM. Max User Constraints is the maximum number of constraints in the simulated user's goal (e.g., price ≤ 70).
One important observation is that task success is very hard to define, as we consider a cluster of ranked items to be a valid system response. Some users may want to get exactly one option, while for others it may be acceptable to get up to four. Therefore, we add a feature to our user simulator to indicate the number of items a user will accept as a final result (provided that all of them meet the user's constraints). We sample this uniformly from the set {1, ..., acceptable}, as defined in Table 1 (Acceptable Num. Items). While this is a rough approximation of real-world conditions, we expect that it introduces one more layer of complexity that the statistical DMs need to model.
The dataset used for the evaluation consists of four domains (Hotels, Restaurants, Museums, and Shops) with databases populated with content scraped from the internet, containing a total of 84 slots and 714 objects.
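The simulated error stage and the sampling of the acceptable number of items described above could be sketched as follows. The representation of a user act as a dictionary, the function names, and the toy vocabulary are assumptions for illustration; the simulator used in the evaluation may differ in how it selects and scores hypotheses.

```python
import random

FIELDS = ["type", "slot", "value", "operator"]


def corrupt(act, error_rate, vocab, rng):
    """With probability `error_rate`, swap one field of the user act."""
    noisy = dict(act)
    if rng.random() < error_rate:
        field = rng.choice(FIELDS)
        alternatives = [x for x in vocab[field] if x != act[field]]
        if alternatives:
            noisy[field] = rng.choice(alternatives)
    return noisy


def n_best(act, error_rate, n_best_size, vocab, rng):
    """Apply the error stage repeatedly; keep at most `n_best_size` distinct hypotheses."""
    hypotheses = []
    for _ in range(n_best_size):
        hypothesis = corrupt(act, error_rate, vocab, rng)
        if hypothesis not in hypotheses:
            hypotheses.append(hypothesis)
    return hypotheses


def sample_acceptable_items(acceptable, rng):
    """Uniform draw from {1, ..., acceptable}."""
    return rng.randint(1, acceptable)


# Hypothetical vocabulary and a user act meaning "price <= 70".
vocab = {
    "type": ["inform", "request", "confirm"],
    "slot": ["price", "stars", "location"],
    "value": ["70", "4", "central"],
    "operator": ["=", "<=", ">="],
}
act = {"type": "inform", "slot": "price", "value": "70", "operator": "<="}
rng = random.Random(0)
print(n_best(act, error_rate=0.3, n_best_size=3, vocab=vocab, rng=rng))
print(sample_acceptable_items(4, rng))
```

Passing an explicit `random.Random` instance keeps simulated dialogues reproducible across runs, which helps when averaging results over several training runs.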
We evaluated the statistical DMs on a single domain and on a multi-domain setting (as described in section 4.1). Table 2 summarizes the results of our evaluation in simulation in the four environments we have defined, where each entry is the average of 5 runs of 1,000 training and 100 evaluation dialogues. DQN-λ performs better with the rich (dense) domain-independent feature set in the multi-domain scenario, likely because it is exposed to more variability in the data and therefore needs fewer iterations to learn well-performing policies. In fact, it copes very well under deteriorating conditions by learning to adapt, e.g., by asking for more confirmations. GPS shows the opposite trend, preferring the sparse belief state features of the single-domain scenario and needing many more dialogues (than the 1,000 allowed here) to reach good performance in the multi-domain case.

Conclusion
We have presented LD-SDS, a prototype information navigation SDS that connects to semantic knowledge bases to guide users towards making informed decisions. This direction is more challenging than simpler kinds of interaction. To evaluate the quality of the approach that we propose, we developed an agenda-based user simulator and applied it to train two statistical DMs. While we have demonstrated the feasibility of our approach, our system still needs to be trained and evaluated with human users, as in some cases statistical DMs may overfit simulators (or take advantage of certain aspects of them). We are therefore in the process of designing studies to collect text-based human-human data that will be used to train LD-SDS either end-to-end or by jointly optimising some of the components. In addition, we plan to evaluate our approach with live semantic knowledge bases and extend it to also exploit available unstructured information (out of domain). In the appendix we show an example dialogue with our system that highlights the extensions to the typical slot-filling approach.

A Supplemental Material
In this section, we provide an example interaction between a human user and the LD-SDS prototype. Figure 2 shows the system in operation. Figures 3 and 4 show examples of slots that can take multiple values or whose values have hierarchical relations, respectively.
S: 3 hotels match your preferences. Two are located in Shimogyo and one in Nakagyo. Ryokan Kyoraku in Shimogyo is cheaper with 59 pounds per night and 3 stars, and Daiwa Roynet is more expensive at 81 pounds per night but with 4 stars. Royal Park in Nakagyo is expensive as well at 79 pounds per night and 4 stars. Both expensive hotels offer more services than the cheaper one.
U: Thank you, goodbye. (0.97125274)
Table 3: Example interaction between a human user and our LD-SDS prototype. In the interest of space, the notes under each dialogue turn briefly show items that correspond to new information. The belief state is updated accordingly. ASR: Automatic Speech Recognition.