User Evaluation of a Multi-dimensional Statistical Dialogue System

We present the first complete spoken dialogue system driven by a multi-dimensional statistical dialogue manager. This framework has been shown to substantially reduce data needs by leveraging domain-independent dimensions, such as social obligations or feedback, which (as we show) can be transferred between domains. In this paper, we conduct a user study and show that the performance of a multi-dimensional system, which can be adapted from a source domain, is equivalent to that of a one-dimensional baseline, which can only be trained from scratch.


Introduction
Data-driven approaches to spoken dialogue systems (SDS) are limited by their reliance on substantial amounts of annotated data in the target domain. This can be addressed by transfer learning techniques, e.g. (Taylor and Stone, 2009), in which data from a source domain is leveraged to improve learning in a target domain. In particular, domain adaptation has been used in the context of dialogue systems (Gašić et al., 2017; Wang et al., 2015; Wen et al., 2016), focusing on identifying and exploiting similarities between domain ontologies in slot-filling tasks.
In contrast to this previous work, we take a multi-dimensional approach, which combines machine learning with linguistic theory. Following Bunt (2011), we exploit the linguistic phenomenon that utterances serve more than one function in a conversation, i.e. they have more than one dimension (see Section 2; see also https://dit.uvt.nl/). For example, the utterance "On what date would you like to fly to London?" both asks a task-oriented question and provides feedback about understanding the requested destination. We take advantage of this phenomenon by training separate, fully-statistical dialogue models for each dimension and generating system responses along multiple dimensions simultaneously. Such an SDS thus has the potential to adapt more efficiently to new domains by exploiting previously trained policies of the domain-independent dimensions, such as feedback and social conventions.
Previous implementations of multi-dimensional SDSs were mostly handcrafted (Akker et al., 2005; Petukhova et al., 2016). Keizer and Rieser (2017) were the first to present a statistical multi-dimensional dialogue manager (DM). Their results suggest up to an 80% reduction in data: a task success rate of over 90% can be achieved after only 2,000 dialogues when using pre-trained policies, whereas at least 10,000 dialogues are required without pre-training. In comparison, Gašić et al. (2017) achieve similar success rates for in-domain systems trained on 5,000 dialogues. However, Keizer and Rieser's findings were only tested in simulation.
In this paper, we present the first complete statistical SDS with a multi-dimensional DM, and the first crowdsourced human user evaluation of this type of system, comparing a one-dimensional baseline and three multi-dimensional variants using a novel web-based setup. A novel aspect of our statistical analysis is testing for equivalence. The four system variants were designed in such a way that we would expect their performance levels to be indistinguishable when using fully trained policies. Should the data provide statistical evidence for this, the multi-dimensional variants can be preferred due to their inherent potential for domain transfer.

A Multi-dimensional Dialogue Manager
Our DM is a partially-observable Markov decision process (POMDP; Young et al., 2013): it updates the dialogue state and then selects a response in the form of one or more dialogue acts.

Figure 1: An example of multiple dimensions in a dialogue: the user both greets the system and asks for a cheap Indian restaurant, before releasing the turn; the system then takes the turn while giving positive feedback, and indicates that it needs some time to retrieve the requested information; in the second part the system both provides this information and gives feedback about understanding the user's question (underlined).
Rather than selecting a single action from one set of possible actions, our DM consists of multiple dialogue act agents, each of which selects an action from a separate action set associated with one dimension. These action sets are based on three of the ten dimensions defined in the ISO standard for dialogue act annotation (ISO, 2012): Task (e.g. recommending a restaurant), AutoFeedback (e.g. asking the user to repeat/rephrase after a processing problem), and Social Obligations Management (SOM; e.g. responding to the user saying goodbye). These dimensions were considered the most important for supporting the kind of task-oriented dialogues targeted (see Fig. 1 for an example). While the Task dimension is domain-specific, AutoFeedback and SOM are applicable across domains.
Training the statistical DM on these three dimensions involves optimising three policies in parallel. A set of priority rules is used to combine the output of these policies into a single system response. The key advantage of this design is that the domain-independent policies (AutoFeedback and SOM) can be transferred and adapted to a new domain, leaving only the Task policy to be trained from scratch. In our previous work (Keizer and Rieser, 2017), we showed that a multi-dimensional DM with pre-trained policies reaches higher performance levels during the early stages of training. Here, we take an important step towards confirming this advantage in a real user study.

Our framework currently supports information-seeking domains, such as recommending restaurants or hotels based on the user's preferences. The domains are specified in terms of an ontology (describing slots such as price range and cuisine) and a database. Our domains are presented in Table 1. We use restaurant information as the target domain, but two of the system variants were trained for the hotels domain (source) and then adapted to the restaurant domain.
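As a minimal illustration of how per-dimension outputs can be combined, consider the following sketch. The act representation (simple dicts) and the rule set are hypothetical, not the actual implementation; they only illustrate the idea that one act per dimension is proposed and priority rules resolve conflicts into a single response.

```python
# Hypothetical priority rules combining one act per dimension into a
# single system response.  Acts are illustrative dicts, not the
# system's actual dialogue act representation.
def combine_acts(task_act, feedback_act, som_act):
    acts = []
    if feedback_act and feedback_act["type"] == "negative":
        # Negative auto-feedback (e.g. "please repeat") pre-empts the
        # task act, which would otherwise rest on a misunderstanding.
        acts.append(feedback_act)
    else:
        if feedback_act:
            acts.append(feedback_act)  # e.g. implicit confirmation
        if task_act:
            acts.append(task_act)      # e.g. recommend a restaurant
    if som_act:
        # Social obligations (e.g. returning a greeting) rarely
        # conflict with the task act and can simply be appended.
        acts.append(som_act)
    return acts
```

In this toy version, a negative feedback act cancels the task act, mirroring the intuition that task content should not be produced on top of a processing problem.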

Model Variants
For the evaluation, we follow Keizer and Rieser (2017)'s four DM variants and training regime. The one-dimensional one-dim baseline system contains a single dialogue act agent (ALL), and the corresponding policy was trained from scratch in the target domain. The multi-dimensional systems use three dialogue act agents, one of which is domain-specific (TASK) and the other two domain-general (AUTOFEEDBACK and SOM). For the base multi-dim system, the three policies are trained from scratch in the target domain, whereas the trans-fixed and trans-adapt variants employ transfer learning (Pan and Yang, 2010; Torrey and Shavlik, 2010): only the task-specific policy is trained from scratch, and the two domain-general policies are pre-trained in the source domain. For trans-fixed, the pre-trained policies are kept fixed during training in the target domain, whilst for trans-adapt, they are further trained in the target domain. The four fully trained DM versions are outlined in Table 2.

Training Details
All policies are optimised in simulation using multi-agent reinforcement learning with linear value function approximation, based on a single reward signal shared between the agents. To train all systems, we use the agenda-based user simulator of Keizer and Rieser (2017), which is based on (Schatzmann et al., 2007), along with the following error model. In addition to creating an n-best list of user dialogue act hypotheses from the 'true' user act, we also occasionally insert so-called 'processing problems', at the levels of perception (no ASR results received) or interpretation (ASR successful, but no NLU results received). We simulate a perception problem with 10% probability; in the case of no perception problem (90%), we simulate an interpretation problem with 10% probability; only when no processing problem is generated (81%) is an n-best list of dialogue act hypotheses produced. Following Thomson et al. (2012), the n-best lists are populated by taking the true user act and distorting it at a given semantic error rate for each of the positions, after which semantically equivalent hypotheses are merged. Based on the error rate, a Dirichlet distribution is used to generate confidence scores for the n-best list (resulting in a semantic top accuracy equal to the error rate), which are interpreted as probabilities by the DM when updating its user goal belief state.

In order to correctly interpret the evaluation results, note that in the current setup, the one-dim system serves as an upper-bound baseline system, as it needs no coordination between different agents during training, whilst generating (by construction) the same range of actions as the multi-dimensional systems. This is ensured by a set of priority heuristics which map action combinations to single acts.
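The error channel described above can be sketched as follows. The 10%/10% processing-problem probabilities are taken from the text; the act representation and the `distort` confusion function are placeholders, the hypothesis-merging step is omitted, and the Dirichlet confidence scores are generated here as normalised symmetric Gamma draws, following Thomson et al. (2012) only in spirit.

```python
import random

def distort(act):
    """Placeholder for sampling a confusable dialogue act."""
    return "confused(" + act + ")"

def simulate_channel(true_act, error_rate=0.3, n_best=3):
    if random.random() < 0.10:
        return "PERCEPTION_PROBLEM"       # no ASR result (10%)
    if random.random() < 0.10:
        return "INTERPRETATION_PROBLEM"   # ASR ok, no NLU result (9%)
    # Remaining ~81%: build an n-best list by distorting the true act
    # at the given semantic error rate in each position.
    hyps = [distort(true_act) if random.random() < error_rate else true_act
            for _ in range(n_best)]
    # Dirichlet(1,...,1) confidence scores via normalised Gamma draws,
    # ranked so the top hypothesis carries the highest score.
    weights = [random.gammavariate(1.0, 1.0) for _ in hyps]
    total = sum(weights)
    return sorted(((h, w / total) for h, w in zip(hyps, weights)),
                  key=lambda pair: pair[1], reverse=True)
```

The DM would then treat the returned scores as probabilities over user acts when updating its belief state.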

DM Evaluation in Simulation
To get a better picture of what we might expect in the human evaluation, we first ran evaluations with simulated data. The results obtained with the same settings as those used during training are shown in Table 3. As we hypothesised, the scores are very similar, with the one-dim system only slightly outperforming the multi-dimensional systems.
We then extended the setup with different semantic error rates (Thomson et al., 2012); the results are shown in Fig. 2. For each of the four DM versions, 5 training runs over 60k dialogues were carried out, resulting in a pool of 5 fully trained policies. The n-best size was set to 3, and the error rate was set to 30% for the target domain (restaurants) and 20% for the source domain (hotels). As an example of the priority heuristics: if the Task agent generates a recommendation action and the AutoFeedback agent generates a negative feedback action, the latter gets priority and the former is cancelled. The performance levels of the four systems are very similar at error rates between 10% and 40%, showing that the construction of the multi-dimensional versions in relation to the one-dim baseline is sound, and that there is no negative transfer, i.e., the adapted systems do not perform worse.

Evaluation Setup
We use crowdsourcing to evaluate our system, following Jurčíček et al. (2011) and Crook et al. (2014). In both of these works, a phone-based system was deployed, using a bespoke ASR and Voice over IP (VoIP) to connect speech input/output with the dialogue system. Here, we follow a similar evaluation methodology, but with a novel, simpler web-based interface using Google Chrome's built-in web speech API, embedded into the crowdsourcing task webpages. A detailed description of the technical setup can be found in Appendix A.

Crowdsourcing Setup
Users are recruited on the FigureEight crowdsourcing platform and asked to have a conversation with the system to find a venue meeting certain criteria (e.g. cheap Chinese food) and to get certain information about that venue (e.g. phone number and address). This scenario is specified in natural language, generated automatically from a set of task specifications randomly generated from the domain ontology. After each conversation, the user is given a questionnaire to rate the system.

Evaluation Metrics
The subjective evaluation metrics are derived from the questionnaire in Appendix B, with one yes/no question (Q1) and four 6-point Likert scale ratings. The following objective success metrics are derived from the logs: EntProv (the system recommended an entity matching the task constraints); ConstrConf (the system confirmed all task constraints in its recommendation); InfoProv (the system provided all information requested by the user).

Human User Evaluation
In total, 982 dialogues were collected (see Table 4), i.e. 246 dialogues per system variant on average.
We carried out a number of statistical tests to analyse the observed effect sizes in comparing the systems, including chi-squared tests (for success rates) and Mann-Whitney tests (for the Likert scale ratings), but also the 'two one-sided tests' procedure, or TOST (Schuirmann, 1987), for equivalence, as argued in Section 2.1. In a TOST scenario, the null hypothesis is that the difference in performance between two systems, ∆, exceeds a given threshold ε in magnitude (a hyperparameter). This translates into two one-sided null hypotheses, H_lo: ∆ ≤ −ε and H_hi: ∆ ≥ +ε. If both H_lo and H_hi are rejected, we can conclude that −ε < ∆ < +ε, i.e. the difference lies below the threshold. This test is much more conservative than failing to reject the null hypothesis in a conventional statistical test of significant difference.
The underlying one-sided tests can differ according to the nature of the data at hand. The default proposed by Schuirmann (1987) is t-tests; however, our data fails the normality assumption of a t-test. Therefore, for testing equivalence on Likert scale data, we use the robust t-test of Yuen and Dixon (1973), which does not assume normality, and for success rates, a pooled z-test with continuity correction (Fleiss et al., 2003, p. 53ff.). We used a threshold of ε = 10% for the equivalence tests.
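As a rough sketch of the TOST procedure for success rates, the two one-sided tests can be implemented with simple z-statistics. This sketch uses unpooled standard errors and omits the continuity correction applied in the paper, so it is illustrative rather than the exact test used; the counts in the usage example are made up.

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def tost_proportions(succ_a, n_a, succ_b, n_b, eps=0.10):
    """Two one-sided z-tests for equivalence of two success rates
    within a margin of +/- eps.  Returns (delta, p), where p is the
    TOST p-value: the larger of the two one-sided p-values."""
    p_a, p_b = succ_a / n_a, succ_b / n_b
    delta = p_a - p_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    # H_lo: delta <= -eps, rejected when (delta + eps) / se is large
    p_lo = 1.0 - norm_cdf((delta + eps) / se)
    # H_hi: delta >= +eps, rejected when (delta - eps) / se is small
    p_hi = norm_cdf((delta - eps) / se)
    return delta, max(p_lo, p_hi)
```

For example, with hypothetical counts of 220/246 versus 215/246 successes, both one-sided nulls are rejected at the 5% level, so the two systems would be declared equivalent within ±10%; with a large observed gap, the TOST p-value stays high and no equivalence is claimed.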

Evaluation Results
Table 5 shows the results for both objective and subjective metrics. Considering the task success metrics (SubjSucc, EntProv, ConstrConf, InfoProv), the one-dim system scores highest, although the trans-adapt system is often a close second and in some cases the top scorer. However, no statistically significant differences were detected; moreover, the one-dim system was found to be equivalent to the multi-dim (p = 0.024) and trans-adapt (p = 0.002) systems in perceived success (SubjSucc), and all three multi-dimensional systems were found to be equivalent to each other (p = 0.006, 0.009, and 0.031). Similarly, several equivalences were detected for the three objective success metrics, as illustrated in Appendix B. All systems are equivalent on the other subjective ratings (Q2-Q5).
To get a sense of the noise levels encountered by the different system variants, we collected crowdsourced transcriptions of 2,931 utterances from 496 dialogues (45.6% of the total number of turns in the evaluation corpus and 50.5% of collected dialogues), spread approximately evenly across all system variants. We then computed the word error rate (WER). The reference transcriptions were obtained by majority voting over the three transcriptions collected for each utterance, with manual fixes in case of a tie (20% of the utterances). Note that, following Armstrong (2014), we do not apply a correction for multiple comparisons (Lauzon and Caffo, 2009), since we only performed a limited number of pre-planned comparisons and did not require testing against the universal null hypothesis that "nothing is significant".
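WER is the word-level Levenshtein distance between reference and hypothesis, normalised by reference length; a minimal sketch (any off-the-shelf WER tool would do, and the example sentences below are invented):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)
```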
Results in Table 6 show comparable noise levels for all system variants. No significant differences were found, and equivalence tests confirmed WER to be equivalent across all systems. This confirms that none of the systems was disadvantaged and that the results in Table 5 are indeed comparable.

Conclusion and Future Work
In this paper, we have shown that a multi-dimensional, data-efficient dialogue manager performs on par with a one-dimensional, more data-hungry (upper) baseline. In doing so, we have developed a web-based platform for spoken dialogue system evaluation, carried out a crowdsourced user evaluation, and introduced statistical testing for equivalence in our analysis of the results. All code and data used in our experiments are available at: https://bitbucket.org/skeizer/madrigal

The results show that none of the systems consistently outperformed the others across the various metrics, and more importantly, that several statistical equivalences between the systems could be detected. We believe these results are encouraging, especially since we suspect that the web-based speech interface (with inherently varying microphone quality) and the crowdsourcing setup (with inherently varying conditions in which workers do their tasks) resulted in a relatively high level of variance in the data, making it harder to draw strong conclusions.
In the next stage of our research, we aim to further demonstrate the cross-domain transfer capability of the dialogue manager, for example by evaluating partially trained policies, and showing that policies that use transfer learning reach higher performance levels in the early stages of training, or that they achieve a given performance threshold with much less data.

A Dialogue System Setup
An overview of our crowdsourced dialogue system evaluation setup is shown in Fig. 3. The core component of the spoken dialogue system is the Dialogue System Server, which contains the DM (see Section 2), extended with a template-based NLG component and code for processing NLU results from Microsoft's LUIS (Williams et al., 2015). Our LUIS model was trained on 299 manually constructed and annotated example utterances.
The system is completed by a web-based user interface, which connects with both the Dialogue System Server and the Google Web Speech API. User audio input is first sent to Google ASR to obtain user utterance hypotheses with confidence scores. These are sent to the Dialogue System Server, which returns a system response utterance. Finally, this utterance is sent to Google TTS, which returns the synthesised system response audio to be played back to the user. The web interface is integrated into the FigureEight crowdsourcing platform for managing the evaluation (Section 3.1).

B Equivalence test results
See Figure 4 for a diagram of all statistically significant equivalences that we detected with respect to the individual evaluation criteria (see Sections 3.2 and 4).
Q1 [SubjSucc]: Did you find all the information you were looking for?
Please state your attitude towards the following statements:
Q2 [VoiceInt]: The system was easy to understand (the voice was intelligible).
Q3 [Understand]: In this conversation, the system understood what you said.
Q4 [AsExpect]: The system worked the way you expected it to during the conversation.
Q5 [WdUseAgain]: From your experience with the system, you think you would use it in the future to find a place to eat.

Table 2: Evaluated systems: one-dim is a one-dimensional (upper) baseline; the other systems are multi-dimensional. For each system, the table indicates whether the AutoFeedback and SOM policies were trained in the source domain, and whether they were trained from scratch, kept fixed, or adapted in the target domain.

Figure 2: Results in simulation at different error rates. Panels: (a) success rate; (b) average dialogue length; (c) average reward.

Table 1: Overview of task domains.

Table 4: Corpus statistics: the number of dialogues collected (NumDials) and the average number of turns per dialogue (NumTurns) with standard deviation (StDv).

Table 5: Overview of subjective and objective evaluation results (cf. Section 3.2 for metrics).

Table 6: WER analysis results (NumDials indicates the number of dialogues transcribed for each system).