Turn-taking phenomena in incremental dialogue systems

In this paper, a taxonomy of turn-taking phenomena is introduced, organised according to the level of information conveyed. It aims to provide a better grasp of the behaviours used by humans while talking to each other, so that they can be methodically replicated in spoken dialogue systems. Five phenomena have been implemented in a simulated environment: the system barge-in with three variants (resulting from either an unclear, an incoherent or a sufficient user message), the feedback and the user barge-in. The experiments reported in the paper illustrate that choosing how to implement such phenomena is delicate, as their impact on the system's performance varies.


Introduction
A spoken dialogue system is said to be incremental when it does not wait until the end of the user's utterance in order to process it (Dohsaka and Shimazu, 1997; Allen et al., 2001; Schlangen and Skantze, 2011). New audio information is captured by an incremental Automatic Speech Recognition (ASR) module at a certain frequency (Breslin et al., 2013) and at each new step, the partial available information is processed immediately. Therefore, the system is able to replicate a rich set of turn-taking phenomena (TTP) that are performed by human beings when talking to each other (Sacks et al., 1974; Clark, 1996). Replicating these TTP in dialogue systems can help to make them more efficient (e.g. El Asri et al., 2014) and enhance their ability to recover from misunderstandings (Skantze and Schlangen, 2009).
Several contributions have already explored different TTP like end-point detection (Raux and Eskenazi, 2008), backchannels (Meena et al., 2014; Visser et al., 2014), feedback (Skantze and Schlangen, 2009) or barge-in (Selfridge et al., 2013; Ghigi et al., 2014). However, these studies have been performed separately, with no unified view and no comparison of the respective merits, importance and co-influence of the different TTP. In order to have a better grasp of the concept of turn-taking in a dialogue and a guideline for the implementation, we felt the need to introduce a taxonomy of these TTP. Our motivation is to clarify which TTP are interesting to implement given the task at hand. As an illustration, five TTP (which we assume have the best properties to improve the dialogue efficiency) have been implemented and compared in a slot-filling simulated environment.
Section 2 introduces the TTP taxonomy and Section 3 describes the simulated environment, the experimental setup and the results. We then conclude in Section 4.

Turn-taking phenomena taxonomy
In linguistics and philosophy of language, a distinction is made between two different levels of speech act analysis: locutionary acts and illocutionary acts (Austin, 1962; Searle, 1969). Loosely speaking, a locutionary act refers to the act of uttering sounds without taking their meaning into account. When the semantic information is the object of interest, it is an illocutionary act. In (Raux and Eskenazi, 2009), four basic turn-taking transitions are presented: turn transitions with gap, turn transitions with overlap, failed interruptions and time outs. There, only the mechanics of turn-taking are studied, at a locutionary level. In (Gravano and Hirschberg, 2011), the authors propose a turn-taking labeling scheme, which is a modified version of the original classification of interruptions and smooth speaker-switches introduced in (Beattie, 1982). This classification is richer than the one in (Raux and Eskenazi, 2009) as the meaning of the turn-taker's utterance is taken into account. From a computational point of view, it is more interesting to add high-level information to classify these behaviours, as semantics clearly influence turn-taking decisions (Duncan, 1972; Gravano and Hirschberg, 2011). In this paper, a more fine-grained taxonomy of TTP is introduced where utterances are considered at both the locutionary and illocutionary levels. During a floor transition, the person who starts speaking will be called T (Taker) whereas the person who was speaking just before will be called G (Giver). At the beginning of the dialogue, the person who initiates the dialogue will be called T and the other G by convention.
[Table 1. Columns: T REF IMPL, T REF RAW, T REF INTERP, T MOVE. Rows: G NONE (FLOOR TAKING IMPL, INIT DIALOGUE); G FAIL (FAIL IMPL, FAIL RAW, FAIL INTERP); G INCOHERENCE (INCOHERENCE IMPL, INCOHERENCE RAW, INCOHERENCE INTERP).]
We classify TTP according to two criteria: the quantity of information that has been injected by G before the floor transition (rows in Table 1) and the quantity of information that T tries to add by taking the floor (columns in Table 1). Table 2 gives the meaning of the labels of the different criteria. At the beginning of the dialogue (G NONE), T can implicitly announce that she wants to take the floor, for instance by using hand gestures or by clearing her throat (FLOOR TAKING IMPL), or she can directly initiate the dialogue (INIT DIALOGUE). If G is already speaking, his message may not be understandable by T (G FAIL). T can warn G implicitly, by frowning for example (FAIL IMPL), or explicitly, either in a raw manner by saying Sorry? (FAIL RAW) or by pointing out what has not been understood (FAIL INTERP). In addition, even if the meaning of the message has been understood, it can be incoherent with the interaction context (G INCOHERENCE, e.g. trying to book a flight from a city with no airport). Again, T can warn G implicitly (INCOHERENCE IMPL) or explicitly (INCOHERENCE RAW and INCOHERENCE INTERP). In the case where G's utterance is not problematic but still incomplete (G INCOMPLETE), T can let him know that she understands what has been said so far by performing a BACKCHANNEL (Yes, uhum, etc.), by repeating his words exactly (FEEDBACK RAW) or by commenting on them (FEEDBACK INTERP, for example: Yesterday I went to this new Chinese restaurant in town... / Yeah Fing Shui / ...and it was a pretty good deal). If G utters enough information to move the dialogue forward (G SUFFICIENT), T can refer to an element in G's utterance implicitly (Aha) by reacting at the proper timing (REF IMPL), or explicitly in a raw (REF RAW, for example Ok, Sunday) or interpreted manner (REF INTERP, for example Yeah, Sunday is the only day when I am free). T can also interrupt G to add some information that is relevant to the course of the dialogue (BARGE IN RESP).
Finally, she can wait until G has finished his utterance (G COMPLETE) and warn him that he should add more information (REKINDLE, for example: And?) or start a new dialogue turn (END POINT).
In the rest of this paper, the five incremental TTP that are the most used in general, and therefore the most studied, are tested in a simulated environment: FAIL RAW (Ghigi et al., 2014), INCOHERENCE INTERP, FEEDBACK RAW and BARGE IN RESP, the latter from both the user's and the system's point of view.

Simulation

Service task
A personal agenda assistant task has been implemented in the simulated dialogue system (referred to as the service hereafter). The user can add events to her agenda as long as they do not overlap with existing events. She can also move events in the agenda or delete them (ADD, MODIFY and DELETE actions). An event corresponds to a title, a date, a time slot, a priority, and the list of alternative dates and time slots where the event can fit, in case the main date and slot are not available. For example: {title: house cleaning, date: January 6th, slot: from 18 to 20, priority: 3, alternative 1: January 7th, from 18 to 20, alternative 2: January 9th, from 10 to 12}.
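As an illustration only, the event structure and the non-overlap constraint described above could be sketched as follows. The class name Event, its field names and the helpers overlaps and can_add are assumptions made for this sketch; the paper does not specify how the service stores events, and the year used in the usage note is arbitrary.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Event:
    """Hypothetical representation of an agenda event."""
    title: str
    day: date
    start_hour: int  # inclusive
    end_hour: int    # exclusive
    priority: int
    # Fallback (day, start_hour, end_hour) triples, in order of preference.
    alternatives: list = field(default_factory=list)


def overlaps(a: Event, b: Event) -> bool:
    """Two events overlap if they share a day and their time slots intersect."""
    return (a.day == b.day
            and a.start_hour < b.end_hour
            and b.start_hour < a.end_hour)


def can_add(agenda: list, new_event: Event) -> bool:
    """An ADD action is accepted only if no existing event overlaps."""
    return not any(overlaps(new_event, existing) for existing in agenda)
```

With the house cleaning event above (January 6th, from 18 to 20) already in the agenda, can_add rejects a new event on the same day from 18 to 23 and accepts one from 20 to 22, since the end hour is treated as exclusive in this sketch.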

Overview
The architecture of the User Simulator (US) is built around six modules: the Natural Language Understanding (NLU) module, the Intent Manager, the Natural Language Generator (NLG), the Verbosity Manager, the ASR Output Simulator, and the Patience Manager. These modules are described in the following.
NLU module: The NLU module is very simple as the service's utterances are totally known by the US and no parsing is involved. Each one of them is associated with a specific dialogue act.
Intent Manager: The Intent Manager is, in a sense, the brain of the US, as it determines its next intent given the general goal and the last NLU result. The general goal depends on the scenario at hand, which is in turn determined by two lists of events: the initial list (InitList) and the list of events to add to the agenda during the dialogue (ToAddList). The Intent Manager tries to add each event from the latter given the constraints imposed by the former. If the events of both lists cannot all be kept, those with lower priorities are abandoned or deleted until a solution is reached.
The service asks for the different slot values in a mixed initiative way. At first, the user has the initiative in the sense that she is asked to provide all the slot values in the same utterance. If there is still missing information (because the user did not provide all the slot values or because of ASR noise), the remaining slot values are asked for one by one (system initiative).
NLG module: The NLG figures out the next sentence to utter given the current Intent Manager's output. A straightforward sentence is computed, for example, Add the event meeting Mary on July 6th from 18:00 until 20:00.
Verbosity Manager: The Verbosity Manager randomly expands the NLG output with some usual prefixes (like I would like to...) and suffixes (like please, if possible...). Also, a few sentences are replaced with off-domain words or repeated twice, as is the case in real dialogues (Ghigi et al., 2014). For questions concerning a specific slot, neither prefixes nor suffixes are added.
Patience Manager: When the dialogue lasts too long, the US can get impatient and hang up. The US patience corresponds to a threshold on each task duration. It is randomly sampled around a mean of 180 seconds for the experiments. A speech rate of 200 words per minute is assumed for the dialogue duration estimation (Yuan et al., 2006). Moreover, a silence of one second is assumed at each regular system/user transition and a two-second silence is assumed the other way round. For interruptions and accurate end-point detection, no silence is taken into account.
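The duration estimate used by the Patience Manager can be sketched as below. This is a minimal illustration under stated assumptions: the function names and the turn representation are hypothetical, "system/user transition" is read as the system giving the floor to the user, and an interrupted utterance is counted in full, which the text does not specify.

```python
WORDS_PER_MINUTE = 200        # speech rate assumed in the text
SILENCE_SYSTEM_TO_USER = 1.0  # seconds, regular system/user transition
SILENCE_USER_TO_SYSTEM = 2.0  # seconds, regular user/system transition


def speech_duration(text):
    """Seconds needed to utter `text` at the assumed speech rate."""
    return len(text.split()) * 60.0 / WORDS_PER_MINUTE


def dialogue_duration(turns):
    """Estimate the duration of a dialogue in seconds.

    `turns` is a list of (speaker, text, regular_transition) tuples, where
    speaker is "system" or "user". regular_transition is False when the turn
    ends with an interruption or an accurate end-point detection, in which
    case no silence is added before the next turn.
    """
    total = 0.0
    for index, (speaker, text, regular_transition) in enumerate(turns):
        total += speech_duration(text)
        is_last = index == len(turns) - 1
        if regular_transition and not is_last:
            total += (SILENCE_SYSTEM_TO_USER if speaker == "system"
                      else SILENCE_USER_TO_SYSTEM)
    return total


def user_hangs_up(turns, patience_seconds):
    """The simulated user hangs up once the estimate exceeds her patience."""
    return dialogue_duration(turns) > patience_seconds
```

At 200 words per minute each word costs 0.3 seconds, so in this sketch a five-word system prompt followed by a regular transition accounts for 2.5 seconds of the user's patience.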

ASR Output Simulator
The US can run either in a traditional mode, in which it provides a complete utterance to the system and then waits for a response, or in an incremental mode, in which a growing utterance is output at each new word. For example: I, I want, I want to, I want to add, etc. In incremental dialogue systems, the turn increment (called the micro-turn in this case) could be something other than the word (a short duration, for example).
The ASR output simulator can be used in both modes, but as the traditional mode is a special case of the incremental one, we describe the latter only. This module computes a noisy version of each word (substitution, deletion, or insertion). It also associates a confidence score with each new partial utterance. Moreover, a word in the ASR output can change later as new words pop in (Selfridge et al., 2011; McGraw and Gruenstein, 2012). In the following, this mechanism is referred to as the ASR instability. At each micro-turn, the system can keep listening to the US or decide to take the floor (see Section 3.4).
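The word-level incremental mode can be illustrated with a short generator. The function name partial_utterances is an assumption of this sketch, and the micro-turn is taken to be one word, as in the example above.

```python
def partial_utterances(utterance):
    """Yield the growing utterance, one more word at each micro-turn."""
    words = utterance.split()
    for count in range(1, len(words) + 1):
        yield " ".join(words[:count])


# The traditional mode is the special case where only the last partial
# utterance is sent to the system.
def complete_utterance(utterance):
    partials = list(partial_utterances(utterance))
    return partials[-1] if partials else ""
```

For instance, iterating over partial_utterances("I want to add") produces I, then I want, then I want to, then I want to add.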
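A very rough sketch of the word-level noise is given below. It is not the simulator used in the experiments: the per-word error probability, the uniform choice between the three error types and the replacement vocabulary are all assumptions, and neither the confidence score nor the ASR instability is modelled here.

```python
import random


def noisy_words(words, error_probability, vocabulary, rng):
    """Return a noisy copy of `words`.

    Each word is corrupted with probability `error_probability`; the error
    type (substitution, deletion or insertion) is drawn uniformly, which is
    an assumption of this sketch.
    """
    output = []
    for word in words:
        if rng.random() >= error_probability:
            output.append(word)  # word recognised correctly
            continue
        kind = rng.choice(("substitution", "deletion", "insertion"))
        if kind == "substitution":
            output.append(rng.choice(vocabulary))
        elif kind == "insertion":
            output.extend([word, rng.choice(vocabulary)])
        # deletion: the word is simply dropped
    return output
```

Passing a seeded random.Random instance as rng makes a simulated dialogue reproducible, which helps when comparing strategies under the same noise.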

Scheduler
The same architecture as in  is used. A Scheduler module is inserted between the service and the user simulator. As the ASR output utterance grows, the partial utterances are sent, at each micro-turn, to the Scheduler. In turn, the latter transfers them to the service and waits for its responses.
The aim of this module is to make turn-taking decisions. Given the system's last response and some other features and rules determined by the designer, or learned from data, the Scheduler decides whether or not to convey that response to the client immediately.

Dialogue example
In the following example, the user has to delete an event before adding another one (ASR noise is not introduced here):
SYSTEM: Hi. Welcome to your agenda management service. How can I help you?
USER: I would like to add the event birthday party on January 6th from 6 pm to 11 pm if it is possible.
SYSTEM: The time slot from 6 pm to 11 pm on January 6th overlaps with the event house cleaning on January 6th from 7 pm to 9 pm. How can I help you?

TTP implementation
Replicating some turn-taking phenomena like backchannels makes the system seem more realistic (Meena et al., 2014). In this work, the focus is on dialogue efficiency; therefore, the following TTP have been chosen for the implementation: FAIL RAW, INCOHERENCE INTERP, FEEDBACK RAW and BARGE IN RESP from both the user's and the system's point of view. At each micro-turn, the system has to pick an action among three options: to wait (WAIT), to convey the service's last response to the client (SPEAK) or to repeat the word at position n − 2 in the current partial request, where n is the current number of words (REPEAT; the two-word lag is due to the ASR instability). To replicate each selected TTP, a set of rules has been specified to make the proper decision. We review the triggering features related to each TTP, adapted to the task at hand (agenda filling).
FAIL RAW: Depending on the last system's dialogue act, a threshold relative to the number of words without detecting a key concept in the utterance has been set. In the case of an open question (where the system waits for all the information needed in one request), if no action type has been detected after 6 words, a FAIL RAW event is declared. The system waits for 3 words in the case of a yes/no question, for 4 words in the case of a date and for 6 words in the case of slots (some concepts need more words to be detected and the user may use additional off-domain words).
INCOHERENCE INTERP: This event is useful to promptly react to partial requests that would eventually lead to an error, not because they were not correctly understood, but because they are in conflict with the current dialogue state. If such an inconsistency is detected, the system waits for two words (ASR instability) and if it is maintained, it takes the floor to warn the user.
FEEDBACK RAW: If at time t, a new word is added to the partial utterance and the ratio between the last partial utterance's score and the one before last (which corresponds to the score of the last increment) is lower than 1/2, then the system waits for two words (because of the ASR instability), and if the word is still in the partial utterance, a REPEAT action is performed.
BARGE IN RESP (System): This TTP depends on the last system dialogue act, as it determines which kind of NLU concept the system is waiting for. Once it is detected, the system waits for two more words (ASR instability) and if the concept is maintained, it performs a SPEAK.
BARGE IN RESP (User): This event is triggered directly by the user (no system decision is involved). For each system dialogue act, the moment when a familiar user would barge in is manually defined in the simulator.
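The FAIL RAW and FEEDBACK RAW rules described above can be sketched as follows. The thresholds are those given in the text; the dialogue act names, the function names and the representation of partial utterances as (words, score) pairs are assumptions of this sketch, not the authors' implementation.

```python
# Word-count thresholds from the text, keyed by hypothetical names for the
# system's last dialogue act.
FAIL_RAW_THRESHOLDS = {
    "open_question": 6,
    "yes_no_question": 3,
    "date_question": 4,
    "time_slot_question": 6,
}


def fail_raw(last_system_act, words_heard, concept_detected):
    """True when a FAIL RAW event should be declared."""
    return (not concept_detected
            and words_heard >= FAIL_RAW_THRESHOLDS[last_system_act])


def feedback_raw(partials, ratio_threshold=0.5, delay=2):
    """Return the REPEAT decisions as (micro_turn_index, word) pairs.

    `partials` holds one (words, score) pair per micro-turn. A newly added
    word is suspicious when it makes the score drop below `ratio_threshold`
    times the previous score. The system then waits `delay` more micro-turns
    (ASR instability) and repeats the word only if it is still there.
    """
    repeats = []
    for t in range(1, len(partials) - delay):
        previous_words, previous_score = partials[t - 1]
        words, score = partials[t]
        if len(words) <= len(previous_words) or previous_score <= 0:
            continue  # no new word, or no usable previous score
        if score / previous_score < ratio_threshold:
            position = len(words) - 1
            later_words, _ = partials[t + delay]
            if (len(later_words) > position
                    and later_words[position] == words[position]):
                repeats.append((t + delay, words[position]))
    return repeats
```

When exactly one word is added per micro-turn, the repeated word sits at position n − 2 of the partial utterance available at decision time, which matches the definition of the REPEAT action.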
Dialogue duration and task completion are used as evaluation criteria. The task completion rate is the ratio between the number of dialogues where the user did not hang up (because of her patience limit) and the total number of dialogues.
The five implemented TTP have been tested individually and in an aggregated manner (referred to as the All strategy). They have also been compared to a non-incremental baseline (see Figures 1 and 2). Three dialogue scenarios and different word error rate (WER) levels were tested. For each strategy and each WER, 1000 dialogues have been simulated for each scenario. Figure 1 (resp. Figure 2) represents the mean duration (resp. the mean task completion), with the corresponding 95% confidence intervals, for the different strategies and for WER varying between 0 and 0.3.
The FEEDBACK RAW strategy performs best whereas INCOHERENCE INTERP does not improve over the baseline. This is due to the fact that the system has to deal with an open slot (whose set of possible values is not closed and known a priori): the event's description. The system mostly performs ADD actions, so the description slot can take any value and is never compared with existing data. This is the case in many applications, message dictation for example. However, in the case of the service at hand, an initial concept must be detected (the action); therefore, FAIL RAW improves the performance. BARGE IN RESP from the user's side is also useful here, as dialogues can be long and may contain repetitive system dialogue acts. The users get familiar with the system and may infer the end of the system's question before it ends. Admittedly, it is questionable whether users would be patient enough (up to several minutes) to achieve such simple tasks in real life, but for the sake of the simulation it was necessary to generate dialogues long enough for the studied TTP to influence them. In a next step, increasing the service's capacities (and complexity) will remedy that as a side effect. Finally, BARGE IN RESP from the system's side does not bring any improvement either. This is due to the fact that, in this task and because of input noise, the response to the initial open question is in most cases not enough to fill all the slots. The responses to single-slot questions do not contain suffixes, which explains the inefficiency of this last strategy (the US stops speaking as soon as the slot value is given).
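The two evaluation criteria can be computed from the simulated dialogues along the following lines. The function names are hypothetical, and the normal approximation (1.96 standard errors) is an assumption: the text does not say how the 95% confidence intervals were obtained.

```python
import math
import statistics


def task_completion_rate(hung_up):
    """Ratio of dialogues in which the user did not hang up.

    `hung_up` is a list of booleans, one per simulated dialogue.
    """
    return sum(1 for flag in hung_up if not flag) / len(hung_up)


def mean_with_ci95(values):
    """Return (mean, half_width) of a normal-approximation 95% interval."""
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean, 1.96 * standard_error
```

With 1000 dialogues per strategy, WER level and scenario, mean_with_ci95 would be applied to the 1000 durations of each configuration, and the interval reported as the mean plus or minus the half width.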

Conclusion and future work
This paper introduces a new taxonomy of turn-taking phenomena in human dialogue. An experiment in which five TTP are implemented has then been run in a simulated environment. It illustrates the potential of the taxonomy and shows that some TTP are worth replicating in some situations but not all. In future work, we plan to perform TTP analysis in the case of real users and to optimise the hand-crafted rules introduced here to operate the floor management in the system (when to take/give the floor and according to which TTP scheme) by using reinforcement learning (Sutton and Barto, 1998; Lemon and Pietquin, 2012).