Neural Dialogue Context Online End-of-Turn Detection

This paper proposes a fully neural network based dialogue-context online end-of-turn detection method that can utilize long-range interactive information extracted from both the target speaker's utterances and the interlocutor's utterances. The proposed method combines multiple time-asynchronous long short-term memory recurrent neural networks, which can capture the speaker's and the interlocutor's multiple sequential features and their interactions. Assuming application to spoken dialogue systems, we introduce the speaker's acoustic sequential features and the interlocutor's linguistic sequential features, each of which can be extracted in an online manner. Our evaluation confirms the effectiveness of taking into consideration the dialogue context formed by the speaker's and the interlocutor's utterances.


Introduction
In human-like spoken dialogue systems, end-of-turn detection, which determines whether a target speaker's utterance has ended or not, is an essential technology (Sacks et al., 1974; Meena et al., 2014; Ward and DeVault, 2015). It is widely known that heuristic end-of-turn detection based on non-speech duration determined by speech activity detection (SAD) is insufficient for smooth turn-taking (Hariharan et al., 2001).
Various methods have been examined for modeling end-of-turn detection (Koiso et al., 1998; Shriberg et al., 2000; Schlangen, 2006; Gravano and Hirschberg, 2011; Sato et al., 2002; Guntakandla and Nielsen, 2015; Ferrer et al., 2002, 2003; Atterer et al., 2008; Arsikere et al., 2014, 2015). A general approach is discriminative modeling using acoustic or linguistic features extracted from the target speaker's current utterance. In addition, recent studies use recurrent neural networks (RNNs), as they are suitable for directly capturing long-range sequential features without manual specification of fixed-length features such as maximum, minimum, or average values of acoustic features or bag-of-words features (Masumura et al., 2017; Skantze, 2017). We note, however, that the interlocutor's utterances are rarely used for end-of-turn detection. In dialogues, the target speaker's utterances are definitely impacted by the interlocutor's utterances (Heeman and Lunsford, 2017). We therefore expect that end-of-turn detection performance can be improved by capturing the "interaction" between the target speaker and the interlocutor.
In this paper, we propose a neural dialogue-context online end-of-turn detection method that can flexibly utilize both the target speaker's and the interlocutor's utterances. To the best of our knowledge, this paper is the first study to utilize dialogue-context information for neural end-of-turn detection. Although some recent natural language processing studies examine dialogue-context modeling (Liu and Lane, 2017; Tran et al., 2017), they cannot handle multiple acoustic and lexical features individually extracted from both the target speaker's and the interlocutor's utterances. In the proposed method, the target speaker's and the interlocutor's multiple sequential features, and their interactions, are captured by stacking multiple time-asynchronous long short-term memory RNNs (LSTM-RNNs). In order to achieve low-latency end-of-turn detection in spoken dialogue systems, acoustic sequential features extracted from the target speaker's speech and linguistic sequential features extracted from the interlocutor's (system's) responses are used for capturing interactive information.
In our experiments, human-human contact center dialogue data sets are used with the goal of constructing a human-like interactive voice response system. We show that the proposed method outperforms a variant that uses only target speaker's utterances.

Proposed Method
End-of-turn detection is the problem of detecting whether each end-of-utterance point is a turn-taking point or not. An utterance is defined as an inter-pausal unit (IPU) if it is surrounded by non-speech units (Koiso et al., 1998). The speech/non-speech units are estimated by SAD.
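To make the IPU definition concrete, segmentation of a frame-level SAD output into IPUs can be sketched as follows. This is a minimal pure-Python sketch, not the paper's implementation; the frame shift and the 100 ms pause threshold are configurable assumptions.

```python
def segment_ipus(sad_labels, frame_shift_ms=10, pause_thresh_ms=100):
    """Segment a frame-level SAD output into inter-pausal units (IPUs).

    sad_labels: list of 0/1 per frame (1 = speech).
    Returns a list of (start_frame, end_frame) pairs, end exclusive.
    An IPU ends only when the following non-speech run exceeds the
    pause threshold; shorter pauses are absorbed into the same IPU.
    """
    min_pause = pause_thresh_ms // frame_shift_ms
    ipus, start, silence = [], None, 0
    for i, lab in enumerate(sad_labels):
        if lab:
            if start is None:
                start = i          # open a new IPU at the first speech frame
            silence = 0            # a pending short pause is absorbed
        elif start is not None:
            silence += 1
            if silence >= min_pause:            # pause long enough: close the IPU
                ipus.append((start, i - silence + 1))
                start, silence = None, 0
    if start is not None:                       # IPU still open at end of signal
        ipus.append((start, len(sad_labels) - silence))
    return ipus
```

With the defaults above, a 3-frame (30 ms) pause stays inside one IPU, while a 12-frame (120 ms) pause closes it.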
In dialogue-context-based online end-of-turn detection, all past information of both the target speaker's and the interlocutor's utterances up to the speaker's current end-of-utterance can be utilized for extracting context information. The estimated label is either end-of-turn or not. The label of the t-th target speaker's end-of-utterance in a conversation can be decided by:

\hat{l}^{(t)} = \underset{l}{\operatorname{argmax}} \, P(l \mid S^{(1:t)}, C^{(1:t)}, \Theta),

where \Theta denotes a model parameter and \hat{l}^{(t)} is the estimated label of the t-th speaker's end-of-utterance. S^{(1:t)} represents the speaker's utterances \{S^{(1)}, \cdots, S^{(t)}\}, where S^{(t)} is the t-th utterance. C^{(1:t)} represents the interlocutor's utterances \{C^{(1)}, \cdots, C^{(t)}\}, where C^{(t)} is the t-th interlocutor's utterance, which occurred just before S^{(t)}. Note that there are some exceptional cases in which the t-th interlocutor's utterance is empty.
The t-th speaker's utterance involves N kinds of sequential features:

S^{(t)} = \{s_1^{(t)}, \cdots, s_N^{(t)}\}, \quad s_n^{(t)} = \{a_{n,1}^{(t)}, \cdots, a_{n,I_n}^{(t)}\},

where s_n^{(t)} represents the n-th sequential feature in S^{(t)}, and a_{n,i}^{(t)} is the i-th frame's feature in s_n^{(t)}. In the same way, the t-th interlocutor's utterance involves M kinds of sequential features:

C^{(t)} = \{c_1^{(t)}, \cdots, c_M^{(t)}\}, \quad c_m^{(t)} = \{b_{m,1}^{(t)}, \cdots, b_{m,J_m}^{(t)}\},

where c_m^{(t)} represents the m-th sequential feature in C^{(t)}, and b_{m,j}^{(t)} is the j-th frame's feature in c_m^{(t)}.
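As a concrete illustration of this nested layout, one speaker utterance with two feature streams might look like the following. Feature names, frame shifts, lengths, and values are invented for illustration only:

```python
# Illustrative layout of the t-th speaker utterance S^{(t)}: N named feature
# streams, each a frame-level sequence of (possibly vector-valued) features.
S_t = {
    # [F0, delta-F0] per 5 ms frame (3 frames here)
    "f0": [[120.0, 0.0], [118.5, -1.5], [121.0, 2.5]],
    # 8-dimensional bottleneck vectors per 10 ms frame (2 frames here)
    "senone": [[0.1] * 8, [0.2] * 8],
}
```

Because the streams use different frame shifts, their lengths differ; this is exactly why each stream needs its own "time-asynchronous" encoder rather than a single frame-synchronous one.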

Fully Neural Network based Modeling
This paper proposes a neural dialogue-context online end-of-turn detection method that is modeled using fully neural networks. In order to model P(l^{(t)} \mid S^{(1:t)}, C^{(1:t)}, \Theta), we extend stacked time-asynchronous sequential networks, which include multiple time-asynchronous LSTM-RNNs for embedding complete sequential information into a continuous representation (Masumura et al., 2017). In order to capture long-range dialogue-context information, the proposed method employs two stacked time-asynchronous sequential networks, one each for the target speaker's and the interlocutor's utterances. In addition, the proposed method introduces another sequential network to capture interactions of both sides' utterances. Figure 1 details the structure of the proposed method.

In the proposed method, each feature within an utterance is individually embedded into a continuous representation in an asynchronous manner. To this end, LSTM-RNNs are prepared for the individual sequential features in both the target speaker's and the interlocutor's utterances. Each sequential feature is embedded as:

A_n^{(t)} = LSTM(a_{n,1}^{(t)}, \cdots, a_{n,I_n}^{(t)}; \theta_n^A),
B_m^{(t)} = LSTM(b_{m,1}^{(t)}, \cdots, b_{m,J_m}^{(t)}; \theta_m^B),

where A_n^{(t)} denotes a continuous representation that embeds the n-th sequential feature within the t-th target speaker's utterance, and B_m^{(t)} denotes a continuous representation that embeds the m-th sequential feature within the t-th interlocutor's utterance. LSTM() represents a function of the unidirectional LSTM-RNN layer. \theta_n^A and \theta_m^B are the model parameters for the n-th sequence in the target speaker's utterance and the m-th sequence in the interlocutor's utterance, respectively.
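The embedding step above can be illustrated with a toy, pure-Python LSTM: each feature stream gets its own recurrence, and the final hidden state serves as the stream's fixed-size embedding. This is a sketch under heavy simplification (a single scalar unit, hand-picked weights, no biases); a real system would use a deep-learning library.

```python
import math

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class ToyLSTM:
    """Single-unit LSTM over scalar inputs; weights are illustrative only."""

    def __init__(self, wi, wf, wo, wc, ui, uf, uo, uc):
        self.w = (wi, wf, wo, wc)   # input weights: input/forget/output/cell gates
        self.u = (ui, uf, uo, uc)   # recurrent weights, same order

    def embed(self, xs):
        """Consume the sequence xs and return the final hidden state."""
        h = c = 0.0
        for x in xs:
            i = _sigmoid(self.w[0] * x + self.u[0] * h)   # input gate
            f = _sigmoid(self.w[1] * x + self.u[1] * h)   # forget gate
            o = _sigmoid(self.w[2] * x + self.u[2] * h)   # output gate
            g = math.tanh(self.w[3] * x + self.u[3] * h)  # candidate cell value
            c = f * c + i * g
            h = o * math.tanh(c)
        return h

# One LSTM per feature stream: sequences of different lengths and frame rates
# ("time-asynchronous") are embedded independently into fixed-size summaries.
f0_lstm = ToyLSTM(0.5, 0.4, 0.6, 0.3, 0.1, 0.2, 0.1, 0.2)
A_n = f0_lstm.embed([0.2, 0.4, 0.1, 0.3])   # embedding of one toy F0 sequence
```

The key design point is that no alignment between streams is needed: each encoder reads its own sequence at its own rate, and only the fixed-size summaries are combined later.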
The continuous representations individually formed from each sequential feature are merged to yield an utterance-level continuous representation as follows:

x^{(t)} = [A_1^{(t)\top}, \cdots, A_N^{(t)\top}]^\top, \quad y^{(t)} = [B_1^{(t)\top}, \cdots, B_M^{(t)\top}]^\top,

where x^{(t)} and y^{(t)} represent utterance-level continuous representations for the t-th target speaker's utterance and the t-th interlocutor's utterance, respectively.
In order to capture long-range contexts, the target speaker's utterance-level continuous representations and the interlocutor's utterance-level continuous representations are individually embedded into a continuous representation. The t-th continuous representation, which embeds everything from the start of the dialogue up to the current end-of-utterance, is defined as:

X^{(t)} = LSTM(x^{(1)}, \cdots, x^{(t)}; \theta^X), \quad Y^{(t)} = LSTM(y^{(1)}, \cdots, y^{(t)}; \theta^Y),

where X^{(t)} denotes a continuous representation that embeds the speaker's utterances up to the t-th speaker's end-of-utterance, and Y^{(t)} denotes a continuous representation that embeds the interlocutor's utterances up to the t-th interlocutor's end-of-utterance. \theta^X and \theta^Y are the model parameters for the target speaker's utterance-level LSTM-RNN and the interlocutor's utterance-level LSTM-RNN, respectively. In addition, to consider the interaction between the target speaker and the interlocutor, both utterance-level continuous representations are additionally summarized as:

Z^{(t)} = LSTM([x^{(1)\top}, y^{(1)\top}]^\top, \cdots, [x^{(t)\top}, y^{(t)\top}]^\top; \theta^Z),

where Z^{(t)} denotes a continuous representation that embeds all dialogue-context sequential information up to the t-th target speaker's end-of-utterance, and \theta^Z represents the model parameter.
In an output layer, the posterior probability of end-of-turn in the t-th target speaker's end-of-utterance is defined as:

O^{(t)} = SOFTMAX([X^{(t)\top}, Y^{(t)\top}, Z^{(t)\top}]^\top; \theta^O),

where SOFTMAX() is a softmax function and \theta^O is the model parameter for the softmax function. O^{(t)} corresponds to P(l^{(t)} \mid S^{(1:t)}, C^{(1:t)}, \Theta). Summarizing the above, \Theta is represented as \{\theta_1^A, \cdots, \theta_N^A, \theta_1^B, \cdots, \theta_M^B, \theta^X, \theta^Y, \theta^Z, \theta^O\}. In training, the parameters can be optimized by minimizing the cross entropy between a reference probability and an estimated probability:

\hat{\Theta} = \underset{\Theta}{\operatorname{argmin}} \, - \sum_{d \in D} \sum_{t} \sum_{l} \hat{o}_{l,d}^{(t)} \log o_{l,d}^{(t)},

where \hat{o}_{l,d}^{(t)} and o_{l,d}^{(t)} are the reference probability and the estimated probability of label l for the t-th end-of-utterance in the d-th conversation, respectively, and D represents a training data set.
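The overall wiring of the model can be sketched end to end. To keep the sketch self-contained, the feature-level and utterance-level LSTMs are replaced by mean-pooling stand-ins and the softmax input by a trivial scalar score; only the stream structure (separate speaker and interlocutor summaries, a joint interaction stream, a softmax output, and a cross-entropy loss) follows the description above.

```python
import math

def pool(vectors):
    """Stand-in for an LSTM summary: componentwise mean of a list of vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def forward(speaker_utts, interloc_utts):
    """speaker_utts / interloc_utts: lists over t of lists of feature
    embeddings (A_n / B_m), each embedding a small vector."""
    xs = [pool(u) for u in speaker_utts]    # x^{(t)}: merge per-utterance features
    ys = [pool(u) for u in interloc_utts]   # y^{(t)}
    X = pool(xs)                            # dialogue-level speaker summary
    Y = pool(ys)                            # dialogue-level interlocutor summary
    zs = [x + y for x, y in zip(xs, ys)]    # z^{(t)}: concat of both streams
    Z = pool(zs)                            # interaction summary
    joint = X + Y + Z                       # concatenated final representation
    score = sum(joint)                      # toy stand-in for a linear layer
    return softmax([score, -score])         # [P(end-of-turn), P(not)]

def cross_entropy(ref, est):
    """Training criterion: cross entropy between reference and estimate."""
    return -sum(r * math.log(e) for r, e in zip(ref, est) if r > 0.0)
```

The point of the sketch is the topology, not the arithmetic: the speaker stream, the interlocutor stream, and the interaction stream are summarized separately and only fused at the output layer, exactly mirroring X^{(t)}, Y^{(t)}, and Z^{(t)} above.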

Features for Spoken Dialogue Systems
In neural dialogue-context-based online end-of-turn detection, various sequential features can be leveraged for capturing both the target speaker's and the interlocutor's utterances. In spoken dialogue systems, the interlocutor is the system. Therefore, lexical information generated by the system's response generation module can be utilized. This paper uses pronunciation sequences and word sequences as the interlocutor's sequential features. In the proposed modeling, we use both symbol sequences by converting them into continuous vectors. On the other hand, the target speaker's utterances are speech. This paper introduces fundamental frequencies (F0s) and senone bottleneck features, inspired by Masumura et al. (2017). The senone bottleneck features, which extract phonetic information as continuous vector representations, offer strong performance without recourse to lexical features.
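The conversion of symbol sequences into continuous vectors mentioned above amounts to a trainable lookup table (equivalently, a linear transformation of a one-hot encoding). A minimal sketch follows; the vocabulary, the dimensionality, and the random initialization are illustrative assumptions, not the paper's setup.

```python
import random

EMBED_DIM = 4           # the paper uses 128 dimensions; kept tiny here
vocab = {"<unk>": 0, "hello": 1, "thank": 2, "you": 3}

# Trainable lookup table: one row per symbol. In a real model these rows are
# optimized jointly with the rest of the network; here they are just random.
rng = random.Random(0)  # fixed seed so the sketch is deterministic
embedding = [[rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
             for _ in range(len(vocab))]

def embed_words(words):
    """Map each symbol to its continuous vector; unknowns map to <unk>."""
    return [embedding[vocab.get(w, vocab["<unk>"])] for w in words]

vecs = embed_words(["thank", "you", "goodbye"])
```

The same mechanism applies to pronunciation symbols; only the vocabulary changes.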

Experiments
This paper employed Japanese simulated contact center dialogue data sets instead of human-computer dialogue data sets. The data sets include 330 dialogues and 6 topics. One dialogue is one telephone call between one operator and one customer, in which each speaker's speech was separately recorded. In order to simulate interactive voice response applications, we regard the operator as the interlocutor and the customer as the target speaker. We divided each data set into speech units and non-speech units using an LSTM-RNN based SAD (Eyben et al., 2013) trained using various Japanese speech data. An utterance is defined as a unit surrounded by non-speech units whose duration is more than 100 ms. Turn-taking points and backchannel points were manually annotated for all dialogues. The evaluation used 6-fold cross validation in which the training and validation data were 5 topics and the test data was 1 topic. Detailed setups are shown in Table 1, where #calls, #utterances, and #turns represent the number of calls, utterances, and end-of-turn points, respectively.

To realize a comprehensive evaluation, we examined various conditions. In the proposed modeling, the unit size of the LSTM-RNNs was unified to 256. For training, the mini-batch size was set to 2 calls. The optimizer was Adam with the default setting. Note that a part of the training sets was used as the data for early stopping. We constructed five models by varying the initial parameters for individual conditions and evaluated the average performance. When using only the target speaker's utterances or only the interlocutor's utterances, only the required components were used for building the proposed modeling.

We used the following sequential features. F0 represents 2-dimensional sequential features of F0 and ∆F0; the frame shift was set to 5 ms. SENONE represents 256-dimensional senone bottleneck features extracted from a 3-layer senone LSTM-RNN with 256 units trained on a corpus of spontaneous Japanese speech (Maekawa et al., 2000).
Its frame shift was set to 10 ms, and the bottleneck layer was set to the third LSTM-RNN layer. PRON represents pronunciation sequences, and WORD represents word sequences of the interlocutor's utterances. The lexical features were introduced by converting them into 128-dimensional vectors through a linear transformation that was also optimized in training.

Table 2 shows the experimental results. We used the evaluation metrics of recall, precision, macro F-value, and accuracy. The results gained when using only the target speaker's utterances are shown in (1)-(3). In terms of F-value and accuracy, (3) outperformed (1) and (2). This confirms that stacked time-asynchronous sequential network based modeling is effective for combining multiple sequential features. The results gained when using only the interlocutor's utterances are shown in (4)-(6). Among them, (6) attained the best performance, although its performance was inferior to that of (1)-(3). Still, (4)-(6) outperformed random end-of-turn decision making. This indicates that the interlocutor's utterances are effective in improving online end-of-turn detection performance. The results of the proposed method, which takes both the target speaker's and the interlocutor's utterances into consideration, are shown in (7) and (8). In terms of F-value and accuracy, (7) outperformed (2) and (5).

Results
These results indicate that interaction information is effective for detecting end-of-turn points. The best results were attained by (8), which utilized both multiple target speaker's features and multiple interlocutor's features. The sign test results verified that (8) achieved statistically significant performance improvement (p < 0.05) over (3).

Conclusions
In this paper, we proposed a neural dialogue-context online end-of-turn detection method. The main advance of the proposed method is that it takes long-range interaction information between the target speaker's and the interlocutor's utterances into consideration. In experiments using contact center dialogue data sets, the proposed method, which leveraged both the target speaker's multiple acoustic features and the interlocutor's multiple lexical features, achieved significant performance improvements compared with a method that only utilized the target speaker's utterances.