Real-Time Speech Emotion and Sentiment Recognition for Interactive Dialogue Systems

In this paper, we describe our approach to enabling an interactive dialogue system to recognize user emotion and sentiment in real time. These modules allow otherwise conventional dialogue systems to have “empathy” and respond to the user while being aware of their emotion and intent. Emotion recognition from speech has previously consisted of feature engineering followed by machine learning, where the first stage causes delay in decoding time. We describe a CNN model that extracts emotion from raw speech input without feature engineering. This approach achieves an average accuracy of 65.7% on six emotion categories, a 4.5% improvement over conventional feature-based SVM classification. A separate CNN-based sentiment analysis module recognizes sentiment from speech recognition results, with an F-measure of 82.5 on human-machine dialogues when trained with out-of-domain data.


Introduction
Interactive dialogue systems and chatbots have been around for a while. Some, though not all, systems have statistical and machine learning modules that enable them to improve over time. With the pervasiveness of such systems on mobile devices, expectations of user experience have also increased. We expect human-machine dialogues to get closer to human-human dialogues. One important factor is that we expect machines to understand our emotions and intent and respond with empathy.
We propose a module for emotion and sentiment recognition in an interactive dialogue system. This module enables the system to assess the user's current emotional state and sentiment, and thereby decide the appropriate response at every dialogue state. The dialogue management system handles mixed-initiative dialogues while taking into account user emotion and sentiment, in addition to query content. Emotion and sentiment recognition enables our system to handle user queries previously unseen in training data. A user query labeled with positive emotion and sentiment receives a positive response, and similarly a negatively labeled statement receives a negative response. Examples are shown below:
User: I lost my job.
Response: Sorry to hear that. Success is in never giving up.
User: I just graduated from college!
Response: Congratulations! I am happy for you.
User: I went on a vacation last month and it was pretty bad, I lost all my luggage
Response: That doesn't sound so good. Hope your next vacation will be a good one.
User: My last vacation was amazing, I loved it!
Response: That sounds great. I would like to travel with you.
Meanwhile, dialogue systems like this need to recognize user emotion and sentiment in real time. Previous approaches to emotion recognition from speech involve feature engineering (Schuller et al., 2009) as a first step, which invariably causes delay in decoding. We are therefore interested in investigating a method that avoids feature engineering and instead uses a Convolutional Neural Network to extract emotion directly from raw audio input.

Speech Recognition
Our acoustic data is obtained from various public domain corpora and LDC corpora, comprising 1385 hours of speech. We use the Kaldi speech recognition toolkit (Povey et al., 2011) to train our acoustic models. We train deep neural network hidden Markov models (DNN-HMMs) using the raw audio together with encode-decode parallel audio. We apply layer-wise training of restricted Boltzmann machines (RBM) (Hinton, 2010), frame cross-entropy training with mini-batch stochastic gradient descent (SGD), and sequence discriminative training using the state Minimum Bayes Risk (sMBR) criterion.
The text data, approximately 90 million sentences, includes acoustic training transcriptions, filtered sentences from the Google 1 billion word LM benchmark (Chelba et al., 2013), and text from multiple other domains (web news, music, weather). Our decoder allows streaming of raw audio or CELP encoded data over TCP/IP or HTTP, and performs decoding in real time. The ASR system achieves a 7.6% word error rate on our clean speech test data.
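As an illustration of the metric quoted above, the following is a minimal sketch of word error rate computed as word-level edit distance divided by the reference length. The function name and the example sentences are hypothetical; this is the standard definition of the metric, not the scoring tool used by the authors.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions, deletions, insertions)
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of four reference words.
wer = word_error_rate("i lost my job", "i lost job")
```

A value of 0.076 from such a function, averaged over a test set, would correspond to the 7.6% figure reported in the text.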

Real-Time Emotion Recognition from Time-Domain Raw Audio Input
In recent years, we have seen successful systems that gave high classification accuracies on benchmark datasets of emotional speech (Mairesse et al., 2007) or music genres and moods (Schermerhorn and Scheutz, 2011). Most such work consists of two main steps, namely feature extraction and classifier learning, which is tedious and time-consuming. Extracting high and low level features (Schuller et al., 2009) and computing them over windows of audio signals typically takes a few dozen seconds for each utterance, making the response slower than the real-time, near-instantaneous response that users have come to expect from interactive systems. It also requires a lot of hand tuning. In order to bypass feature engineering, the current direction is to explore methods that can recognize emotion or mood directly from time-domain audio signals. One approach that has shown great potential is using Convolutional Neural Networks. In the following sections, we compare an approach using a CNN without feature engineering to a method that uses audio features with an SVM classifier.

Dataset
For our experiments on emotion recognition with raw audio, we built a dataset from the TED-LIUM corpus release 2 (Rousseau et al., 2014). It includes 207 hours of speech extracted from 1495 TED talks. We annotated the data with an existing commercial API followed by manual correction. We use six categories: criticism, anxiety, anger, loneliness, happiness, and sadness. We obtained a total of 2389 segments for the criticism category, 3855 for anxiety, 12708 for anger, 3618 for loneliness, 8070 for happiness, and 1824 for sadness. The segments have an average length slightly above 13 seconds.

Convolutional Neural Network model
The Convolutional Neural Network (CNN) model using raw audio as input is shown in Figure 1. The raw audio samples are first down-sampled to 8 kHz, in order to balance the sampling rate against the memory efficiency of the representation for longer segments. The CNN is designed with a single filter for real-time processing. We set a convolution window of size 200, which corresponds to 25 ms, and an overlapping step size of 50, equal to around 6 ms. The convolution layer performs the feature extraction, and models the variations among neighboring, overlapping frames. The subsequent max-pooling combines the contributions of all the frames, and gives as output a segment-based vector. This is then fed into a fully connected layer before the final softmax layer. These last layers perform a function similar to those of a fully connected Deep Neural Network (DNN), mapping the max-pooling output into a probability distribution over the desired emotional output categories.
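The forward pass described above can be sketched in NumPy as follows. The window size (200), step (50), and 8 kHz rate come from the text; the number of convolution kernels, the hidden layer width, the two-class output, the ReLU placement, and all function names are assumptions made for illustration, since the text does not specify them. The weights are random, so this only shows the data flow, not a trained model.

```python
import numpy as np

SAMPLE_RATE = 8000  # Hz, after down-sampling (from the text)
WINDOW = 200        # samples per convolution window: 200 / 8000 = 25 ms
STEP = 50           # hop between windows: 50 / 8000 = 6.25 ms


def frame_signal(audio, window=WINDOW, step=STEP):
    """Slice a 1-D signal into overlapping frames, shape (n_frames, window)."""
    if len(audio) < window:
        raise ValueError("audio is shorter than one convolution window")
    n_frames = 1 + (len(audio) - window) // step
    idx = np.arange(window)[None, :] + step * np.arange(n_frames)[:, None]
    return audio[idx]


def init_params(rng, n_kernels=32, n_hidden=16, n_classes=2):
    """Random weights; the layer sizes are assumptions, not from the paper."""
    return {
        "W_conv": rng.normal(0.0, 0.01, (WINDOW, n_kernels)),
        "b_conv": np.zeros(n_kernels),
        "W_fc": rng.normal(0.0, 0.1, (n_kernels, n_hidden)),
        "b_fc": np.zeros(n_hidden),
        "W_out": rng.normal(0.0, 0.1, (n_hidden, n_classes)),
        "b_out": np.zeros(n_classes),
    }


def forward(audio, params):
    """Convolution over raw samples, max-pooling over time, dense, softmax."""
    frames = frame_signal(audio)                                   # (T, 200)
    conv = np.maximum(frames @ params["W_conv"] + params["b_conv"], 0.0)
    pooled = conv.max(axis=0)              # one segment-level vector
    hidden = np.maximum(pooled @ params["W_fc"] + params["b_fc"], 0.0)
    logits = hidden @ params["W_out"] + params["b_out"]
    exp = np.exp(logits - logits.max())    # numerically stable softmax
    return exp / exp.sum()


rng = np.random.default_rng(0)
one_second = rng.normal(size=SAMPLE_RATE)  # stand-in for real audio
probs = forward(one_second, init_params(rng))
```

Because the max-pooling collapses the time axis, the output has the same size for any segment length, while the cost of the convolution grows with the number of frames.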
During decoding the processing time increases linearly with the length of the audio input segment. The largest time contribution is therefore due to the computations inside the network (He and Sun, 2015), which with a single convolution layer can be performed in negligible time for single utterances.
Convolutional Neural Networks (CNNs) have also recently achieved remarkably strong performance on the practically important task of sentence classification (Johnson and Zhang, 2014; Kalchbrenner et al., 2014; Kim, 2014). In our approach, we use a CNN-based classifier with Word2Vec to analyze the sentiment of recognized speech. We train a CNN with one layer of convolution and max pooling (Collobert et al., 2011) on top of word embedding vectors of size 300 trained on the Google News corpus. On top of the word vectors we apply convolutional sliding windows of size 3, 4 and 5 to represent multiple features. We then apply a max-pooling operation over the output vectors of the convolutional layer, which allows the model to pick up the most valuable information wherever it occurs in the input sentence, and gives as output a fixed-length sentence encoding vector.
We employ two distinct CNN channels: the first uses word embedding vectors directly as input, while the second fine-tunes them via backpropagation (Kim, 2014). All the hidden layer dimensions are set to 100. The final softmax layer takes as input the concatenated sentence encoding vectors of the two channels, and gives as output the probability distribution over a binary classification for sentiment analysis of text transcribed from speech by our speech recognizer.
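A minimal NumPy sketch of this two-channel sentence encoder is given below. The embedding size (300), window widths (3, 4, 5), and the dimension of 100 are taken from the text; reading "hidden layer dimensions" as 100 feature maps per window width, the zero-padding of short sentences, the ReLU after the convolution, and every function name are assumptions. Weights and embeddings are random placeholders, and the two embedding tables differ only during training (one frozen, one fine-tuned).

```python
import numpy as np

EMBED_DIM = 300      # word vector size (from the text)
WINDOWS = (3, 4, 5)  # convolution window widths in words (from the text)
N_FEATURES = 100     # assumed: feature maps per window width


def make_channel(rng):
    """Random kernels and biases for one channel, one kernel per width."""
    kernels = [rng.normal(0.0, 0.01, (w * EMBED_DIM, N_FEATURES))
               for w in WINDOWS]
    biases = [np.zeros(N_FEATURES) for _ in WINDOWS]
    return kernels, biases


def encode(embedded, kernels, biases):
    """Map (n_words, EMBED_DIM) to a fixed-length vector by max over time."""
    shortfall = max(WINDOWS) - embedded.shape[0]
    if shortfall > 0:  # assumed: zero-pad sentences shorter than 5 words
        embedded = np.vstack([embedded, np.zeros((shortfall, EMBED_DIM))])
    pooled = []
    for width, W, b in zip(WINDOWS, kernels, biases):
        n_pos = embedded.shape[0] - width + 1
        windows = np.stack([embedded[i:i + width].ravel()
                            for i in range(n_pos)])
        feats = np.maximum(windows @ W + b, 0.0)  # (n_pos, N_FEATURES)
        pooled.append(feats.max(axis=0))          # max over positions
    return np.concatenate(pooled)                 # 3 * N_FEATURES values


def predict(word_ids, static_emb, tuned_emb, chan_a, chan_b, W_out, b_out):
    """Concatenate both channel encodings and apply a binary softmax."""
    enc = np.concatenate([encode(static_emb[word_ids], *chan_a),
                          encode(tuned_emb[word_ids], *chan_b)])
    logits = enc @ W_out + b_out
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()
```

With these sizes each channel yields a 300-dimensional encoding, so the softmax layer sees a 600-dimensional input regardless of sentence length.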
To improve the performance of sentiment classification in real-time conversation, we compare the performance obtained with the Movie Review dataset used in Kim (2014) against that obtained with the Twitter Sentiment140 dataset. This Twitter dataset contains a total of 1.6M sentences with positive and negative sentiment labels. Before training the CNN model we apply the preprocessing described in Go et al. (2009).
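The text does not list the preprocessing steps. The sketch below assumes the normalizations commonly attributed to Go et al. (2009): replacing user mentions and URLs with placeholder tokens and shortening runs of three or more repeated characters to two. The function name, the exact regular expressions, and the token spellings are illustrative assumptions, not the authors' pipeline.

```python
import re


def preprocess_tweet(text):
    """Normalize a tweet with three assumed steps (see lead-in)."""
    # Replace links with a single placeholder token.
    text = re.sub(r"https?://\S+|www\.\S+", "URL", text)
    # Replace @mentions with a single placeholder token.
    text = re.sub(r"@\w+", "USERNAME", text)
    # Shorten any character repeated three or more times to two.
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    return text


cleaned = preprocess_tweet("@bob I'm sooooo happy http://t.co/abc")
```

Normalizations of this kind reduce vocabulary sparsity, which matters when the word vectors come from a different corpus such as Google News.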

Experimental setup
For the speech emotion detection module we set up our experiments as binary classification tasks, in which each segment is classified as either belonging to a particular emotion category or not. For each category the negative samples were chosen randomly from the clips that did not belong to the positive category. We took 80% of the data as the training set, and 10% each as development and test sets. The development set was used to tune the hyperparameters and determine the early stopping condition. We implemented our CNN with the Theano framework (Bergstra et al., 2010). We chose rectified linear units as the non-linear activation for the hidden layers, as they generally provided better performance than other functions. We used standard backpropagation training, with momentum set to 0.9 and an initial learning rate of 10^-5. As a baseline we used a linear-kernel SVM model from the LibSVM library (Chang and Lin, 2011) with the INTERSPEECH 2009 emotion feature set (Schuller et al., 2009), extracted with openSMILE (Eyben et al., 2010). These features are computed from a series of input frames and output a single static summary vector, e.g., smoothing, maximum and minimum values, and mean values of the frame-level features (Liscombe et al., 2003).
A similar one-layer CNN setup was also used for the sentiment module, again with rectified linear units as the activation function. As our dataset contains many neutral samples, we trained two distinct CNNs, one for positive sentiment and one for negative, and report the average results over the two categories. For each of the two training corpora we took 10% as the development set. As a baseline we used a method based on positive and negative emotion keywords from the Linguistic Inquiry and Word Count (LIWC 2015) dictionary (Pennebaker et al., 2015).
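The construction of the binary emotion tasks described above can be sketched as follows. The 80/10/10 split and the random choice of negatives come from the text; drawing as many negatives as positives, the fixed seed, and the function and variable names are assumptions, since the text does not state how many negative samples were used.

```python
import random


def build_binary_task(segments, positive_label, seed=0):
    """Build one-vs-rest train/dev/test sets for a single emotion.

    segments: list of (segment_id, label) pairs.
    Returns three lists of (segment_id, target), target 1 for the
    positive category and 0 otherwise.
    """
    rng = random.Random(seed)
    positives = [s for s, lab in segments if lab == positive_label]
    others = [s for s, lab in segments if lab != positive_label]
    # Assumption: a balanced task with as many negatives as positives.
    negatives = rng.sample(others, min(len(positives), len(others)))
    examples = [(s, 1) for s in positives] + [(s, 0) for s in negatives]
    rng.shuffle(examples)
    n_train = len(examples) * 8 // 10  # 80% training
    n_dev = len(examples) // 10        # 10% development, rest is test
    train = examples[:n_train]
    dev = examples[n_train:n_train + n_dev]
    test = examples[n_train + n_dev:]
    return train, dev, test


# Toy data: 100 "anger" clips and 300 clips from other categories.
toy = [(f"a{i}", "anger") for i in range(100)]
toy += [(f"o{i}", "sadness") for i in range(300)]
train, dev, test = build_binary_task(toy, "anger")
```

Repeating this for each of the six categories yields six independent binary classifiers, matching the per-category evaluation in the results.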

Speech emotion recognition
Results obtained by this module are shown in Table 1. In all the emotion classes considered our CNN model outperformed the SVM baseline, sometimes marginally (in the angry and sad classes), sometimes more significantly (happy and criticism classes). It is particularly important to point out that our CNN does not use any kind of preprocessed features. The lower results for some categories, even on the SVM baseline, may be a sign of inaccuracy in manual labeling. As ongoing work, we plan to improve the dataset with hand-labeled samples and to retrain the model periodically.
Processing time is another key factor of our system. We ran an evaluation of the time needed to perform all the operations required by our system (down-sampling, audio sample extraction and classification) on a commercial laptop. The system we used is a Lenovo x250 laptop with an Intel i5 CPU, 8 GB of RAM, and an SSD, running Ubuntu Linux 16.04. Our classifier took an average of 162 ms over 10 segments of length greater than 13 s, randomly chosen from our corpus. This corresponds to about 13 ms per second of speech, hence achieving real-time performance on typical utterances. The key to the low processing time is the lightweight structure of the CNN, which uses only one filter. We replicated the evaluation with the same 10 segments on a two-filter CNN, where the second filter spans 250 ms windows. Although we obtained higher performance with this structure in our preliminary experiments, the processing time rose to 6.067 s, which corresponds to around 500 ms per second of speech. This is over one order of magnitude higher than the one-filter configuration, making it less suitable for time-constrained applications such as dialogue systems.
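The per-second figures above follow from simple arithmetic, sketched here. The processing times are from the text; using exactly 13 s as the segment length is an assumption (the text says the segments are longer than 13 s), and the function name is illustrative.

```python
def ms_per_second_of_speech(processing_s, audio_s):
    """Processing time in milliseconds per second of input audio."""
    return 1000.0 * processing_s / audio_s


SEGMENT_S = 13.0  # assumed segment length; the text gives "greater than 13 s"

one_filter = ms_per_second_of_speech(0.162, SEGMENT_S)  # about 12.5 ms
two_filter = ms_per_second_of_speech(6.067, SEGMENT_S)  # about 467 ms
slowdown = 6.067 / 0.162                                # about 37 times
```

Both configurations run faster than real time in the strict sense (under 1000 ms per second of speech), but a 13 s utterance would delay the response by roughly 6 s with two filters compared with about 0.16 s with one.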

Sentiment inference from ASR
Results obtained by this module are shown in Table 3. Our CNN model achieved a 6.1% relative improvement in F-score over the baseline when trained with the larger Twitter dataset. The keyword-based method obtained slightly better accuracy and precision but much lower recall on our relatively small human-machine dialogue dataset (821 short utterances). However, we noticed that the accuracy of the keyword-based method fell sharply when tested on the larger Twitter dataset we used to train the CNN, yielding only 45% accuracy. We also expect to improve our CNN model in the future by training it with more domain-specific data, something not possible with a thesaurus-based method.

Conclusion
In this paper, we have introduced the emotion and sentiment recognition module for an interactive dialogue system. We described in detail the two parts involved, namely speech emotion and sentiment recognition, and discussed the results achieved. We have shown how deep learning can be used for such modules in this architecture, ranging from speech recognition and emotion recognition to sentiment recognition from dialogue. More importantly, we have shown that by using a CNN with a single filter, it is possible to obtain real-time performance on speech emotion recognition at 65.7% accuracy, directly from time-domain audio input, bypassing feature engineering. Sentiment analysis with a CNN also leads to an F-measure of 82.5 when trained on out-of-domain data. This approach to creating emotionally intelligent systems will help future robots to acquire empathy, so that rather than committing harm, they can act as friends and caregivers to humans.