Convolutional Attention Networks for Multimodal Emotion Recognition from Speech and Text Data

Emotion recognition has become a popular topic of interest, especially in the field of human-computer interaction. Previous works involve unimodal analysis of emotion, while recent efforts focus on multimodal emotion recognition from vision and speech. In this paper, we propose a new method of learning hidden representations between speech and text data alone, using convolutional attention networks. Compared to the shallow model, which employs simple concatenation of feature vectors, the proposed attention model performs much better in classifying emotion from the speech and text data contained in the CMU-MOSEI dataset.


Introduction
Emotion is not only a key driver of people's actions and thoughts, but also a fundamental part of human communication. As such, emotion recognition technology has become increasingly important in improving how humans interact with machines [1]. For instance, emotion recognition has been applied to analyze people's reactions to advertisements, thus creating better neuromarketing campaigns [2]. It has also gained popularity in various other domains such as healthcare [3], customer service, and gaming.
However, effective emotion recognition remains a challenging task, due to the sheer complexity of generalizing human emotions. For example, individuals express and perceive emotions differently, depending on numerous personal characteristics such as, but not limited to, age [4], gender [5] and race. Previous efforts have used deep learning based approaches to analyze emotion from a single mode of expression, such as facial expression [6] or speech [7]. Since deep learning based approaches have proven effective at learning and generalizing data with high-dimensional feature spaces like images, similar efforts to capture the complex feature space of emotional data have also shown promising results on several emotion databases such as EmoDB [8] and IEMOCAP [9]. Unfortunately, human emotion in real life is often expressed through a complex combination of multiple modes of expression, and much information is lost by employing unimodal analysis.
To address this problem, deep learning based approaches to multimodal emotion recognition have been researched extensively in recent years. Tzirakis et al. use deep residual networks to extract features from facial expressions and convolutional neural networks to extract features from speech, and concatenate them as input to an LSTM network [10]. Ranganathan et al. use deep belief networks on facial expressions, body expressions, vocal expressions, and physiological signals [11].
Inspired by these approaches, we suggest a new approach to multimodal emotion recognition from just speech and text data. Feature vectors from embedded text sequences and speech spectrograms are extracted using convolutional neural network based architectures. A direct way to learn about the relationship between these two feature vectors would be to utilize a shallow model, which is a simple concatenation of the two feature vectors. However, since the correlations between feature vectors from speech and text are highly non-linear, it is difficult for a shallow model to properly learn multimodal representations. Therefore, we utilize trainable attention mechanisms to learn non-linear correlations between these feature vectors. Attention mechanisms also help retain information in the time domain by forming a temporal embedding between the two feature vectors. Since speech features and context share the same time domain, using an attention mechanism may help to discover new information for emotion classification. Attention models have previously been applied successfully to tasks such as image caption generation [12], machine translation [13], and speech recognition [14].
* Corresponding Author: cchoi@orbisai.co
To demonstrate the benefits of this new approach, we use it to classify emotions from speech and text data provided in the CMU-MOSEI dataset into six classes: happy, angry, sad, surprised, disgusted, and fear [15]. We also compare this approach to the shallow model approach to show how the attention mechanism can improve capturing of multimodal correlations between text and speech.

Model
The attention network shown in Figure 1 comprises three separate convolutional neural networks: one each for feature extraction from the speech spectrogram and the word embedding sequence, and one for the emotion classifier. The outputs of the word embedding and spectrogram CNNs are used to compute an attention matrix representing the word embedding's correlation to the spectrogram with respect to the emotion labelling. This attention matrix is then combined with the input spectrogram and fed into the CNN based emotion classifier.
Input embedded word sequences have a size of $\mathbb{R}^{e \times L}$ (e: embedding size, L: max sequence length), while input spectrograms have a size of $\mathbb{R}^{f \times t}$ (f: frequency range, t: time domain after FT). The word embedding size is fixed at 300, and the raw text sentence length is capped at 40 words, so the word embedding sequence has a dimension of 300 by 40. Input spectrograms are derived by transforming raw audio signals with a sample rate of 8000 Hz, covering the frequency range of 0 to 4 kHz, and have a fixed size of 200 x 400.
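As an illustration of how a variable-length sentence could be brought to the fixed 300 by 40 input size, the following minimal sketch truncates or pads a sequence of word vectors. The function name `to_fixed_size` and the use of zero vectors for padding are assumptions made for this example; the text only states the embedding size and the 40-word cap.

```python
import numpy as np

EMBEDDING_SIZE = 300  # e, as stated in the text
MAX_WORDS = 40        # L, as stated in the text


def to_fixed_size(word_vectors):
    """Truncate or zero-pad a (num_words, 300) array to shape (300, 40).

    Zero padding is an assumption; the padding value is not specified
    in the text.
    """
    word_vectors = np.asarray(word_vectors, dtype=np.float32)
    fixed = np.zeros((MAX_WORDS, EMBEDDING_SIZE), dtype=np.float32)
    n = min(len(word_vectors), MAX_WORDS)
    fixed[:n] = word_vectors[:n]
    return fixed.T  # (e, L) layout


# Placeholder embeddings: a 12-word sentence and a 55-word sentence.
short = to_fixed_size(np.ones((12, EMBEDDING_SIZE)))
long = to_fixed_size(np.ones((55, EMBEDDING_SIZE)))
print(short.shape, long.shape)
```

Both inputs end up with the same shape, so they can be batched together; the shorter sentence carries 28 padded columns.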
To find the attention matrix between the two feature vectors, a 1 by 1 convolution is applied before calculating the dot product. The resulting attention matrix has a size of m x n, determined by the last feature vector after the 1 by 1 convolution. Each column of the attention matrix is the attention of the word sequence with respect to the spatial distribution of the input spectrogram. At the extend stage, the feature dimensions lost to max pooling in the convolutional layers are recovered: by broadcasting attention values by a factor of 2^P, where P is the number of max pooling layers applied, the attention values are applied to the entire width of the spectrogram.
Figure 1 Attention Networks for multimodal representation learning between speech and text data for emotion classification. Separate CNNs are used to extract features from speech spectrograms and embedded word sequences. An attention matrix of m x n dimension is calculated by simply taking a softmax of the dot products of the feature vectors. This attention matrix is then multiplied with the spectrogram input, and the result goes through a third CNN for emotion classification.
Attention values are calculated using the following equations: which essentially is the input spectrogram with attention information added. As shown in Figure 1, the attention matrix can be constructed with m x n dimensions, and when visualized looks like Figure 2.
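The computation summarized in the caption of Figure 1 (a softmax over dot products of the feature vectors, extension by 2^P, and multiplication with the spectrogram) can be sketched roughly as follows. This is a sketch under several assumptions, not the authors' exact formulation: the softmax is taken over spectrogram positions, the m x n matrix is averaged over the word axis to obtain one weight per position, the channel count `c` and number of pooling layers `P` are hypothetical, and random arrays stand in for the CNN outputs after the 1 by 1 convolutions.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def apply_attention(spectrogram, text_feat, speech_feat, num_pool):
    """Return the (m, n) attention matrix and the attended spectrogram.

    text_feat:   (m, c) text features after the 1 by 1 convolution
    speech_feat: (n, c) speech features after the 1 by 1 convolution
    """
    scores = text_feat @ speech_feat.T   # dot products, shape (m, n)
    # Assumption: normalise over the n spectrogram positions.
    attention = softmax(scores, axis=1)
    # Assumption: average over words to get one weight per position.
    position_weights = attention.mean(axis=0)                # (n,)
    # Extend stage: repeat each weight 2^P times to recover the width.
    extended = np.repeat(position_weights, 2 ** num_pool)    # (t,)
    return attention, spectrogram * extended[np.newaxis, :]


rng = np.random.default_rng(0)
f, t, P, c = 200, 400, 3, 32         # P and c are hypothetical
m, n = 40 // 2 ** P, t // 2 ** P     # 5 word positions, 50 time positions
spectrogram = rng.normal(size=(f, t))
text_feat = rng.normal(size=(m, c))
speech_feat = rng.normal(size=(n, c))

attention, attended = apply_attention(spectrogram, text_feat, speech_feat, P)
print(attention.shape, attended.shape)
```

The attended output keeps the shape of the input spectrogram, which matches the description of it as the input spectrogram with attention information added.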
After the model learns the representation of each feature for attention, the last CNN layer computes the weighted sum of all the information extracted from the attention input. The output vector is then fed into a fully connected softmax layer for classification.

Dataset
We use audio and text data from the CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset for all experiments [15]. The videos, totaling 23,141 files, are drawn from YouTube monologues by speakers covering various topics, and are gender balanced. Text embeddings were prepared using the GloVe word embedding method. Each word embedding is fixed at a length of 300. The duration of each word utterance is also provided by the P2FA forced alignment [15].

Data preprocessing
Raw speech signals are resampled from 44100 Hz to 8000 Hz and then converted to spectrograms using the Short-Time Fourier Transform (STFT) before being input into the attention network, as seen in Figure 3. A Hamming window is used during the STFT, and the length of each segment is 800. The transformed spectrogram is then converted to log scale so that its values are in units of dB, with a frame size of 200 x 400.
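The preprocessing steps can be illustrated with a minimal NumPy sketch of a Hamming-windowed STFT with a segment length of 800, followed by conversion to dB. The hop length of 400 samples, the small constant `eps`, and the function name `log_spectrogram` are assumptions for this example. Resampling from 44100 Hz and the final sizing to 200 x 400 are omitted, since the text does not specify how they are done; a synthetic 1 kHz tone at 8000 Hz stands in for real speech.

```python
import numpy as np


def log_spectrogram(signal, segment_length=800, hop=400, eps=1e-10):
    """Hamming-windowed STFT magnitude in dB, shape (freq_bins, frames).

    segment_length=800 follows the text; hop and eps are assumptions.
    """
    window = np.hamming(segment_length)
    n_frames = 1 + (len(signal) - segment_length) // hop
    frames = np.stack([
        signal[i * hop: i * hop + segment_length] * window
        for i in range(n_frames)
    ])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return (20.0 * np.log10(magnitude + eps)).T


sample_rate = 8000                       # Hz, after resampling
time = np.arange(2 * sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * time)   # synthetic 1 kHz test signal

spec = log_spectrogram(tone)
peak_bin = int(np.argmax(spec[:, 0]))
print(spec.shape, peak_bin * sample_rate / 800, "Hz")
```

With a segment length of 800 at 8000 Hz, each frequency bin spans 10 Hz and the bins cover 0 to 4 kHz, consistent with the frequency range stated for the input spectrograms.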

Experimental results
In this section, we describe the experimental methodology and report the recognition performance of the proposed attention network architecture on the CMU-MOSEI dataset [15].

Hyperparameters
Stochastic gradient descent with a fixed learning rate is employed during training. For regularization, dropout is applied to the last hidden layer. The system's hyperparameters are: 32 kernels with a kernel size of 3; a batch size of 32; a dropout rate of 0.1; a learning rate of 1e-3; a pool size of 2 with a stride of 2; and dense layers of 1024, 512, and 128 units after the final CNN, for all configurations.

Evaluation
For each experiment, we report an overall accuracy (each sentence across the dataset has an equal weight; weighted accuracy) and a class accuracy (first evaluated for each emotion and then averaged; unweighted accuracy). All the classification results are listed in Tables 1-2, including precision, recall, and f-1 score. Confusion matrices are also provided to show how well the model correctly classifies each emotion, using the top-1 class prediction as a metric.
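The difference between the two accuracy measures can be made concrete with a small sketch. The labels below are made-up toy data, not results from the experiments, and the function names are chosen for this example only.

```python
import numpy as np


def overall_accuracy(y_true, y_pred):
    """Weighted accuracy: every sample counts equally."""
    return float(np.mean(y_true == y_pred))


def class_accuracy(y_true, y_pred):
    """Unweighted accuracy: per-class accuracy, averaged over classes."""
    classes = np.unique(y_true)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(per_class))


# Toy, imbalanced labels: class 0 dominates.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 0, 2, 1])

wa = overall_accuracy(y_true, y_pred)   # 8 of 10 samples correct
ua = class_accuracy(y_true, y_pred)     # mean of 1.0, 0.5, 0.5
print(round(wa, 4), round(ua, 4))
```

On imbalanced data the overall accuracy is dominated by the majority class, while the class accuracy gives each emotion the same influence, which is why the two numbers can differ noticeably.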

Experiment 1: shallow model
In this section, we report the results of training the shallow model on the CMU-MOSEI dataset. Since the shallow model is a common method of multimodal emotion classification, and the simplest one, we use it as a baseline model for comparison.
The overall validation accuracy (weighted) is 83.11% and the class validation accuracy (unweighted) is 77.23%, as shown in Table 1. The multi-class confusion matrix is shown in Figure 5, with the highest accuracies for the angry and happy classes and the lowest for the fear and surprised classes.

Table 1 The results of shallow model

Experiment 2: attention model
In this section, we report the results of the attention model for comparison with the baseline results.
The overall accuracy (weighted) is 88.89% and the class accuracy (unweighted) is 84.08%, as shown in Table 2 for the attention model, an improvement of 5.78 and 6.85 percentage points, respectively, over the same metrics of the shallow model. According to the confusion matrix shown in Figure 6, validation accuracies have increased across all emotion classes compared to the baseline.

Table 2 The results of attention model

Discussion
Comparing the two models, the shallow model utilizes a superficial feature concatenation, while the attention model calculates a similarity between the two feature vectors that can be trained with learnable weights. In terms of the feature space, concatenating two feature vectors in the shallow model is essentially a simple increase in dimensionality. In the attention model, on the other hand, the feature space is fixed to the audio feature space. However, since the features now depend on a new variable called attention, the model can selectively utilize different features in the audio feature space to different extents for better classification.
In other words, text data now plays an important role in determining whether a speech feature is important or not in classifying certain emotions, an especially important benefit for training datasets with limited size or data balance.

Figure 6 Confusion matrix of attention model
In addition, correlation information between text and speech with respect to the time domain can easily be lost when shallow concatenation is utilized. Meanwhile, calculation of the attention matrix requires matrix multiplication between the embedded word and spectrogram features for a given time. Hence, time series information is retained in the calculated attention matrix through temporal embedding, and in the resulting attention-applied spectrogram. Since context and its vocal style of delivery play an important role in communicating emotion, retaining the time information is a clear benefit when classifying emotions from just speech and text.
Furthermore, while the shallow model is merely an analysis of a union of text and speech information, the proposed attention model aims to discover new, meaningful ways in which the two feature vectors intersect. In other words, the shallow model depends heavily on each single feature, while the attention model does not. This means that if each of the feature vectors contains inadequate information to begin with, the shallow model will perform much worse than the attention model. Since the attention model provides a newly discovered correlation between the two feature vectors, this new information can be used in ensemble with the original text and speech feature vectors.
Of course, attention models are not silver bullets for choosing the desired features and discarding the rest. Without careful training of the model, the distribution of the attention values can flatten out. For instance, if the input data contains too much padding, and the network has a large bias causing little optimization, the feature vector used to calculate the attention values will approach 0, and subsequently the attention values will also approach 0. One possible solution is to utilize loss masking on the padding of the input data so that a more dynamic softmax distribution in the attention matrix can be obtained.
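The flattening effect can be illustrated with a toy example. Note that the sketch below masks padded positions in the attention scores before the softmax, which is a related but different technique from the loss masking mentioned above; it is shown only as an assumption-laden illustration of how padding dilutes a softmax, and the scores and mask are made-up values.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    exp = np.exp(x - np.max(x))
    return exp / exp.sum()


def masked_softmax(scores, valid):
    """Softmax that assigns zero weight to padded positions.

    Assumes at least one position is valid.
    """
    return softmax(np.where(valid, scores, -np.inf))


# If every feature vector collapses to 0, all scores are 0 and the
# softmax is completely flat.
flat = softmax(np.zeros(6))

# Two real positions followed by four padded positions (score 0).
scores = np.array([2.0, 1.0, 0.0, 0.0, 0.0, 0.0])
valid = np.array([True, True, False, False, False, False])

plain = softmax(scores)
masked = masked_softmax(scores, valid)
print(np.round(flat, 3), np.round(plain, 3), np.round(masked, 3))
```

Without masking, the four padded positions absorb part of the probability mass; with masking, the weights are distributed only over the real positions and the distribution is sharper.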
It is worth noting that in both experiments, the f-1 scores of select classes, namely happy and angry, are much higher than those of the other classes. This is mainly due to a considerable class imbalance in the training set, in which ~44% of the data is happy and ~30% of the data is angry.

Conclusion
The attention model proposed for multimodal emotion recognition from speech and text data provides an effective method of learning the correlation between the two output feature vectors from separate yet jointly trained CNNs. This method is especially effective for correlation information between speech and text, because the context and the way it is delivered play a crucial role in affective communication, and the attention model retains temporal information well throughout the network. For future work, syncing the input text and speech data in the temporal dimension may help the attention network focus on learning the relationship between one speech segment and one word, instead of the relationship between the whole speech segment and the whole text segment.