Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos

Emotion recognition in conversations is crucial for the development of empathetic machines. Present methods mostly ignore the role of inter-speaker dependency relations while classifying emotions in conversations. In this paper, we address recognizing utterance-level emotions in dyadic conversational videos. We propose a deep neural framework, termed Conversational Memory Network (CMN), which leverages contextual information from the conversation history. In particular, CMN uses a multimodal approach comprising audio, visual, and textual features with gated recurrent units to model past utterances of each speaker into memories. These memories are then merged using attention-based hops to capture inter-speaker dependencies. Experiments show a significant improvement of 3-4% in accuracy over the state of the art.


Introduction
Development of machines with emotional intelligence has been a long-standing goal of AI. With the increasing infusion of interactive systems in our lives, the need for empathetic machines with emotional understanding is paramount. Previous research in affective computing has looked at dialogues as an essential basis to learn emotional dynamics (Sidnell and Stivers, 2012; Poria et al., 2017a; Zhou et al., 2017).
Since the advent of Web 2.0, dialogue videos have proliferated across the internet through platforms like movies, webinars, and video chats. Emotion detection from such resources can benefit numerous fields like counseling (De Choudhury et al., 2013), public opinion mining, financial forecasting (Xing et al., 2018), and intelligent systems such as smart homes and chatbots (Young et al., 2018).
In this paper, we analyze emotion detection in videos of dyadic conversations. A dyadic conversation is a form of dialogue between two entities. We propose a conversational memory network (CMN), which uses a multimodal approach for emotion detection in utterances (units of speech bounded by breaths or pauses) of such conversational videos.
Emotional dynamics in a conversation are known to be driven by two prime factors: self and inter-speaker emotional influence (Morris and Keltner, 2000; Liu and Maitlis, 2014). Self-influence relates to the concept of emotional inertia, i.e., the degree to which a person's feelings carry over from one moment to another (Koval and Kuppens, 2012). Inter-speaker emotional influence is another trait, where the other person acts as an influencer on the speaker's emotional state. Conversely, speakers also tend to mirror the emotions of their counterparts (Navarretta et al., 2016). Figure 1 provides an example from the dataset showing the presence of these two traits in a dialogue.
Existing works in the literature do not capitalize on these two factors. Context-free systems infer emotions based only on the current utterance in the conversation (Bertero et al., 2016). State-of-the-art context-based networks such as Poria et al., 2017b, use long short-term memory (LSTM) networks to model speaker-based context, but these suffer from an inability to summarize long-range context and from unweighted influence of the context, leading to model bias.
Our proposed CMN incorporates these factors by using the emotional context information present in the conversation history. It improves speaker-based emotion modeling by using memory networks, which are efficient at capturing long-term dependencies and summarizing task-specific details using attention models (Weston et al., 2014; Graves et al., 2014; Young et al., 2017).
Specifically, the memory cells of CMN are continuous vectors that store the context information found in the utterance histories. CMN also models the interplay of these memories to capture inter-speaker dependencies.
CMN first extracts multimodal features (audio, visual, and text) for all utterances in a video. In order to detect the emotion of a particular utterance, say u_i, it gathers its histories by collecting previous utterances within a context window. Separate histories are created for both speakers. These histories are then modeled into memory cells using gated recurrent units (GRUs).
After that, CMN reads both speakers' memories and employs an attention mechanism over them to find the historical utterances most useful for classifying u_i. The memories are then merged with u_i using an addition operation weighted by the attention scores. This is done to model inter-speaker influences and dynamics. The whole cycle is repeated for multiple hops, and finally this merged representation of utterance u_i is used to classify its emotion category. The contributions of this paper can be summarized as follows:

1. We propose an architecture, termed CMN, for emotion detection in dyadic conversations that considers the utterance histories of both speakers to model emotional dynamics. The architecture is extensible to multi-speaker conversations in formats such as textual dialogues or conversational videos.
2. When applied to videos, we adopt a multimodal approach to extract diverse features from utterances. This also makes our model robust to missing information.
3. CMN provides a significant increase in accuracy of 3-4% over previous state-of-the-art networks. One variant, termed CMN_self, which does not consider the inter-speaker relation in emotion detection, also outperforms the state of the art by a significant margin.
The remainder of the paper is organized as follows: Section 2 provides a brief literature review; Section 3 formalizes the problem statement; Section 4 describes the proposed method in detail; experimental results are covered in Section 5; finally, Section 6 provides concluding remarks.

Figure 1: An abridged dialogue from the dataset. Person A (wife) is leaving B (husband) for a work assignment. Initially, both A and B are driven by their own emotional inertia. In the end, emotional influence can be seen when B, despite being sad, reacts angrily to A's angry statement.

Related Works
Over the years, emotion recognition as an area of research has seen contributions from researchers across varied fields such as signal processing, machine learning, cognitive and social psychology, and natural language processing (Picard, 2010). Ekman, 1993, provided initial findings relating facial expressions to universal indicators of emotion. Rothkrantz, 2008, 2011, showed the importance of acoustic cues in affect modeling.
A large section of researchers approaches emotion recognition from a multimodal learning perspective. Hence, many works have used visual and audio features together for detecting affect (Busso et al., 2004; Castellano et al., 2008; Ranganathan et al., 2016). An in-depth review of the literature on these systems is provided by D'mello and Kory, 2015. Our work, which performs context-sensitive recognition, uses three modalities: audio, visual, and text. Recently, this combination of modalities has provided the best performance in affect recognition systems (Poria et al., 2017b; Wang et al., 2017; Tzirakis et al., 2017), thus motivating the use of a multimodal approach.
Previous works have focused on conversations as a resourceful event for emotion analysis. Ruusuvuori, 2013, provides an in-depth analysis of how emotions affect social interactions and conversations. In fact, significant works have attributed emotional dynamics to an interactive phenomenon, rather than being within-person and one-directional (Richards et al., 2003; Hareli and Rafaeli, 2008). Such emotional dynamics are modeled by observing transition properties. Yang et al., 2011, study patterns of emotion transitions and show evidence of emotional inertia. Xiaolan et al., 2013, use finite state machines to model transitions using stimuli and personality characteristics. Our work also tries to model emotional transitions using multimodal features. Unlike these works, however, we use memory networks to achieve the same.
The use of memory networks has been instrumental in the progress of multiple research problems, e.g., question-answering (Weston et al., 2014; Sukhbaatar et al., 2015; Kumar et al., 2016), machine translation, speech recognition (Graves et al., 2014), and commonsense reasoning. The repeated reads and writes to their memory cells are often coupled with attention modules, allowing them to filter out only the relevant memories.
Our model is loosely inspired by Sukhbaatar et al., 2015. Unlike their model, which directly encodes sentences into memories, we perform temporal sequence processing on our utterance histories using GRUs. We also extend their architecture to handle two speakers while keeping the possibility of adding more. Finally, our model differs in that we use multimodal features for input and processing.

Task Definition
Our goal is to infer the emotion of utterances present in a dyadic conversation. Let us define a dyadic conversation to be an asynchronous exchange of utterances between two persons P_a and P_b. The speakers produce utterance sequences U_a and U_b, respectively. Here, U_λ = (s_λ^1, s_λ^2, ..., s_λ^{l_λ}) is ordered temporally, where s_λ^i is the i-th utterance by P_λ and l_λ is the total number of utterances spoken by person P_λ, λ ∈ {a, b}. Overall, the utterances of both speakers can be linearly ordered by temporal occurrence as a single sequence (u_1, u_2, ..., u_{l_a + l_b}).

Our model takes as input an utterance u_i whose emotion category (Section 5.1) needs to be classified. To get its history, the preceding K utterances of each person are collected separately as hist_a and hist_b. Here, K serves as the length of the context window for the history of u_i. For λ ∈ {a, b}, hist_λ is also ordered temporally. At the beginning of a conversation, histories may contain fewer than K utterances, i.e., |hist_λ| < K.
In the remaining sections, for brevity, we explain the processes using a subscript λ which can instantiate to either a or b, i.e., λ ∈ {a, b}.
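The history-gathering step above can be sketched in a few lines of plain Python. This is a minimal illustration with a hypothetical helper name (`gather_histories`); it assumes utterances arrive as a temporally ordered list with a parallel list of speaker tags.

```python
# Hypothetical sketch of collecting hist_a and hist_b for a query utterance u_i:
# the K most recent preceding utterances of each speaker, oldest first.
def gather_histories(utterances, speakers, i, K):
    """utterances: temporally ordered list; speakers: parallel 'a'/'b' tags.
    Returns (hist_a, hist_b): up to K utterances preceding index i, per speaker."""
    hist = {"a": [], "b": []}
    for j in range(i):                      # only utterances strictly before u_i
        hist[speakers[j]].append(utterances[j])
    # keep only the K most recent utterances of each speaker
    return hist["a"][-K:], hist["b"][-K:]

utts = ["u1", "u2", "u3", "u4", "u5"]
spk  = ["a",  "b",  "a",  "b",  "a"]
hist_a, hist_b = gather_histories(utts, spk, 4, K=2)
```

At the start of a conversation the returned histories are simply shorter than K, matching the |hist_λ| < K case; CMN pads such histories with null utterances (Section 5.3).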

Approach
We start by detailing the multimodal feature extraction scheme for all utterances followed by the mechanism to model emotional context using memory networks.

Multimodal Feature Extraction
The first phase of CMN is to extract multimodal features for all utterances in the conversations. The dyadic conversations are available in the form of videos, so each utterance of a particular conversation is a small segment of the full video. For each utterance, we extract features for three modalities: audio, visual, and text. The feature extraction process for each modality is described below.

Textual Features Extraction
We extract features from the transcript of an utterance video using convolutional neural networks (CNNs). CNNs are effective in learning high level abstract representations of sentences from constituting words or n-grams (Kalchbrenner et al., 2014). To get our sentence representation, we use a simple CNN with one convolutional layer followed by max-pooling (Kim, 2014;Poria et al., 2016).
Specifically, the convolution layer consists of filters of sizes 3, 4, and 5, with 50 feature maps each. Max-pooling is employed on these feature maps with a pooling window of size 2. Finally, a fully connected layer with 100 neurons is used. The activations of this layer form our sentence representation t_u.
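The text branch above can be sketched with numpy. This is a shape-level illustration only: weights are random stand-ins for parameters that CMN would learn end-to-end, and the size-2 pooling is collapsed into a global max over time to keep the sketch short.

```python
import numpy as np

# Sketch of the textual branch: convolutions with filter sizes 3/4/5
# (50 feature maps each), pooling, then a 100-d dense layer.
rng = np.random.default_rng(0)
n_words, emb = 20, 300                      # a 20-word utterance, 300-d embeddings
x = rng.standard_normal((n_words, emb))

def conv_branch(x, size, n_maps=50):
    W = rng.standard_normal((n_maps, size * emb)) * 0.01
    # sliding n-gram windows flattened into vectors
    windows = np.stack([x[i:i + size].ravel() for i in range(len(x) - size + 1)])
    fmap = np.maximum(windows @ W.T, 0.0)   # ReLU feature maps: (n_windows, n_maps)
    # the paper pools with window 2; a global max over time keeps the sketch short
    return fmap.max(axis=0)

feats = np.concatenate([conv_branch(x, s) for s in (3, 4, 5)])   # 150-d
W_fc = rng.standard_normal((100, feats.size)) * 0.01
t_u = np.tanh(W_fc @ feats)                  # 100-d sentence representation
```

The three filter sizes act as trigram/4-gram/5-gram detectors; concatenating their pooled outputs gives a fixed-size vector regardless of utterance length.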

Audio Feature Extraction
To extract audio features, we use openSMILE, an open-source toolkit that provides high-dimensional audio vectors. These vectors comprise features such as loudness, Mel-spectra, MFCCs, and pitch. Audio features play a significant role in providing information on the emotional state of a speaker (Song et al., 2004).
In fact, the literature shows that a high correlation exists between many statistical measures of speech and speakers' emotions. For example, high pitch and a fast speaking rate often denote anger, while sadness is associated with a low standard deviation of pitch and a slow speech rate (Dellaert et al., 1996; Amir, 1998). In this work, we use the IS13 ComParE configuration file, which extracts a total of 6373 features for each utterance video. Z-standardization is performed for voice normalization, and the dimension of the audio vector is reduced to 100 using a fully-connected neural layer. This provides the final audio feature vector a_u.
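The normalization and projection steps above amount to two lines of numpy. The raw vector and projection weights below are random stand-ins (in CMN, the 6373-d vector comes from openSMILE and the projection is a learned layer).

```python
import numpy as np

# Sketch of the audio pipeline: z-standardize a 6373-d feature vector and
# project it to 100 dimensions with a dense layer (random weights here).
rng = np.random.default_rng(1)
raw = rng.standard_normal(6373) * 5 + 2          # stand-in for openSMILE output
z = (raw - raw.mean()) / raw.std()               # z-standardization: mean 0, std 1
W = rng.standard_normal((100, 6373)) * 0.01
a_u = np.tanh(W @ z)                             # final 100-d audio vector a_u
```

Z-standardization removes per-speaker loudness/pitch offsets before the learned projection, which is why it is described as voice normalization.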

Visual Feature Extraction
Facial expressions and visual surrounding provide rich emotional indicators. We use a 3D-CNN to capture these details from the utterance video. Apart from the benefits of extracting relevant features from each image frame, 3D-CNN also extracts spatiotemporal features across frames (Tran et al., 2015). This leads to the identification of emotional expressions like a smile or frown.
The working of a 3D-CNN is identical to its 2D counterpart, with the input being a video v of dimension (3, f, h, w). Here, 3 represents the RGB channels, and f, h, w are the number of frames, the height, and the width of each frame, respectively. For the convolution operation, a 3D filter f_l of dimension (f_m, f_d, f_h, f_w) is used, whose components represent the number of feature maps and the depth, height, and width of the filter, respectively. Max-pooling is applied to the output of this convolution across a 3D sliding window of dimension (m_p, m_p, m_p).
In our model, we use 128 feature maps with 3D filters of size 5. For pooling, we set m_p to 3, and the output is fed to a fully connected layer with 100 neurons. All values are decided using hyperparameter tuning (see Section 5). For the input utterance, the activations of this layer form the video representation v_u.
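A toy numpy version of one 3D-convolution step may clarify the tensor shapes. The clip is tiny and the filter count is reduced here (CMN uses 128 maps and a 100-neuron dense layer on top); weights are random stand-ins.

```python
import numpy as np

# Toy sketch: a (3, f, h, w) clip convolved with cubic filters of size 5,
# ReLU, then a 3x3x3 max-pooling window.
rng = np.random.default_rng(2)
clip = rng.standard_normal((3, 8, 8, 8))         # (channels, frames, h, w)
n_maps, k, mp = 4, 5, 3
filt = rng.standard_normal((n_maps, 3, k, k, k)) * 0.1

out_d = 8 - k + 1                                # valid convolution: 4
conv = np.zeros((n_maps, out_d, out_d, out_d))
for m in range(n_maps):
    for t in range(out_d):
        for i in range(out_d):
            for j in range(out_d):
                patch = clip[:, t:t + k, i:i + k, j:j + k]
                conv[m, t, i, j] = np.maximum((patch * filt[m]).sum(), 0.0)

# one 3x3x3 max-pooling window (a single window fits the 4^3 map here)
pooled = conv[:, :mp, :mp, :mp].reshape(n_maps, -1).max(axis=1)
```

Because the filter slides across frames as well as pixels, each output responds to short motion patterns (e.g., a forming smile) rather than single static frames.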

Fusion:
We perform feature-level fusion to map the individual modalities to a joint space. This is done through simple feature concatenation: the extracted features t_u, a_u, and v_u are joined to form the utterance representation u = [t_u; a_u; v_u] of dimension d_in = 300. This multimodal representation is generated for all utterances in a conversation.
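Concatenation-based fusion reduces to a single call; the constant vectors below are placeholders for the three 100-d modality features.

```python
import numpy as np

# Feature-level fusion as described above: concatenate the three 100-d
# modality vectors into one 300-d utterance representation u = [t_u; a_u; v_u].
t_u, a_u, v_u = np.ones(100), 2 * np.ones(100), 3 * np.ones(100)
u = np.concatenate([t_u, a_u, v_u])
```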
The literature contains numerous fusion techniques for multimodal data (Atrey et al., 2010; Zadeh et al., 2017; Poria et al., 2017c). Exploring these within CMN, however, is beyond the scope of this paper and is left as future work.

Conversational Memory Network
For classifying the emotion of an utterance u_i, its corresponding histories (hist_a and hist_b) are taken. Each history hist_λ contains the preceding K utterances by person P_λ (see Section 3). Both u_i and the utterances in the histories are represented by their multimodal feature vectors of dimension d_in (Figure 2).
The histories are first modeled into memory cells using GRUs. This provides the memories with context information summarized by the GRU. We call this step memory representation. Following cognitive evidence of self-emotional dynamics, we model separate memory cells for each person; thus, identical but separate computations are performed on both histories. From these memories, content relevant to utterance u_i is then filtered out using an attention mechanism over multiple input/output hops. At each hop, both memories are accumulated and merged with u_i to model inter-speaker emotional dynamics. First, we describe our model as a single-layer memory network that runs one hop over the memories.

Single Layer
Here, we explain the representation scheme of the memories for both histories, the input/output operations on them, and the attention mechanism. The memory representation for each history is generated using a GRU that models emotion transitions. First, we define the GRU cell.

Gated Recurrent Unit: GRUs are a gating mechanism for recurrent neural networks introduced by Cho et al. (2014). Similar to an LSTM (Hochreiter and Schmidhuber, 1997), a GRU offers comparable performance with simpler computation. At any timestep t, it utilizes two gates, r_t (reset gate) and z_t (update gate), to control the combination of the current input utterance u_t and the previous hidden state s_{t-1}.
The new state s_t is computed as:

r_t = σ(W_r u_t + V_r s_{t-1} + b_r)
z_t = σ(W_z u_t + V_z s_{t-1} + b_z)
ŝ_t = tanh(W_s u_t + V_s (r_t ⊗ s_{t-1}) + b_s)
s_t = (1 - z_t) ⊗ s_{t-1} + z_t ⊗ ŝ_t

Here, the W, V terms are parameter matrices, the b terms are parameter vectors, and ⊗ represents element-wise multiplication. The above equations can be summarized as s_t = GRU(s_{t-1}, u_t).

Figure 2: Overview of CMN. Histories of both speakers are modeled into memories M_λ for all R hops; attention-based filtering is performed over multiple memory hops; finally, Person A's utterance u_i is classified to predict its emotion category.

Memory Representation: For each λ ∈ {a, b}, a memory representation M_λ = [m_λ^1, ..., m_λ^K] for hist_λ is generated using a GRU. To grasp the temporal context, the K utterances in hist_λ are framed as a sequence (starting from the oldest one) and fed to GRU_λ. At each timestep t ∈ [1, K], GRU_λ's internal state s_t forms the t-th memory cell m_λ^t of the memory representation M_λ.

Memory Input: This step takes the memory representation M_λ and performs an attention mechanism on it, resulting in an attention vector p_λ ∈ ℝ^K. First, the current utterance u_i is embedded into a vector q_i of dimension d using a projection matrix B ∈ ℝ^{d×d_in}, i.e., q_i = B u_i. To find the relevance of each memory cell m_λ^t's context to q_i, a match between the two is computed by taking an inner product:

p_λ^t = softmax(q_i^T m_λ^t)

Here, softmax(x_i) = e^{x_i} / Σ_j e^{x_j}, and the attention vector p_λ = {p_λ^t} is a probability distribution over the input memories M_λ = {m_λ^t} for t ∈ [1, K].
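The GRU step and the encoding of a K-utterance history into memory cells can be sketched with numpy. Weights are random stand-ins for the learned parameters of GRU_λ; the gate equations follow the standard GRU formulation used in the text.

```python
import numpy as np

# Numpy sketch of one GRU step and of encoding a K=5 history into memories.
rng = np.random.default_rng(5)
d_in, d = 300, 50
sig = lambda x: 1.0 / (1.0 + np.exp(-x))
W = {g: rng.standard_normal((d, d_in)) * 0.01 for g in "rzs"}
V = {g: rng.standard_normal((d, d)) * 0.01 for g in "rzs"}
b = {g: np.zeros(d) for g in "rzs"}

def gru_step(u_t, s_prev):
    r = sig(W["r"] @ u_t + V["r"] @ s_prev + b["r"])        # reset gate
    z = sig(W["z"] @ u_t + V["z"] @ s_prev + b["z"])        # update gate
    s_hat = np.tanh(W["s"] @ u_t + V["s"] @ (r * s_prev) + b["s"])
    return (1 - z) * s_prev + z * s_hat                      # new state s_t

# feed the K utterances oldest-first; each hidden state becomes one memory cell
s = np.zeros(d)
memories = []
for _ in range(5):
    s = gru_step(rng.standard_normal(d_in), s)
    memories.append(s)
```

Using every intermediate hidden state as a memory cell (rather than only the last one) is what lets the later attention step select any point in the history.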
Memory Output: The attention weights are then applied to the memory cells and summed, o_λ = Σ_{t=1}^{K} p_λ^t m_λ^t. Thus, the output representation o_λ contains a weighted contextual summary accumulated from the memory.
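The full one-hop read for a single speaker λ is a projection, a softmax, and a weighted sum. The sketch below uses random stand-in values for the learned matrix B and the GRU-encoded memories.

```python
import numpy as np

# One-hop memory read, following the equations above:
# q_i = B u_i; p_t = softmax(q_i . m_t); o = sum_t p_t m_t.
rng = np.random.default_rng(3)
K, d_in, d = 5, 300, 50
M = rng.standard_normal((K, d))                  # GRU-encoded memory cells
B = rng.standard_normal((d, d_in)) * 0.01
u_i = rng.standard_normal(d_in)

q = B @ u_i                                      # project the query utterance
scores = M @ q                                   # inner-product match per cell
p = np.exp(scores - scores.max()); p /= p.sum()  # softmax attention vector
o = p @ M                                        # weighted contextual summary o
```

Subtracting `scores.max()` before exponentiating is the usual numerically stable softmax and does not change the attention distribution.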
Final Prediction: To generate the prediction for the current utterance u_i, we combine the output representations of both persons, o_a and o_b, with u_i's representation q_i and apply an affine transformation with matrix W_o. A softmax over this final vector gives the emotion prediction ŷ_i. Categorical cross-entropy is used as the loss:

L = -(1/N) Σ_{i=1}^{N} Σ_{j=1}^{C} y_{i,j} log ŷ_{i,j}

Here, N denotes the total number of utterances across all videos and C is the number of emotion categories; y_i is the one-hot ground-truth vector of the i-th utterance from the training set, and ŷ_{i,j} is its predicted probability of belonging to class j.

Multiple Layers
Many recent works on memory networks adopt a multiple-hop scheme. This repeated input and output cycle over the memories, coupled with a soft attention module, leads to a refined representation of the memories (Sukhbaatar et al., 2015; Kumar et al., 2016). Motivated by these works, we extend our model to perform R hops on the memories by stacking the single-hop layers (Section 4.2.1) as follows:
• At a particular hop r, the input memory M_λ is shared with the output memory of the previous hop. This constraint of sharing parameters between adjacent layers is added to reduce the total number of parameters and to ease training.
• At every hop r, the query utterance u_i's representation q_i is updated as q_i^{r+1} = q_i^r + o_a^r + o_b^r.
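The multi-hop loop above can be sketched end to end. Memories and weights are random stand-ins here, and for simplicity a single set of parameters is reused across hops; the query update per hop matches q^{r+1} = q^r + o_a^r + o_b^r.

```python
import numpy as np

# Sketch of R attention hops over both speakers' memories, ending with a
# softmax emotion prediction (4 classes assumed, as in the experiments).
rng = np.random.default_rng(4)
K, d, R = 5, 50, 3
M = {s: rng.standard_normal((K, d)) for s in ("a", "b")}
q = rng.standard_normal(d)                       # initial query q_i

def read(memory, query):
    s = memory @ query
    p = np.exp(s - s.max()); p /= p.sum()        # softmax attention
    return p @ memory                            # weighted memory summary o

for _ in range(R):
    q = q + read(M["a"], q) + read(M["b"], q)    # merge both speakers per hop

W_o = rng.standard_normal((4, d)) * 0.01         # affine transform to classes
logits = W_o @ q
probs = np.exp(logits - logits.max()); probs /= probs.sum()
```

Each hop re-attends with the updated query, so later hops can focus on memory cells that only become relevant once earlier context has been absorbed.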

Dataset
We perform experiments on the IEMOCAP dataset (http://sail.usc.edu/iemocap/) (Busso et al., 2008). It is a multimodal database of 10 speakers (5 male and 5 female) involved in two-way dyadic conversations. A pair of speakers is given multiple conversation scenarios, which are grouped into a single session. All the conversations are segmented into utterances. Each utterance is annotated with one of the following emotion categories: anger, happiness, sadness, neutral, excitement, frustration, fear, surprise, and other. In our experiments, however, we consider only the first four categories, so as to compare our method with state-of-the-art frameworks (Poria et al., 2017b; Rozgic et al., 2012). The dataset provides rich video and audio samples for all the utterances, along with transcriptions.

Apart from these emotional states, we also investigate the valence and arousal degrees of each utterance. IEMOCAP provides labels for both attributes on a 5-point Likert scale. Following Aldeneh et al., 2017, we convert the attributes into 3 categories, namely low (≤ 2), medium (> 2 and < 4), and high (≥ 4).

The dataset configuration for the experiments is obtained from Poria et al. (2017b). The first 8 speakers (Sessions 1-4) compose the training fold, while the last session is used as the testing fold. Overall, the training and testing sets comprise 4290 utterances (120 conversational videos) and 1208 utterances (31 conversational videos), respectively. There is no speaker overlap between the training and testing sets, making the model person-independent.
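The 3-way binning of the Likert-scale attributes is a small piece of logic worth pinning down, since the boundary cases matter (2 is low, 4 is high):

```python
# Binning of IEMOCAP's 5-point valence/arousal ratings as used above:
# low (<= 2), medium (2 < x < 4), high (>= 4).
def bin_attribute(score):
    if score <= 2:
        return "low"
    if score < 4:
        return "medium"
    return "high"

labels = [bin_attribute(s) for s in (1.5, 2.0, 3.0, 4.0, 4.5)]
```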

Emotional Influence Patterns
In this section, we explore the dataset to check for the existence of emotional influences. Figure 3a presents the emotion sequences of two videos sampled from the dataset. Both videos show the presence of self and inter-speaker emotional influences, and visual exploration of videos from the dataset reveals many such instances in the conversations. To provide quantitative evidence of the emotional influence patterns, we curate a non-exhaustive list of possible cases of influence. For all utterances in the dataset, we sample their histories by setting K = 5, i.e., five previous utterances (as per availability) from both speakers.
Cases 1 and 2 (Figure 3) represent scenarios where the emotion of the current utterance is influenced by the self or by the other person, respectively. In case 3, the utterance has relevant content in histories that do not immediately precede it; an effective attention mechanism provides the capability to capture this pattern. Finally, case 4 presents the situation where the utterance is independent of the history. Such situations are indicated by the content of the utterance, which often deviates from the previous topic of discussion or introduces new information. Table 1 presents a statistical summary of these cases in the dataset. From the table, it can be seen that a large section of the dataset demonstrates these influence patterns, which motivates modeling them explicitly. We thus hypothesize that models able to capture these cases will have superior emotion inference capabilities.
This passive exploration is a label-based analysis performed as a sanity check. Needless to say, some false-positive patterns at the label level are inevitable. Our model CMN, by contrast, is content-based, which enables it to mine intricate patterns from the utterance histories.

Training Details
We use 10% of the training set as a held-out validation set for hyperparameter tuning. To optimize the parameters, we use the Stochastic Gradient Descent (SGD) optimizer, starting with an initial learning rate (lr) of 0.01. An annealing approach halves the lr every 20 epochs, and termination is decided by an early-stopping measure with a patience of 12, monitoring the validation loss. Gradient clipping is used for regularization with the norm set to 40. Hyperparameters are decided using Random Search (Bergstra and Bengio, 2012). Based on validation performance, the context window length K is set to 40 and the number of hops R is fixed at 3. If K previous utterances are unavailable, null utterances are added at the beginning of the history sequence. The dimension d of the memory cells is set to 50.

Table 1: Percentage of occurrence of the different cases in the dataset, as described in Section 5.2. All cases are analyzed with K = 5. An utterance whose history has at least 3 similar emotion labels in either its own history or the history of the other person is counted in case 1 or 2, respectively. Case 3 is considered when the utterance's emotion is found in at least 3 utterances occurring before the second past-utterance of each history. Case 4 is considered when no history contains the emotion label of the current utterance.
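The learning-rate schedule described above (start at 0.01, halve every 20 epochs) can be written as a one-line function; early stopping with patience 12 is omitted from this sketch.

```python
# Step-decay schedule from the training details: lr halves every 20 epochs.
def learning_rate(epoch, lr0=0.01, halve_every=20):
    return lr0 * (0.5 ** (epoch // halve_every))

lrs = [learning_rate(e) for e in (0, 19, 20, 40)]
```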

Baselines
We compare CMN with the following baselines:

SVM-ensemble: A strong context-free benchmark model that uses a similar multimodal approach on an ensemble of trees, where each node represents a binary support vector machine (SVM) (Rozgic et al., 2012).

bc-LSTM:
A bi-directional LSTM equipped with hierarchical fusion, proposed by Poria et al., 2017b. It is the present state-of-the-art method. The model extracts context features with unimodal LSTMs, whose concatenation is fed to a final LSTM for classification. For a fair comparison in an end-to-end learning paradigm, we remove the penultimate SVM of this model. The model does not accommodate inter-speaker dependencies.
Memn2n: The original memory network as proposed by Sukhbaatar et al., 2015. In contrast to CMN, this model generates the memory representation for each historical utterance using the embedding matrix B of the Memory Input step, without sequential modeling. Thus, for utterance u_i, both memories are created as M_λ with m_λ^t = B u_t for u_t ∈ hist_λ and t ∈ [1, K], λ ∈ {a, b}.

CMN_self: In this baseline, we use only the self history for classifying the emotion of utterance u_i. Thus, if u_i is spoken by person P_a, then only hist_a is considered. Clearly, this variant is also incapable of modeling inter-speaker dependencies.
CMN_NA: A single-layer variant of CMN with no attention module. Its output o_λ is generated using a uniform probability distribution p_λ, i.e., p_λ^t = 1/K for t ∈ [1, K].

Table 2: Comparison of CMN and its variants with state-of-the-art models (Section 5.2.2) (Poria et al., 2017b). All results use multimodal features. We report scores using weighted accuracy (WAA) and unweighted recall (UAR). UAR is a popular metric used when dealing with imbalanced classes (Rosenberg, 2012). Results are an average of 10 runs with varied weight initializations. We assert significance when p < 0.05 under McNemar's test. †: significantly better than bc-LSTM.

CMN succeeds over both neural (Poria et al., 2017b) and SVM-based (Rozgic et al., 2012) methods, by 3.3% and 8.12%, respectively. Improvement in performance is seen for all emotions over the ensemble-SVM method. A similar trend is seen with bc-LSTM (Poria et al., 2017b), where our model does especially well on the active emotions happiness and anger. This trend suggests that CMN is capable of capturing inter-speaker emotional influences, which are often present with such active emotions.

Results
The importance of sequential processing of the histories using a recurrent neural network (in our case, a GRU) is evidenced by the poorer performance of Memn2n relative to CMN. This suggests that gathering contexts temporally through sequential processing is indeed superior to non-temporal memory representations. CMN_self, which uses only a single history channel, also performs worse than CMN, signifying the role of inter-speaker influences that often moderate the emotions of the current utterance. Overall, predictions on valence and arousal levels show similar results, which reinforces our hypothesis of CMN's ability to model emotional dynamics.

Hyperparameters: Figure 4 provides a summary of the performance trend of our model for different values of the hyperparameters K (context window length) and R (number of hops). In the first graph, as K increases, more past utterances are provided to the model as memories. The performance maintains a positive correlation with K. This trend supports our intuition that the historical context acts as an essential resource for modeling emotional dynamics; given enough history, the performance saturates. The second graph shows that multiple hops over the histories indeed lead to an improvement in performance. The attention-based filtering in each hop provides a refined context representation of the histories. Models with hops in the range of 3-10 outperform the single-layer variant. However, each added hop contributes a new set of parameters for memory representation, increasing the total parameter count and making the model susceptible to overfitting. This effect is evidenced in the figure, where higher hop counts lead to a dip in performance.

Figure 5: Average attention vectors across 3 hops over both memories for a given test utterance; (b) correct label: anger.

A likely reason for this shift is the improved representational scheme of the textual modality. Text tends to carry fewer noisy signals than audio-visual sources, thus providing better features in the joint representation. Overall, multimodal systems outperform the unimodal variants, justifying the design of CMN as a multimodal system. Table 3 also showcases the superiority of CMN and its variants over bc-LSTM. The proposed model achieves better performance than the state of the art in all the unimodal and multimodal segments. This asserts the importance of the memory-network framework and its ability to effectively store context information.
Role of Attention: Attention module plays a vital role in memory refinement. This is also observed in Table 2, where CMN N A provides inferior performance over CMN. With the uniform weight, all the memory cells in both memories M a and M b equally contribute to the output representation. This incorporates irrelevant information from the perspective of emotional context.
Case Study: We perform qualitative visualization of the attention module by applying it on the testing set. Figure 5a represents a conversation where both the speakers are in an excited and jolly mood. Person A, in particular, drives the dialogue with less influence from Person B. To classify the test utterance of A, the attention module of CMN successfully focuses on the utterances 1, 3, 5 which had triggered the speaker's positive mood in the video. This shows CMN's capacity to model speaker-based emotions. Also, at the textual level, utterances 3 and 6 do not seem to depict a happy mood. However, audio and visual sources provide contrasting evidence which helps CMN to correctly model them as utterances spoken with happiness. This shows the advantage of a multimodal system.
In Figure 5b, we revisit the dialogue presented in Figure 1. As shown, Person A converses in a sad mood (utterances 1, 3, and 5 in Fig 5b), weighed down by the grief of his wife's departure. But when he expresses his inhibitions, his wife B reacts in an angry and sarcastic manner (utterance 7). This ignites an emotional shift for A, who then replies angrily. In this example, CMN is able to focus on utterance 7, spoken by B, to anticipate that A's test utterance will be an angry statement, thus showing its ability to model inter-speaker influences. However, there are cases where our model fails, e.g., in the absence of historical utterances, as this forces the attention to focus on null memories.

Conclusion
In this paper, we presented a deep neural framework that identifies the emotions of utterances in dyadic conversational videos. Our results suggest that leveraging context information from utterance histories and representing them as memories indeed helps to better recognize emotions. Performing speaker-specific modeling and considering inter-speaker influences also helps in capturing emotional dynamics. This work also showed the importance of the attention mechanism in filtering relevant contextual information from utterance histories and, hence, paves the path toward more efficient and human-like dialogue systems.