ICON: Interactive Conversational Memory Network for Multimodal Emotion Detection

Emotion recognition in conversations is crucial for building empathetic machines. Present works in this domain do not explicitly consider the inter-personal influences that thrive in the emotional dynamics of dialogues. To this end, we propose Interactive COnversational memory Network (ICON), a multimodal emotion detection framework that extracts multimodal features from conversational videos and hierarchically models the self- and inter-speaker emotional influences into global memories. Such memories generate contextual summaries which aid in predicting the emotional orientation of utterance-videos. Our model outperforms state-of-the-art networks on multiple classification and regression tasks in two benchmark datasets.


Introduction
Emotions play an important role in our daily life. A long-standing goal of AI has been to create affective agents that can detect and comprehend emotions. Research in affective computing has mainly focused on understanding affect (emotions and sentiment) in monologues. However, with increasing interactions of humans with machines, researchers now aim at building agents that can seamlessly analyze affective content in conversations. This can help in creating empathetic dialogue systems, thus improving the overall human-computer interaction experience (Young et al., 2018).
Analyzing emotional dynamics in conversations, however, poses complex challenges. This is due to the presence of intricate dependencies between the affective states of speakers participating in the dialogue. In this paper, we address the problem of emotion recognition in conversational videos. We specifically focus on dyadic conversations where two entities participate in a dialogue.
We propose Interactive COnversational memory Network (ICON), a multimodal network for identifying emotions in utterance-videos. Here, utterances are units of speech bounded by breaths or pauses of the speaker. Emotional dynamics in conversations consist of two important properties: self and inter-personal dependencies (Morris and Keltner, 2000). Self-dependencies, also known as emotional inertia, deal with the aspect of emotional influence that speakers have on themselves during conversations (Kuppens et al., 2010). On the other hand, inter-personal dependencies relate to the emotional influences that the counterparts induce into a speaker. Conversely, during the course of a dialogue, speakers also tend to mirror their counterparts to build rapport (Navarretta et al., 2016). Figure 1 demonstrates a sample conversation from the dataset involving both self and interpersonal dependencies. While most conversational frameworks only focus on self dependencies, ICON leverages both such dependencies to generate affective summaries of conversations. First, it extracts multimodal features from all utterancevideos. Next, given a test utterance to be classified, ICON considers the preceding utterances of both speakers falling within a context-window and models their self-emotional influences using local gated recurrent units (GRUs).
Furthermore, to incorporate inter-speaker influences, a global representation is generated using a GRU that intakes output of the local GRUs. For each instance in the context-window, the output of this global GRU is stored as a memory cell. These memories are then subjected to multiple read/write cycles that include attention mechanism for generating contextual summaries of the conversational history. At each iteration, the representation of the test utterance is improved with this summary representation and finally used for prediction. I don't think I can do this anymore.
[ frustrated ] Well I guess you aren't trying hard enough. [ neutral ] Its been three years. I have tried everything.
[ frustrated ] Maybe you're not smart enough.
[ neutral ] Just go out and keep trying.
[ neutral ] I am smart enough. I am really good at what I do. I just don't know how to make someone else see that.
[anger] Pa is frustrated over her long term unemployment and seeks encouragement (u1, u3). P b , however, is pre-occupied and replies sarcastically (u4). This enrages Pa to appropriate an angry response (u6). In this dialogue, emotional inertia is evident in P b who does not deviate from his nonchalant behavior. Pa, however, gets emotionally influenced by her counterpart. This influence is content-based, not label-based.

Person B Person
The contributions of this paper are as follows: • We propose ICON, a novel model for emotion recognition that incorporates self and interspeaker influences in a dialogue. Memory networks are used to model contextual summaries for prediction.
• We introduce a multimodal approach that provides comprehensive features from modalities such as language, visual, and audio in utterancevideos.
• ICON can be considered as a generic framework for conversational modeling that can be extended to multi-party conversations.
• Experiments on two benchmark datasets show that ICON significantly outperforms existing models on multiple discrete and continuous emotional categories.
The remainder of the paper is organized as follows: Section 2 presents related works; Section 3 formalizes the problem statement and Section 4 describes our proposed approach; Section 5 provides details on experimental setup; Section 6 reports the results and related analysis; finally, Section 7 concludes the paper.

Related Works
Emotion recognition is an interdisciplinary field of research with contributions from psychology, cognitive science, machine learning, natural language processing, and others (Picard, 2010). Initial research in this area primarily involved visual and audio processing (Ekman, 1993;Datcu and Rothkrantz, 2008). The role of text in emotional analysis became evident with later research such as Alm et al. (2005); Strapparava and Mihalcea (2010). Current research in this domain is mainly performed from a multimodal learning perspective (Poria et al., 2017a;Baltrušaitis et al., 2018). Numerous previous approaches have relied on fusion techniques that leverage multiple modalities for affect recognition (Soleymani et al., 2012;Tzirakis et al., 2017;Zadeh et al., 2018b).
Understanding conversations is crucial for machines to replicate human language and discourse. Emotions play an important role in shaping such social interactions (Ruusuvuori, 2013). Richards et al. (2003) attribute emotional dynamics to be an interactive phenomena, rather than being withinperson. We utilize this trait in the design of our model that accommodates inter-personal dynamics. Being a temporal event, context also plays an important role in conversational analysis. Poria et al. (2017b) use contextual information from neighboring utterances of the same speaker to predict emotions. However, there is no provision to model interactive influences. Work by Yang et al., 2011;Xiaolan et al., 2013 stresses the study of patterns for emotion transitions. In contrast, we posit the use of utterance content to model context with multimodal features.
In the literature, memory networks have been successfully applied in many areas, including question-answering (Weston et al., 2014;Sukhbaatar et al., 2015;Kumar et al., 2016), machine translation , speech recognition (Graves et al., 2014), and others. In emotional analysis, Zadeh et al. (2018a) propose a memory-based sequential learning for multi-view signals. Although we utilize memory networks, our work is different as we use memories to encode whole utterances. Also, each memory cell in our network is processed using GRUs to capture temporal dependencies. This technique deviates from the traditional use of embedding matrices to encode information into memory cells.
ICON builds on our previous research (Hazarika et al., 2018) that used separate memory networks for both interlocutors participating in a dyadic conversation. In contrast, ICON adopts an interactive scheme that actively models inter-speaker emotional dynamics with fewer trainable parameters.

Problem Setting
Let us define a conversation U to be a set of asynchronous exchange of utterances between two persons P a and P b over time. With T utterances, U = {u 1 , u 2 , ..., u T } is a totally ordered set which can be arranged as a sequence (u 1 , ..., u T ) based on temporal occurrence. Here, each utterance u i is spoken by either P a or P b . Furthermore, for each λ ∈ {a, b}, U λ denotes person P λ 's individual utterances in U , i.e., U λ = {u i u i ∈ U and u i spoken by P λ , ∀i ∈ [1, U ]}. This provides two sets of utterances for both the respective speakers, such that U = U a ∪ U b .
Our aim is to identify the emotions of utterances in conversational videos. At each time step t ∈ [1, T ] of video U , our model is provided with the utterance spoken at that time, i.e. u t , and tasked to predict its emotion. Moreover, we also utilize the previous utterances within U spoken by both persons. Considering a context-window of size K, the preceding utterances of P a and P b (starting with the most recent) within this context-window can be represented by H a and H b , respectively. Formally, for each λ ∈ {a, b}, H λ is created as, Table 1 provides a sample conversation with a context-window of size K = 5. Context-window K = 5. Here, u λ i = i th utterance by P λ .

Methodology
ICON has been designed as a generic framework for affective modeling of conversations. Its computations can be categorized as a sequence of four successive modules: Multimodal Feature Extraction, Self-Influence Module, Dynamic Global-Influence Module, and Multi-hop Memory. Figure 2 illustrates the overall model.

Multimodal Feature Extraction
ICON adopts a multimodal framework and performs feature extraction from three modalities, i.e., language (transcripts), audio and visual.
These features are extracted for each utterance in the conversation and their concatenated vectors serve as the utterance representations. The motivation of this setup derives from previous works that demonstrate the effectiveness of multimodal features in creating rich feature representations (D'mello and Kory, 2015). These features provide complementary information from heterogeneous sources which helps to accumulate comprehensive features. Its need is particularly pronounced in videos as they are often plagued with noisy signals and missing-information within individual modalities (e.g., facial occlusion, loud background music, imperfect transcriptions).

Textual Features
We employ a convolutional neural network (CNN) to extract textual features from the transcript of each utterance. CNNs are capable of learning abstract semantic representations of a sentence based on its words and n-grams (Kalchbrenner et al., 2014). For our purpose, we utilize a simple CNN with a single convolutional layer followed by max-pooling (Kim, 2014). The input to this network consists of pre-trained word embeddings extracted from the 300-dimensional FastText embeddings (Bojanowski et al., 2016). The convolution layer consists of three filters with sizes f 1 t , f 2 t , f 3 t with f out feature maps each. We perform 1D convolutions using these filters followed by max-pooling on its output. The pooled features are finally projected onto a dense layer with dimension d t and its activations are used as the textual representation t u ∈ R dt .

Audio Features
Audio plays a significant role in determining the emotional states of a speaker (De Silva and Ng, 2000;Song et al., 2004). To extract audio features, we first format the audio of each utterance-video as a 16-bit PCM WAV file and use the open-sourced software openSMILE (Eyben et al., 2010). This tool provides high dimensional vectors for audio files that summarizes important statistical descriptors such as loudness, pitch, Mel-spectra, MFCC, etc. Specifically, we use the IS13 ComParE 1 extractor which provides 6373 features for each utterance. The features are then normalized using Min-Max scaling followed by L2-based feature selection. This selection provides low-dimensional audio features a u ∈ R da of dimensions d a .

Visual Features
Visual indicators such as facial expressions are key to understand emotions. In our work, we use a deep 3D-CNN to model spatiotemporal features of each utterance video (Tran et al., 2015). 3D-CNN helps to understand emotional concepts such as smiling or frowning that are often spread across multiple frames of a video with no predefined spatial location. The input to this network is a video with dimensions (c, h, w, f ), where c is the number of channels, h, w are the height and width of each frame, with a total of f frames per video. The network contains three blocks of convolution where each block contains two convolutional layers followed by max-pooling. For the convolution, 3D filters are employed having dimensions represents the number of feature maps, input channels, height, width, and depth of the filter, respectively. After a non-linear reLU activation (LeCun et al., 2015), max-pooling is performed using a sliding window of dimensions (m p , m p , m p ). For an input utterance video, the final features of the third convolutional block is mapped onto a dense layer of dimension d v whose activations are used as the visual features v u ∈ R dv .

Fusion
We generate the final representation of an utterance u by concatenating all three multimodal features: Concatenation is one of the most common fusion methods (Shwartz et al., 2016). Its simplicity also allows us to emphasize the contribution of the remaining components of ICON.

SIM: Self-Influence Module
Given a test utterance u t to be classified, this module independently processes the histories of both speakers. SIM consists of two GRUs, GRU s a and GRU s b , for H a and H b , respectively. For each λ ∈ {a, b}, GRU s λ attempts to model the emotional inertia of speaker P λ which represents the emotional dependency of a speaker with their own previous states. In particular, for each historical utterance u i<t ∈ H λ , an internal memory state h (j) λ is computed by GRU s λ conditioned on utterance u i and previous memory state h (j−1) λ . This can be abbreviated as h Gated Recurrent Unit: GRUs are gated recurrent cells introduced by . At time step j, GRU computes hidden state s j ∈ R dem by calculating two gates, r j (reset gate) and z j (update gate) with j th input x j and previous state s j−1 . The computations are:

DGIM: Dynamic Global Influence Module
Emotions are not only regarded as internalpsychological phenomena but also interpreted and processed communicatively through social interactions (Fiehler, 2002). Conversations exemplify such a scenario where inter-personal emotional influence persists. Theories in cognitive science also suggest the existence of emotional contagion that causes humans to mirror their counterpart's gesture, posture and emotional state (Chartrand and Bargh, 1999;Navarretta et al., 2016). Additionally, these interactions occur dynamically through the discourse of a dialogue. While modeling the contextual history, we incorporate such properties using a dynamic influence module. This module maintains a global representation of the conversation and updates it recurrently at each time step of the K-length conversation history. For any k ∈ [1, K], the global state is updated using a GRU operation on the previous state s k−1 and current speaker P λ 's SIM memory h (j) λ for the corresponding spoken utterance u (t−K+k−1) , i.e., h (j) λ = GRU s λ (u (t−K+k−1) ). Formally, DGIM consists of a GRU network, GRU g , where the k th global state s k is computed as:

R hops
Contextual History  Table 1.

Multi-hop Memory
The overall operation of the GRU g produces a sequence of memories M = [s 1 , ..., s K ] ∈ R dem×K . These memories incorporate dynamic influences from each of the K utterances spoken in the history. They serve as a contextual memory bank from which selective person-specific information can be incorporated into test utterance u t to get discriminative features. To achieve this, a series of R memory read/write cycles are performed that are coupled with soft attention for refinement of u t into a context-aware representation. The need for multiple hops is inspired by recent works on memory networks (Kumar et al., 2016;Weston et al., 2014), which suggests the importance of multiple read/write iterations for performing transitive inference. Multiple hops also help in improving the focus of attention heads which might miss essential memories in a single hop. At the r th hop, the computations are as follows: • Memory Read: An attention mechanism is used to read the memories from r th memory bank M (r) (Weston et al., 2014). First, each memory m (1) t = u t and M (1) = M ). This matching generates an attention vector p (r) attn ∈ R K whose k th normalized score represents the relevance of k th memory cell with respect to the test utterance. Inner product is used for the matching as follows: Where, sof tmax(x i ) = e x i ∑ j e x j . These scores are then used to find a weighted representation of the memories as (7) This vector denotes the summary of the context that is person-specific and based on the test utterance. Finally, the representation of the test utterance is updated by consolidating itself with the weighted memory m as: • Memory Write: After the read operation at each hop, memories are updated for the next hop. For this purpose, a GRU network, GRU m , takes the r th memory cells M (r) as input and reprocesses this sequence to generate memories M (r+1) , i.e., M (r+1) = GRU m (M (r) ). Across all hops, this write operation can be viewed as that of a stacked recurrent neural network (RNN) where each level (or hop) improves the representational output of the RNN. The parameters of GRU m are shared across all hops.
Final Prediction: We use the (R + 1) th test utterance vector u (R+1) t and get the final prediction vector through its affine transformation, For classification, dimensions of vector o is the number of classes C, i.e., o ∈ R C and categorical cross-entropy loss is used as the cost measure for training. For regression, o is a scalar (without softmax normalization) whose scores are used to calculate the mean squared error cost metric.

Datasets
We perform experiments on two benchmark datasets in dialogue-based emotion detection: IEMOCAP is a database consisting of videos of dyadic conversations between pairs of 10 speakers. Grouped into five sessions, each pair is assigned with diverse scenarios for dialogues. Videos are segmented into utterances with annotations of finegrained emotion categories. We consider six such categories for the classification task: anger, happiness, sadness, neutral, excitement, and frustration. The training set is curated using the first 8 speakers from session 1-4 while session 5 is used for testing.
SEMAINE is a video database of human-agent interactions. Here, users interact with characters whose responses are based on users' emotional state. Specifically, we utilize the AVEC 2012's fully continuous sub-challenge (Schuller et al., 2012) that requires predictions of four continuous affective dimensions: arousal, expectancy, power, and valence. The gold annotations are available for every 0.2 seconds in each video (Nicolle et al., 2012). However, to align with our problem statement, we approximate the utterance-level annotation as the mean of the continuous values within the spoken utterance. The sub-challenge provides standard training and testing splits which has been summarized in Table 2.

Training Details
20% of the training set is used as validation set for hyper-parameter tuning. We use the Adam optimizer (Kingma and Ba, 2014) for training the parameters starting with an initial learning rate of 0.001. Termination of the training-phase is decided by early-stopping with a patience of 10 d [t,a] = 100 dv = 512 dem = 100 K = 40 R = 3 epochs. The network is subjected to regularization in the form of Dropout (Srivastava et al., 2014) and Gradient-clipping for a norm of 40. Finally, the best hyper-parameters are decided using a gridsearch. Their values are summarized in Table 3. For multimodal feature extraction, we explore different designs for the employed CNNs. For text, we find the single layer CNN to perform at par with deeper variants. For visual features, however, a deeper CNN provides better representations. We also find that contextually conditioned features perform better than context-less features. Thus, in our experiments, we extract video-level contextual features for utterances from each modality using the network proposed by Poria et al. 2017b. These modified features are then used to form the multimodal utterance representations using equation 3.

Baselines
We compare our proposed model with multiple state-of-the-art networks in multimodal utterancelevel emotion detection.
• memnet (Sukhbaatar et al., 2015) is an end-toend memory network. For comparison, we modify our network to adopt their embedding-based memory-encoding in the multi-hop stage.
• MFN (Zadeh et al., 2018a) performs multi-view learning by using Delta-memory Attention Network, a fusion mechanism to learn cross-view interactions. Similar to TFN, the modeling is performed within utterances.
• CMN (Hazarika et al., 2018) models separate contexts for both speaker and listener to an utterance. These contexts are stored as memories and combined with test utterance using attention mechanism.

Results
Tables 4 and 5 present the results on the IEMO-CAP and SEMAINE testing sets, respectively. In Table 4, we evaluate the mean classification performance using Weighted Accuracy (acc.) and F1-Score (F1) on the discrete emotion categories. ICON performs better than the compared models with significant performance increase in emotions (∼2.1% acc.). For each emotion, ICON outperforms all the compared models except for happiness emotion. However, its performance is still at par with cLSTM without a significant gap. Also, ICON manages to correctly identify the relatively similar excitement emotion by a large margin.
In Table 5, evaluations of the four continuous labels from SEMAINE are performed using Mean   Absolute Error (MAE) and Pearson's Correlation Coefficient (r). In all the labels, ICON attains improved performance over its counterparts, suggesting the efficacy of its context-modeling scheme.
Hyperparameters: We plot the performance trends of ICON on the IEMOCAP dataset concerning the two main hyperparameters, R (number of hops) and K (context-window size). For R, the performance initially improves showing the importance of multiple hops in the memories. However, with a further increase, the hopping recurrence deepens and causes the vanishing gradient problem. This leads to decrease in performance. The best performance is obtained at R = 3. For K, similar trends are observed where performance improvement is seen by increasing the number of historical utterances. The best results are obtained for K = 40 which also aligns with the average number of historical utterances in the dataset (Table 2). Further increase in context does not provide relevant information and rather leads to performance degradation due to model confusion.

Multimodality:
We investigate the importance of multimodal features for our task. Table 6 presents the results for different combinations of modes used by ICON on IEMOCAP. As seen, the trimodal network provides the best performance which is preceded by the bimodal variants. Among  unimodals, language modality performs the best, reaffirming its significance in multimodal systems. Interestingly, the audio and visual modality, on their own, do not provide good performance, but when used with text, complementary data is shared to improve overall performance.

Ablation Study
To check the importance of the modules present in ICON, we perform an ablation study where we remove constituent components and evaluate the model's performance. Table 7 provides the results on this study. In the first variant, none of the histories and the associated context-modeling is used. This provides the worst relative performance.
Self vs Dual History: We evaluate the scenarios where only self-history of the speaker is considered (variants 2, 4, and 6). Compared to the dual-history variants (variants 3, 5, and 7), these models provide lesser performance. Reasons involve the provision of partial information from the conversational histories. Similar trends can be seen for the cLSTM model in Table 4 which works in the same regime. DGIM prevents the storage of dynamic influences between speakers at each historical time step and leads to performance deterioration.
Multi-hop vs No-hop: Variants 2 and 3 represent cases where multi-hop is omitted, i.e., R = 1. Performance for them are poorer than variants having multi-hop mechanism (variants 4-7). Also, removal of multi-hop leads to worse performance than the removal of DGIM. This suggests that multi-hop is more crucial than the latter. However, best performance is achieved by variant 6 which contains all the proposed modules in its pipeline.

Dependency on distant history
For all the test utterances of IEMOCAP correctly classified by ICON, we analyze the global memories receiving the highest attention. First, we divide the conversational history (context-length K = 40) into three regions: long, short, and medium. Figure 4 provides a summary of how much the model attends each of these regions. The short region (labeled green) covering 10 utterances, corresponds to conversational history just preceding the test utterance. Utterances which occur more than 30 time steps behind the current test utterance are considered part of the long region (labeled red). Remaining utterances in between fall on the medium region (labeled blue).
The distribution of top-valued attention scores across the histories reveal interesting insights. Most of the correctly classified instances focus on the immediate or short history. In other words, 63% of the time, at least one of the top-5 attention value belongs to a memory in the short-history range. A significant share is also present for distant history (22%). This result indicates the presence of long-term emotional dependencies and the need to consider histories far away from the current test utterance. u11 u12 a) Self-Emotional Influence b) Inter-speaker Influence Figure 6: Case studies for emotional influence. 20 memories in the history which are nearest to test utterance, i.e. k ∈ [21,40] are visualized from the trained ICON.

Dynamic Modeling of Global Memories:
ICON holds the capability to model dynamic interactions between speakers. The memories by its DGIM ( §4.3) are used to create summaries conditioned on the test utterance. Consequently, these summaries contain characteristics that are specific to the affective state of the current speaker (of the test utterance). Figure 5 presents a sample slice of conversation from the dataset. As seen, summary selection for Person A varies from Person B. Such differences arise due to person-specific characteristics and unique affective interpretations of the conversation. Apart from the inter-speaker variance, the emotional state of a speaker also varies across turns.

Case Studies
To understand ICON's behavior while processing the global memories through multi-hop, we manually explore the utterances in the testing set of IEMOCAP. Figure 6 presents two cases which provide traces of self and inter-personal emotional influences and were correctly classified by ICON. Both the figures show the trend where multiple hops gradually improve the focus of attention mechanism on relevant memories.
In Figure 6a, person P a registers a complaint to an operator P b . Throughout the dialogue, P a maintains an angry demeanor while P b remains calm and neutral (u 14 ). While classifying utterance u 15 , ICON focuses more on the histories uttered by P a (u 6 , u 8 , and u 11 ). This demonstrates ICON's ability to model self-emotional influences. It should be noted that emotion of P a here also depends on the utterances of P b but compared to self-utterances, this dependency is much less. Figure 6b presents another scenario where a couple argue over an alleged affair.
A man (P b ) is angry over this fact and questions his partner (P a ) asking for details. The woman tries to behave unperturbed by providing neutral responses (u 12 , u 16 ) but is eventually affected by P b 's continuous anger and expresses a frustrated response (u 18 ). These characteristics are captured by the attention mechanism applied on the global memories (generated by DGIM), which finds contextual information from histories that are relevant to the test utterance u 18 . This example displays the role of inter-speaker influences and how ICON processes such dependencies.

Conclusion
In this paper, we presented ICON, a multimodal framework for emotion detection in conversations. ICON capitalizes on modeling contextual information that incorporates self and inter-speaker influences. We accomplish this by using an RNNbased memory network with multi-hop attention modeling. Experiments show that ICON outperforms state-of-the-art models on multiple benchmark datasets. Extensive evaluations and case studies demonstrate the effectiveness of our proposed model. Additionally, the ability to visualize the attentions brings a sense of interpretability to the model, as it allows us to investigate which utterances in the conversational history provide important emotional cues for the current emotional state of the speaker.
In the future, we plan to test ICON on other relevant dialogue-based applications and also use it for empathetic dialogue generation.