COIN: Conversational Interactive Networks for Emotion Recognition in Conversation

Emotion recognition in conversation has received considerable attention recently because of its practical industrial applications. Existing methods tend to overlook the immediate mutual interaction between different speakers at the speaker-utterance level, or apply a single speaker-agnostic RNN to utterances from different speakers. We propose COIN, a conversational interactive model that mitigates this problem by applying state mutual interaction within history contexts. In addition, we introduce a stacked global interaction module to capture contextual and inter-dependency representations in a hierarchical manner. To improve robustness and generalization during training, we generate adversarial examples by applying minor perturbations to the multimodal feature inputs, unveiling the benefits of adversarial examples for emotion detection. The proposed model empirically achieves the current state-of-the-art results on the IEMOCAP benchmark dataset.


Introduction
Emotion recognition in conversation (ERC), which aims to detect a speaker's emotions and sentiments within the context of human conversations, has attracted extensive interest due to the prevalence of user-generated content on social media platforms, such as conversational messages and videos (Poria et al., 2017; Hazarika et al., 2018b; Hazarika et al., 2021). Recent works on ERC adopt recurrent neural networks (RNNs) to first learn the sequential utterances in conversations and then leverage a high-level context extractor, such as CMN (Hazarika et al., 2018b), DialogueRNN (Majumder et al., 2019), or DialogueGCN (Ghosal et al., 2019), to capture the global contextual representation for emotion detection. This two-step scheme has proven effective for ERC and can be divided into two categories: one models each speaker with its own RNN (Hazarika et al., 2018b); the other is speaker-agnostic, i.e., it models every utterance with a single RNN irrespective of its speaker (Poria et al., 2017). However, there is no direct dyadic interaction between speaker-specific RNNs in previous work: different RNNs corresponding to different speakers are used without mutual interaction (Hazarika et al., 2018b) or interact only through an intermediate global RNN.

* Equal contribution
In this paper, the proposed Conversational Interactive Networks (COIN) employ immediate coupling interaction at each state of different speakers and adopt a global extractor to capture contextual and self-dependency representations for the emotion classifier. To enhance the generalization and robustness of our model, we generate adversarial examples by applying minor perturbations to the multimodal embeddings for adversarial training (AT) (Goodfellow et al., 2014).
Our work illustrates that dyadic interaction advances the performance of multimodal emotion recognition in conversation by incorporating mutual interaction and applying adversarial training.
Our key contributions are threefold: • We introduce state mutual interaction components to allow for immediate state interaction between different speakers, and a stacked global interaction module to capture contextual and inter-dependency representations. • We generate adversarial examples by perturbing the multimodal feature inputs, improving robustness and generalization during training. • Our model achieves state-of-the-art results on the IEMOCAP benchmark dataset.

Methodology
This section is organized as follows: Sec. 2.1 describes the definition of the ERC task; Sec. 2.2 introduces the approach to extracting multimodal features; Sec. 2.3 gives a detailed description of the proposed model.

Task Definition
Let there be M parties or speakers {p_1, p_2, ..., p_M} in a human conversation (M = 2 in our experiments). Given the utterances {u_1, u_2, ..., u_N} from a conversation, where utterance u_t is spoken by the corresponding speaker p_{s(u_t)}, the task of ERC is to detect the most likely class from the emotion category set C. Here s represents the mapping between utterances and speakers.
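As an illustration, the task input can be organized as utterance-speaker pairs with the mapping s recovered by index; the utterances and names below are illustrative placeholders, not data from the paper.

```python
# The six IEMOCAP emotion categories used in this task.
EMOTIONS = ["happy", "sad", "neutral", "angry", "excited", "frustrated"]

# A dyadic conversation (M = 2): utterances u_1..u_N paired with their speakers.
conversation = [
    ("I can't believe it.", "A"),
    ("There's still time for that.", "B"),
    ("Well, I'm excited for you.", "B"),
]

def speaker_of(t):
    """s(u_t): map the 1-based utterance index t to its speaker."""
    return conversation[t - 1][1]
```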

Multimodal Feature Extraction
We extract multimodal features using the same setting as prior work for a fair comparison. Multimodal features are simply concatenated along the feature dimension in our systems.
Textual Feature We employ multi-channel 1-D convolutional neural networks (CNNs) along the sequential dimension to extract n-gram lexical features with kernel sizes of {3, 4, 5}. Then a global max-pooling layer followed by a linear projection produces the utterance representation. This CNN is trained on emotion classification at the sentence level.
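A minimal numpy sketch of this n-gram extraction follows: multi-channel 1-D convolutions with kernel sizes {3, 4, 5} over word embeddings, global max-pooling per channel, and a linear projection. The filter count, embedding size, and random weights are illustrative assumptions, not the paper's trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def textual_features(emb, kernel_sizes=(3, 4, 5), n_filters=50, d_out=100):
    """Sketch of utterance-level textual feature extraction.
    emb: (T, d_emb) word embeddings for one utterance."""
    T, d_emb = emb.shape
    pooled = []
    for k in kernel_sizes:
        # Hypothetical randomly-initialized conv filters of width k.
        W = rng.standard_normal((k * d_emb, n_filters)) * 0.1
        # Valid 1-D convolution along the sequence dimension.
        conv = np.stack([emb[t:t + k].reshape(-1) @ W for t in range(T - k + 1)])
        pooled.append(conv.max(axis=0))          # global max-pool per channel
    h = np.concatenate(pooled)                    # (len(kernel_sizes) * n_filters,)
    W_proj = rng.standard_normal((h.shape[0], d_out)) * 0.1
    return h @ W_proj                             # utterance representation

u_text = textual_features(rng.standard_normal((20, 300)))  # 20 tokens, 300-d embeddings
```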
Acoustic Feature We use the openSMILE toolkit (Eyben et al., 2010; https://www.audeering.com/opensmile/) to extract speech features such as Mel-frequency cepstral coefficients (39 features) and pitch. Z-standardization is applied to normalize the low-dimensional feature vectors.
Visual Feature A 3D-CNN (Tran et al., 2015) is leveraged to obtain visual features from dialogue videos, followed by a ReLU activation and a max-pooling operation.
The multimodal inputs of the utterances are first fed into the feature extractors to obtain multimodal features. Then we adopt Gated Recurrent Units (GRUs) (Chung et al., 2014) to capture the dialogue history of the dyadic speakers A/B, followed by mutual interaction for each state at the utterance level. Afterward, the concatenated bidirectional mutual history vectors are fed into a stacked contextual interaction module to capture the inter-dependency between the current and history dialogue states.
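The fusion step above reduces to z-standardization of the acoustic vectors and concatenation along the feature dimension. The sketch below uses hypothetical per-modality dimensions (100 each), chosen for illustration only.

```python
import numpy as np

def z_standardize(x, eps=1e-8):
    """Z-standardization of low-dimensional feature vectors, per feature column."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

rng = np.random.default_rng(1)
N = 5                                                  # utterances in the dialogue
text = rng.standard_normal((N, 100))                   # CNN textual features
audio = z_standardize(rng.standard_normal((N, 100)))   # openSMILE features, standardized
video = rng.standard_normal((N, 100))                  # 3D-CNN visual features

# Modalities are simply concatenated along the feature dimension.
u = np.concatenate([text, audio, video], axis=-1)
```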

Model Architecture
Speaker Mutual Interaction for Dialogue History Let u_i ∈ R^d denote the extracted d-dimensional multimodal features of the i-th utterance spoken by speaker P, and let K be the dialogue history length. We use GRUs in two directions to capture the utterance-level speaker dialogue context. For the forward GRU, we have

→h_i^P = GRU(u_i, →h_{i-1}^P),

where →h_i^P ∈ R^d indicates the hidden state of speaker P at step i. The history utterance sequence of speaker P is denoted as U_P.
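The recurrence above can be unrolled as below. This is a minimal from-scratch GRU cell (Chung et al., 2014) with randomly initialized weights, intended only to show how speaker P's history U_P is encoded step by step; it is not the paper's trained model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell; weights are random stand-ins for illustration."""
    def __init__(self, d_in, d_hid, seed=0):
        rng = np.random.default_rng(seed)
        s = 0.1
        self.Wz, self.Uz, self.bz = rng.standard_normal((d_in, d_hid)) * s, rng.standard_normal((d_hid, d_hid)) * s, np.zeros(d_hid)
        self.Wr, self.Ur, self.br = rng.standard_normal((d_in, d_hid)) * s, rng.standard_normal((d_hid, d_hid)) * s, np.zeros(d_hid)
        self.Wh, self.Uh, self.bh = rng.standard_normal((d_in, d_hid)) * s, rng.standard_normal((d_hid, d_hid)) * s, np.zeros(d_hid)

    def step(self, u, h):
        z = sigmoid(u @ self.Wz + h @ self.Uz + self.bz)            # update gate
        r = sigmoid(u @ self.Wr + h @ self.Ur + self.br)            # reset gate
        h_tilde = np.tanh(u @ self.Wh + (r * h) @ self.Uh + self.bh)
        return (1 - z) * h + z * h_tilde

# Encode speaker P's history U_P = (u_1, ..., u_K) in the forward direction.
d = 100
gru = GRUCell(d, d)
U_P = np.random.default_rng(2).standard_normal((8, d))   # K = 8 history utterances
h = np.zeros(d)
states = []
for u_i in U_P:
    h = gru.step(u_i, h)     # h_i^P = GRU(u_i, h_{i-1}^P)
    states.append(h)
```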
We compute the mutual interaction at each history step i by linearly regulating each GRU output with the previous hidden state of the other speaker Q. In the forward direction, we have

→m_i = σ(W →h_i^P + U →h_{i-1}^Q + b) ⊙ →h_i^P,

where →h_0^P represents the initial hidden state of speaker P, σ(·) is the sigmoid function, ⊙ denotes element-wise multiplication, and W, U ∈ R^{d×d}, b ∈ R^d are trainable parameters. The identical but reversed operation is applied in the backward direction. The outputs of the forward and backward directions at step i are concatenated along the feature dimension, denoted as ↔m_i.
Stacked Contextual Interaction The contextual encoder consists of L identical stacks. In the l-th layer, we feed the history dialogue representations M^l into a bi-GRU followed by a self-attention (SA) layer to capture the inter-dependency semantics. In the first layer, M^l is the sequence of encoded contexts ↔m_i; for intermediate stacks it is the bi-GRU output of the previous layer, i.e., M^l = M_g^{l-1} (l > 1).
Denoting the output of the bi-GRU as M_g^l ∈ R^{K×2d}, the scaled dot-product self-attention is calculated as

M_att = softmax(M_g^l (M_g^l)^T / √(2d)) M_g^l,

where M_att ∈ R^{K×2d} is passed into the bi-GRU of the next interaction stack as the dialogue context. Given the encoded utterance u_t^l ∈ R^{2d} at the l-th layer (the linearly projected multimodal features when l = 0), we calculate the context vector over the history dialogue:

α_t = softmax(M_att u_t^l / √(2d)),   u_t = M_att^T α_t,

where the output u_t is used as the input of the (l+1)-th layer (i.e., u_t^{l+1}). Emotion Classifier We use the L-th stack's output vector u_t to obtain the final emotion prediction through a linear transformation:

ŷ = softmax(W_e u_t + b_e).
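The self-attention step can be sketched directly from the formula: queries, keys, and values are all the bi-GRU outputs M_g^l, scaled by √(2d). The sizes (K = 40, 2d = 200) mirror the hyperparameters reported later but the inputs are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(M_g):
    """Scaled dot-product self-attention over bi-GRU outputs M_g in R^{K x 2d}."""
    K, two_d = M_g.shape
    scores = M_g @ M_g.T / np.sqrt(two_d)   # (K, K) pairwise inter-dependency
    return softmax(scores, axis=-1) @ M_g   # M_att in R^{K x 2d}

rng = np.random.default_rng(4)
M_att = self_attention(rng.standard_normal((40, 200)))   # K = 40 history steps, 2d = 200
```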

Training
Let u represent the multimodal features. The cross-entropy loss L_xe between ŷ and the gold label y is used for training. To improve generalization, we generate adversarial examples with the model parameterized by θ, as in (Goodfellow et al., 2014), by adding perturbations to the extracted multimodal features:

u_adv = u + ε g/‖g‖_2, where g = ∇_u L_xe(θ; u),

and ε ∈ R is selected on the held-out set. The final training objective combines the clean and adversarial losses:

L = L_xe(θ; u) + L_xe(θ; u_adv).

Experiments
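The perturbation step reduces to a single L2-normalized gradient step on the input features, as sketched below. The gradient here is a random stand-in; during training it comes from backpropagating L_xe to the multimodal inputs.

```python
import numpy as np

def adversarial_example(u, grad, eps=5.0):
    """Generate an adversarial input (Goodfellow et al., 2014):
    u_adv = u + eps * g / ||g||_2, with g the gradient of the loss w.r.t. u.
    eps = 5 matches the value selected on the validation set."""
    return u + eps * grad / np.linalg.norm(grad)

rng = np.random.default_rng(5)
u = rng.standard_normal(300)      # concatenated multimodal features
g = rng.standard_normal(300)      # placeholder for grad of L_xe w.r.t. u
u_adv = adversarial_example(u, g, eps=5.0)
```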

Experimental Setup
Dataset We evaluate our model on the IEMOCAP dataset (Busso et al., 2008).
Training Details We train with an initial learning rate of 1e-3 and employ exponential annealing with base 2 to adjust the learning rate. For adversarial training, we select ε = 5 using the validation set. To avoid overfitting, we apply dropout with keep rate p ∈ {0.2, 0.3, 0.4} and early stopping with a patience of 10 epochs during training. The optimal hyperparameter settings are L = 3, d = 100, K = 40, and p = 0.3. We use an NVIDIA 2080 Ti GPU for all experiments.
Table 1 summarizes the performance of the proposed model compared with baseline models; our model surpasses previous baselines on both averaged accuracy and F1. We found that our model ranks first on the "angry" and "frustrated" classes and achieves comparable results on the other emotion classes.
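The base-2 exponential annealing schedule can be sketched as below; interpreting it as halving the rate once per epoch is an assumption about the schedule's granularity.

```python
def annealed_lr(epoch, lr0=1e-3, base=2.0):
    """Exponential learning-rate annealing with base 2 from the initial rate 1e-3.
    The per-epoch decay interpretation is an assumption, not stated in the paper."""
    return lr0 * base ** (-epoch)

lrs = [annealed_lr(e) for e in range(3)]   # halves each epoch from 1e-3
```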

Results
Ablation Study We conduct an ablation study on multi-modality (Fig. 2a), adversarial training, and the Speaker Mutual Interaction (SMI) module (Fig. 2b). Model performance reaches its peak with an embedding size of 100 (Fig. 2c) and layer number L = 3 (Fig. 2d). Fig. 2b shows the influence of adversarial training and SMI on emotion detection. We further conduct experiments applying adversarial training to various baselines, finding that our model achieves the best results among them; see Appendix B for discussion. Fig. 2a shows that among the uni-modality settings, textual features contribute most, followed by acoustic features, whereas video features perform worst in our system. We conjecture that the high-level visual features extracted by the 3D-CNN lack fine-grained facial representations, which calls for further improvement. Among the dual-modality settings, textual plus acoustic features contribute most to predicting emotion categories, in comparison with the tri-modal fusion setting.
Fig. 3 shows an instance of a dialogue snippet in which our model captures the emotion dynamics of the male speaker over the course of the conversation. Using different RNNs to model different speakers' utterances may circumvent the fluctuation of emotion transitions and effectively capture the emotion transitions of disparate speakers. This is also observed in more examples in Appendix C.

Conclusion
We propose a new dialogue contextual interaction architecture that focuses on compact interaction for both the speaker-level dialogue history and the current utterance. By adopting adversarial training, our model achieves state-of-the-art performance on the IEMOCAP dataset for emotion recognition in conversation. In the future, multimodal fusion methods could be investigated to capture richer modality-interactive representations at the modality level.

A Error Analysis

Fig. 4 illustrates the confusion matrix of predicted emotions. We found that negative sentiments such as "sad" and "angry" can easily be mispredicted as "frustrated", and vice versa. "Happy" exhibits the worst performance among the six categories, as it is difficult for the model to distinguish from "excited". This is in line with our manually observed predictions: it is sometimes not obvious even for a human to distinguish emotions of similar polarity, such as "sad" and "frustrated", or "happy" and "excited". Further study on learning sentiments of similar polarity may remedy such confusion.

B Experiments on Adversarial Training
To verify the advantage of our model when using adversarial training (AT), we further conduct experiments on different baseline models and report the results in Table 2. Our model outranks the others in overall performance, demonstrating its advantage. We also observe that emotion recognition models do not necessarily improve after incorporating AT. Specifically, models that use a single RNN to model the speakers' utterances, such as DialogueRNN and DialogueGCN, show a performance drop after adding AT, whereas models using separate RNNs for different speakers, like ICON and ours, show improvement. We conjecture that the emotion dynamics of different speakers may vary, so the sensitivity of emotion models is affected by adversarial noise derived from the conversational context. If different RNNs are adopted to model different speakers' utterances, the noise depends largely on the current speaker's utterances rather than on the noisy dialogue context, which eases the learning of emotion transitions.

C More Examples

[Fig. 3, panels (a)-(d): IEMOCAP dialogue snippets with predicted emotion labels ("Frustrated", "Sad", "Angry"), illustrating how the model tracks the emotion transitions of different speakers across a conversation.]