An Iterative Emotion Interaction Network for Emotion Recognition in Conversations

Emotion recognition in conversations (ERC) has received much attention recently in the natural language processing community. Considering that the emotions of the utterances in a conversation are interactive, previous works usually model the emotion interaction between utterances implicitly by modeling the dialogue context, but misleading emotion information from the context often interferes with the emotion interaction. We notice that the gold emotion labels of the context utterances can provide explicit and accurate emotion interaction, but it is impossible to input gold labels at inference time. To address this problem, we propose an iterative emotion interaction network, which uses iteratively predicted emotion labels instead of gold emotion labels to explicitly model the emotion interaction. This approach solves the above problem and effectively retains the performance advantages of explicit modeling. We conduct experiments on two datasets, and our approach achieves state-of-the-art performance.


Introduction
Emotion recognition in conversations (ERC) aims to recognize the emotion of each utterance in a conversation. Recently it has received much attention due to its applications in various conversation scenarios, such as emotional chatbots (Zhou et al., 2018), emotion detection of customers in artificial services (Song et al., 2019), sentiment analysis of comments in social media (Chatterjee et al., 2019), and so on.
Different from the common sentence-level emotion recognition task, ERC is special due to some characteristics. The first is that the utterances are context-dependent, and modeling the context can provide more information for emotion recognition (Poria et al., 2017; Jiao et al., 2019). The second characteristic of ERC is that the utterances are speaker-sensitive; thus many researchers have modeled the states of speakers and the inter-speaker dependency relations (Hazarika et al., 2018b; Ghosal et al., 2019). In this paper, we observe another characteristic: the emotions of the utterances are interactive. For example, in Figure 1, the emotion of the utterance from Speaker A can directly influence Speaker B. Thus, modeling the emotion interaction between utterances is helpful for the ERC task.
Previous works usually model the emotion interaction implicitly by modeling the dialogue context (Poria et al., 2017; Jiao et al., 2019). However, because of the arbitrariness of dialogue, the context utterances often convey misleading emotion information when recognizing the emotion of the target utterance, as in Figure 1(a). To solve this problem, we observe that the gold emotion labels (such as "happy" or "angry") of the context utterances can provide explicit and accurate emotion interaction between utterances, as in Figure 1(b). Thus, we can introduce the emotion labels to explicitly model the emotion interaction between utterances.
However, a challenging problem is that this approach requires inputting the gold labels of the context utterances, which is impossible at inference time. We observe that the utterances which are helpful for recognizing the emotion of the target utterance are usually near the target utterance, and their number is often limited. As long as the emotion labels of these utterances are correct, the emotion recognition of the target utterance can benefit from our explicit approach. We therefore speculate that even if only part of the emotion labels are correct, the correct part can still help related utterances recognize their emotions better. This process can be iterated, making the emotion recognition results better and better. Experiments in Section 3.5 also confirm this.

[Figure 1 shows two panels with the same two-turn conversation. Speaker A (context): "There will still be a blackout today, it is really great." Speaker B (current): "Same here, I don't understand what they are doing." In panel (a), Speaker B's emotion is wrongly predicted as Happy; in panel (b), it is correctly predicted as Anger.]

Figure 1: A short conversation example which shows the difference between two methods of modeling emotion interaction. (a) The emotion prediction for Speaker B is wrong because of interference from Speaker A's sarcasm. (b) Because the emotion interaction is modeled explicitly, the interference from Speaker A's sarcasm is reduced and the emotion prediction for Speaker B is right.

Based on the above idea, we propose an iterative emotion interaction network for emotion recognition in conversations. This network explicitly models the emotion interaction between utterances and, at the same time, solves the problem of having no gold labels at inference time through an iterative improvement mechanism. Specifically, we first adopt an utterance encoder to obtain the representations of the utterances and make an initial prediction of the emotions of all utterances. Next, we integrate the initial prediction and the utterances with an emotion interaction based context encoder to make an updated emotion prediction. Finally, we use the iterative improvement mechanism to iteratively update the emotions, in which a loss function is employed to constrain the prediction of each iteration and the correction behavior between two adjacent iterations.
The contributions of this work are summarized as follows:
• We explicitly model the emotion interaction between utterances, which is superior to previous works that model the emotion interaction only implicitly.
• We propose an iterative emotion interaction network, which not only explicitly models the emotion interaction between utterances, but also solves the problem of no gold labels at inference time.
• We conduct experiments on the IEMOCAP dataset and the MELD dataset. Experimental results show that our approach achieves state-of-the-art performance.

Method
In this section, we introduce our proposed iterative emotion interaction network, shown in Figure 2. Our network consists of three components: an utterance encoder, an emotion interaction based context encoder, and an iterative improvement mechanism. The utterance encoder is used to obtain the representations of all utterances in a conversation. The emotion interaction based context encoder introduces the emotion probabilities of the utterances and integrates them with the utterance representations to explicitly model the emotion interaction. The iterative improvement mechanism contains the initial emotion prediction, the iterative emotion feedback, and the loss for iteration, which combine the above two encoders to iteratively improve the emotion predictions. In the following sections, we describe these components in detail.

Figure 2: Overview of our proposed iterative emotion interaction network. The utterance encoder is used to obtain the utterance representations. The emotion interaction based context encoder (EC-Encoder) is used to explicitly model the emotion interaction and obtain the updated emotion probabilities. The iterative improvement mechanism (including the initial emotion prediction, the iterative emotion feedback, and the loss for iteration) is used to build the iterative framework and calculate the loss for iteration. These components work together to improve the final performance.

Utterance Encoder
In our framework, the goal of the utterance encoder is to obtain a representation for each utterance. Given an utterance u = {w_1, w_2, ..., w_M} consisting of a sequence of M words, we first obtain the embeddings {x_1, x_2, ..., x_M} by feeding the words into the word embedding layer, which is initialized with pretrained word embeddings. A BiGRU is then used to capture the contextual information around each word w_i into a single vector h_i:

h_i = BiGRU(x_i), i ∈ [1, M]

To obtain a single vector representation for the utterance u, we aggregate the sequence of hidden states {h_1, h_2, ..., h_M} with an attention mechanism:

α_i = softmax(w⊤ tanh(W h_i + b)), u = Σ_{i=1}^{M} α_i h_i

where u is the vector representation of the utterance u. Similarly, given a conversation C = {u_1, u_2, ..., u_N}, the sequence of all utterance representations can be denoted as U = {u_1, u_2, ..., u_N}.
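The utterance encoder above can be sketched in PyTorch as a BiGRU followed by additive attention pooling. The module and layer names here are illustrative assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class UtteranceEncoder(nn.Module):
    """BiGRU over word embeddings, then attention pooling to one vector per utterance."""
    def __init__(self, vocab_size, emb_dim=300, hidden=50):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.bigru = nn.GRU(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.att_proj = nn.Linear(2 * hidden, 2 * hidden)
        self.att_score = nn.Linear(2 * hidden, 1, bias=False)

    def forward(self, word_ids):
        # word_ids: (batch, M) word indices, one utterance per row
        x = self.embedding(word_ids)                            # (batch, M, emb_dim)
        h, _ = self.bigru(x)                                    # (batch, M, 2*hidden)
        scores = self.att_score(torch.tanh(self.att_proj(h)))   # (batch, M, 1)
        alpha = torch.softmax(scores, dim=1)                    # attention over words
        return (alpha * h).sum(dim=1)                           # (batch, 2*hidden): utterance vector u

encoder = UtteranceEncoder(vocab_size=100)
u = encoder(torch.randint(0, 100, (4, 12)))   # 4 utterances, 12 words each
print(u.shape)                                # torch.Size([4, 100])
```

The hidden size of 50 per direction mirrors the IEMOCAP setting reported in Section "Experimental Settings"; all other sizes are placeholders.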

Emotion Interaction Based Context Encoder
The emotion interaction based context encoder is used to explicitly model the emotion interaction. It introduces the emotion probabilities of the utterances and integrates them with the utterance representations to achieve this goal. It consists of three components: an emotion embedding layer, a BiGRU encoder, and an emotion classifier. It takes the utterance representations U = {u_1, u_2, ..., u_N} and the context emotion probabilities P = {p_1, p_2, ..., p_N} as inputs and outputs the updated version of P, denoted P′. Thus, it is also the basic unit of iterative improvement in our framework.
Emotion Embedding Let L = {l_1, l_2, ..., l_|L|} denote the set of emotion labels, and map each label l_k to an embedding vector x_k which represents this emotion. For each utterance emotion probability vector p_i ∈ P, we write p_i = {p_i^1, p_i^2, ..., p_i^|L|} and use these probabilities as weights to obtain the utterance emotion representation e_i, a weighted sum of all emotion embeddings:

e_i = Σ_{k=1}^{|L|} p_i^k x_k

Based on the above, we obtain the context emotion representations E = {e_1, e_2, ..., e_N}.

BiGRU Encoder For each utterance u_i, we concatenate u_i ∈ U and e_i ∈ E, and feed the result into the GRU units:

h_i = BiGRU([u_i; e_i])

where h_i is the hidden state, included in the context hidden states H = {h_1, h_2, ..., h_N}.

Emotion Classifier Each h_i ∈ H is fed into the emotion classifier, which is a softmax layer:

p′_i = softmax(W_e h_i + b_e)

where p′_i is the updated emotion probability vector, included in the updated context emotion probabilities P′.
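Putting the three components together, the EC-Encoder can be sketched as follows. This is a minimal reading of the description above; the layer names and sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ECEncoder(nn.Module):
    """Emotion interaction based context encoder (EC-Encoder) sketch:
    emotion embedding -> BiGRU over [utterance; emotion] -> softmax classifier."""
    def __init__(self, utt_dim=100, emo_dim=32, hidden=66, n_labels=6):
        super().__init__()
        self.emotion_emb = nn.Embedding(n_labels, emo_dim)   # one vector per emotion label
        self.bigru = nn.GRU(utt_dim + emo_dim, hidden,
                            bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_labels)

    def forward(self, U, P):
        # U: (batch, N, utt_dim) utterance vectors; P: (batch, N, n_labels) probabilities
        E = P @ self.emotion_emb.weight                   # e_i = sum_k p_i^k x_k
        h, _ = self.bigru(torch.cat([U, E], dim=-1))      # context hidden states H
        return torch.softmax(self.classifier(h), dim=-1)  # updated probabilities P'

ec = ECEncoder()
U = torch.randn(2, 5, 100)           # 2 conversations, 5 utterances each
P = torch.full((2, 5, 6), 1.0 / 6)   # uniform initial emotion probabilities
P_new = ec(U, P)
print(P_new.shape)                   # torch.Size([2, 5, 6])
```

Because the output has the same shape and meaning as the input probabilities, the module can be applied to its own output, which is exactly what makes it the basic iterative unit.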

Iterative Improvement Mechanism
The iterative improvement mechanism is the core of our proposed approach. It consists of three parts: the initial emotion prediction, the iterative emotion feedback, and the loss for iteration. These three parts combine the above two encoders into an iterative framework which can iteratively improve the emotion predictions. In this section, we introduce these three parts in detail.
Initial Emotion Prediction Generally, the initial value is an important part of an iteration. In our framework, we obtain the initial context emotion probabilities P^0 by feeding the utterance representations U into a softmax layer:

p_i^0 = softmax(W_0 u_i + b_0)

where u_i is an utterance representation from U, and p_i^0 is the initial emotion probability vector, included in P^0.
Iterative Emotion Feedback This component is crucial for achieving iterative improvement. As mentioned in Section 2.2, the basic iterative unit is the emotion interaction based context encoder (EC-Encoder), which takes the context emotion probabilities as input and outputs an updated version. The iterative emotion feedback uses the updated context emotion probabilities as the input of the EC-Encoder again, thereby achieving an iterative update of the emotion prediction.
Formally, the process of obtaining the updated context emotion probabilities in the i-th step can be defined as:

P^i = EC-Encoder(U, P^{i−1})

where i ≥ 1, U is the set of utterance representations, P^{i−1} is the context emotion probabilities at step i−1, and P^i is the context emotion probabilities at step i.
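The feedback loop can be sketched in a few lines. The encoder arguments below are stand-ins for the modules described in the previous sections; keeping every step's output lets the loss constrain each iteration:

```python
import torch
import torch.nn as nn

def iterative_predict(U, init_classifier, ec_encoder, T=3):
    """Run the iterative emotion feedback: compute P^0 from the initial
    classifier, then P^i = EC-Encoder(U, P^{i-1}) for i = 1..T.
    Returns the list [P^0, ..., P^T]."""
    P = torch.softmax(init_classifier(U), dim=-1)   # initial prediction P^0
    history = [P]
    for _ in range(T):
        P = ec_encoder(U, P)                        # feed the output back in
        history.append(P)
    return history

# Tiny stand-in modules just to exercise the loop (not the real encoders):
init_clf = nn.Linear(100, 6)
mix = nn.Linear(100 + 6, 6)
ec = lambda U, P: torch.softmax(mix(torch.cat([U, P], dim=-1)), dim=-1)

U = torch.randn(2, 5, 100)
steps = iterative_predict(U, init_clf, ec, T=3)
print(len(steps))   # 4 probability tensors: P^0 ... P^3
```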
Loss for Iteration To achieve iterative improvement, we design a loss to constrain the prediction of each iteration and the correction behavior between two adjacent iterations.
For each iteration i, we use the cross-entropy function to obtain the loss:

L_ce^i = −(1/N_a) Σ_{j=1}^{N_a} Σ_{k=1}^{|L|} y_{j,k} log p_{j,k}^i

We add a margin-ranking loss between two adjacent iterations, which punishes incorrect modification:

L_mr^i = (1/N_a) Σ_{j=1}^{N_a} Σ_{k=1}^{|L|} max(0, y_{j,k}(p_{j,k}^{i−1} − p_{j,k}^i))

The final loss can then be defined as:

L = Σ_{i=0}^{T} L_ce^i + λ Σ_{i=1}^{T} L_mr^i

where T is a hyperparameter representing the number of iterations, N_a is the number of all utterances in the dataset, and |L| is the number of emotion labels. y_j denotes the one-hot vector of gold labels, and y_{j,k} is the element of y_j for emotion k. Similarly, p_{j,k}^i and p_{j,k}^{i−1} are the elements of p_j^i and p_j^{i−1} for emotion k. In addition, λ is a hyperparameter that balances the two types of losses.
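A minimal sketch of this loss over the step-wise predictions follows. The exact margin-ranking formulation is not fully recoverable from the text, so the second term here is one plausible reading: penalize a later step for lowering the probability assigned to the gold label (an "incorrect modification"):

```python
import torch

def iteration_loss(history, gold, lam=1.0):
    """history: per-step probabilities [P^0, ..., P^T], each of shape (N_a, |L|);
    gold: (N_a,) gold label indices. Cross-entropy is applied at every step; a
    hinge on the gold-label probability penalizes wrong corrections between
    adjacent steps (an assumed form of the paper's margin-ranking term)."""
    n = gold.shape[0]
    idx = torch.arange(n)
    # Sum of per-step cross-entropy losses
    ce = sum(-torch.log(P[idx, gold] + 1e-12).mean() for P in history)
    # Penalty when step i lowers the gold-label probability relative to step i-1
    rank = sum(torch.clamp(prev[idx, gold] - cur[idx, gold], min=0.0).mean()
               for prev, cur in zip(history, history[1:]))
    return ce + lam * rank

# Toy check: two iterations over 4 utterances with 6 emotion labels.
P0 = torch.full((4, 6), 1.0 / 6)
P1 = torch.softmax(torch.randn(4, 6), dim=-1)
gold = torch.tensor([0, 1, 2, 3])
loss = iteration_loss([P0, P1], gold, lam=5.0)
print(loss.item() > 0)   # True: cross-entropy terms are strictly positive
```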

Datasets
We evaluate the performance of our approach on two publicly available datasets, IEMOCAP (Busso et al., 2008) and MELD (Poria et al., 2019).
IEMOCAP The IEMOCAP dataset was collected by the SAIL lab at USC. It consists of approximately 12 hours of multimodal conversation data; we only use the text modality in this paper. The data are grouped into five sessions; we use the first four sessions as the training set and the last one as the test set. Besides, we use 10% of the dialogues of the training set as a validation set. The dataset contains 152 dialogues with a total of 7,433 utterances, and it comes with six emotion categories.
MELD The MELD dataset contains conversations from the Friends TV show transcripts and is a multimodal extension of the EmotionLines dataset (Hsu et al., 2018). In this paper, we only use the text modality. The dataset contains 1,433 dialogues with a total of 13,708 utterances, and it comes with seven emotion categories.

Baselines
We compare our approach with the following baselines:
CNN (Kim, 2014) This is a CNN model trained on context-independent utterances; hence the contextual information is unused in this baseline.
cLSTM (Poria et al., 2017) This is a context-level LSTM model. It uses a CNN to extract context-independent utterance features and an LSTM to capture contextual features for emotion recognition.
cLSTM+CRF This is a modified model based on cLSTM (Poria et al., 2017). We add a CRF (Conditional Random Field) layer after the contextual LSTM so that this baseline can capture the dependencies between emotion labels.
DialogueRNN (Majumder et al., 2019) This is an RNN-based model which uses three GRUs to model the speaker, the context given by the preceding utterances, and the emotion behind the preceding utterances. This baseline sets separate states for each speaker and associates the states with the speaker's utterances. In our experiments, we use the open-source code of DialogueRNN provided by the authors.
DialogueGCN (Ghosal et al., 2019) This is a GCN-based model. It uses a GCN to model the conversation; the nodes in the graph represent utterances, and the types of the edges are determined based on the speaker information. In our experiments, we use the open-source code of DialogueGCN provided by the authors.

Experimental Settings
In our experimental setting, we use the pretrained 840B GloVe embeddings (Pennington et al., 2014) to initialize the 300-dimensional word embedding layer, and we set the emotion embedding dimension to 32. In the utterance encoder, the hidden size of the GRU is 50 for IEMOCAP and 100 for MELD. In the emotion interaction based context encoder, the hidden sizes of the GRU for the two datasets are 132 and 232, respectively.
We use Adam (Kingma and Ba, 2015) to optimize the parameters in our models, and use a minibatch size of 32. To regularize our models, we set the weight decay to 0.0001 and apply dropout with a dropout rate of 0.1. Based on validation performance on IEMOCAP, the learning rate is set to 0.0002, the hyperparameter λ is set to 50, and the maximum iteration number T is set to 3. Based on validation performance on MELD, the learning rate is set to 0.0001, the hyperparameter λ is set to 5, and the maximum iteration number T is set to 2.

Overall Results
We compare our approach with the baseline methods on the IEMOCAP and MELD datasets; the experimental results are shown in Table 1 and Table 2, respectively. We report the F1-score for each emotion class and evaluate the overall performance using the weighted average F1. For each result of our approach, we repeat the experiment 12 times and report the average value. For a fair comparison, we re-run all baseline methods with the same setting; therefore, the results of the baseline methods are slightly different from those in the original papers.

Table 1 presents the results on the IEMOCAP dataset. Among all baseline methods, DialogueGCN achieves the best overall performance of 63.16% weighted F1. In comparison, our approach outperforms DialogueGCN by 1.21%, which preliminarily proves the effectiveness of our proposed approach. In addition, our approach obtains improvements on most emotion classes compared to the baseline methods, although some performance degradation occurs on anger and excited; overall, our approach balances the classes well and improves the overall performance.

Table 2 presents the results on the MELD dataset. The cLSTM model achieves the best overall performance of 59.33% weighted F1 among all baseline methods. In comparison, our approach outperforms cLSTM by 1.39%. Similar to the results on IEMOCAP, our approach also achieves the best performance on most emotion classes. In particular, though the emotion class disgust contains only a few utterances, our approach improves its performance greatly (by about 10%), which shows that our approach has the capability to recognize the emotions of minority classes.
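The overall metric used above, weighted average F1, is the per-class F1 weighted by each class's support. A small scikit-learn illustration on made-up labels (not data from the paper):

```python
from sklearn.metrics import f1_score

# Toy gold/predicted emotion labels; class supports are happy=3, angry=2, sad=1
gold = ["happy", "angry", "angry", "sad", "happy", "happy"]
pred = ["happy", "angry", "sad",   "sad", "happy", "angry"]

# Per-class F1: happy=0.8, angry=0.5, sad=2/3; weighted by support and averaged
score = f1_score(gold, pred, average="weighted")
print(round(score, 4))   # 0.6778
```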

Analysis
Our proposed approach achieves state-of-the-art performance. In this section, we analyze our approach from the following aspects.
Effectiveness of Emotion Interaction We analyze the effectiveness of modeling the emotion interaction on both datasets; the experimental results are shown in Table 3. First, we train a model based on our proposed network without the emotion embedding and the iterative emotion feedback, denoted No Label. This model represents one extreme case in which the emotion labels are not used at all, which is the case for most methods that model the emotion interaction implicitly. Second, we train a model without the iterative emotion feedback but initialize the context emotion representations with the gold emotion labels, denoted Gold Label. This model represents the other extreme case in which the emotion labels are optimally used, which is the best way to explicitly model the emotion interaction but is impossible at inference time. From the results, we can see that: 1) The performance of No Label is the worst and the performance of Gold Label is the best, indicating that explicit modeling has more advantages than implicit modeling. 2) The performance of our approach falls in between, indicating that our iterative improvement mechanism can not only solve the problem of no gold labels but also effectively retain the performance advantages of explicit modeling.

Table 3: An analysis of the effectiveness of emotion interaction on the two datasets. The best values among our models are highlighted in bold.

Impact of Maximum Iteration Number
We plot the performance trends of our approach as the maximum iteration number increases on both datasets. As presented in Figure 3, the performance first increases and then decreases, and the best performance is obtained when the maximum iteration number is 3 for IEMOCAP and 2 for MELD. This result shows that appropriately increasing the maximum iteration number can gradually improve the performance, which is consistent with our expectation. However, too many iterations lead to a decrease in performance. This phenomenon is also consistent with our expectation; one possible explanation is that too many iterations lead to overfitting on the training set.

Analysis of Iterative Correction Behavior
We analyze the iterative correction behavior of our approach when the maximum iteration number is fixed; the performance at each step and the correction behavior between two adjacent steps are shown in Table 4. For the IEMOCAP and MELD datasets, we select the models with maximum iteration numbers of 3 and 2 for analysis, respectively. From the results, we can see that: 1) The performance at each step gradually increases on both datasets, which shows that the iterative improvement mechanism works. 2) Among the changes of the predicted emotion labels between any two adjacent steps on both datasets, the cases that are changed from wrong to right (W → R) are the most frequent. This shows that our approach does make effective emotion prediction corrections in the iterative process.

Case Study
We give a case study to illustrate the effectiveness of our proposed iterative improvement mechanism, as shown in Table 5. We present a sample dialogue from the MELD test set and show the emotion labels predicted by our approach at each step, where the maximum iteration number is set to 2. It can be seen that the emotion prediction result of the first step has more errors, i.e., it is a less accurate result. As the iteration number increases, the prediction errors are gradually corrected. For example, in the 8th utterance said by Joey, the emotion of this utterance is difficult to judge based only on its text, and the prediction of the first step is wrong. However, the prediction of the second step is corrected, owing to the context utterances and the anger emotion of the 10th utterance that was predicted correctly in the first step. This case shows that the iterative improvement mechanism is effective.

[Table 5 excerpt:
Joey: "She goes and makes a date on the same night she has plans with me?" — neutral / anger / anger
10 Joey: "I think she's trying to pull a fast one on Big Daddy." — anger / anger / anger]

Table 5: An example of emotion prediction from the MELD test set output by our approach.

Related Work
Our work focuses on emotion recognition in conversations (ERC), which requires considering several characteristics of conversations. Early works on ERC noticed that the dialogue context can provide more information. Poria et al. (2017) proposed the cLSTM model, which used an LSTM to capture contextual features. Jiao et al. (2019) proposed the HiGRU model, which introduced a word-level GRU and an utterance-level GRU with self-attention and feature fusion. Notably, Zhong et al. (2019) proposed the KET model, which introduced external commonsense knowledge to the ERC task. Qin et al. (2020) proposed the DCR-Net model, which improved the performance of the ERC task through multi-task learning. Recent works found that the states of the speakers and the inter-speaker dependency relations also need to be considered. These works can be divided into two categories: RNN-based models and GCN-based models. RNN-based models include CMN (Hazarika et al., 2018b), ICON (Hazarika et al., 2018a), and DialogueRNN (Majumder et al., 2019). CMN and ICON used different GRUs for the two parties in the conversation and used memory networks to fuse the contextual information. DialogueRNN set separate states for each speaker and associated the states with the speaker's utterances. GCN-based models include ConGCN and DialogueGCN (Ghosal et al., 2019). ConGCN represented each utterance and each speaker as a node and linked the utterances to the speakers by undirected edges. DialogueGCN also used a GCN to model the conversation; the graph constructed by DialogueGCN contains only utterance nodes, but the type of each edge is determined based on the speaker information. Different from these works, we focus on another characteristic, namely the emotion interaction between utterances, and propose an iterative emotion interaction network to explicitly model it. The related works usually model the dialogue context, which only implicitly models the emotion interaction.
Therefore, the motivations and practices of our work are different from the related works.

Conclusion
In this paper, we explicitly model the emotion interaction between utterances in ERC. To solve the problem of having no gold emotion labels at inference time, we propose an iterative emotion interaction network, which uses iteratively predicted emotion labels instead of the gold emotion labels. The network consists of three components: the utterance encoder, the emotion interaction based context encoder, and the iterative improvement mechanism. These components work together to iteratively improve the emotion predictions. Experimental results on two datasets show that our approach achieves state-of-the-art performance, and extensive analysis further proves the effectiveness of our approach.