COSMIC: COmmonSense knowledge for eMotion Identification in Conversations

In this paper, we address the task of utterance level emotion recognition in conversations using commonsense knowledge. We propose COSMIC, a new framework that incorporates different elements of commonsense such as mental states, events, and causal relations, and builds upon them to learn interactions between interlocutors participating in a conversation. Current state-of-the-art methods often encounter difficulties in context propagation, emotion shift detection, and differentiating between related emotion classes. By learning distinct commonsense representations, COSMIC addresses these challenges and achieves new state-of-the-art results for emotion recognition on four different benchmark conversational datasets. Our code is available at https://github.com/declare-lab/conv-emotion.


Introduction
Emotion recognition is a long-standing research problem in Artificial Intelligence (AI). With the growing popularity of conversational AI research, the topic of emotion recognition in conversations has received significant attention from the research community (Ghosal et al., 2019; Zhang et al., 2019). Identifying emotions in conversations is a core step toward fine-grained conversation understanding, which in turn is essential for downstream tasks such as emotion-aware chat agents, visual question answering (Tapaswi et al., 2016; Azab, 2019), health conversations (Althoff et al., 2016; Pérez-Rosas et al., 2017), and others.
Natural conversations are complex as they are governed by several distinct variables that affect the flow of a conversation and the emotional dynamics of the participants. These variables include topic, viewpoint, speaker personality, argumentation logic, intent, and so on (Poria et al., 2019b). Additionally, individual utterances are also governed by the mental state, intent, and emotional state of the participants at the time when they are uttered. In this conversation model, only the utterances can be observed as the conversation unfolds, while other variables such as speaker state and intent remain latent as they are not directly observed by the other participants. Similarly, the emotional state of the speakers cannot be directly observed, but it can be inferred from the utterances that are observable.
Figure 1: Commonsense knowledge can lead to explainable dialogue understanding. It will help models to understand, reason, and explain events and situations. In this particular example, commonsense inference is applied to a sequence of utterances in a two-party conversation. Person A's first utterance indicates that he/she is tired of arguing with person B. The tone of the utterance also implies that person B is getting yelled at by person A, which invokes a reaction of irritation in person B. Person B then asks what he/she can do to help and says this while being angry. This again makes person A annoyed and influences him/her to respond with anger. This kind of inferred commonsense knowledge about the reaction, effect, and intent of the speaker and the listener helps in predicting the emotional dynamics of the participants.
The commonsense knowledge of the participants in a conversation plays a central role in inferring the latent variables of a conversation. It is used to guide the participants through their reasoning about the content of the conversation, dialog planning, decision making, and many other reasoning tasks. It is also used to recognize other finer-grained elements of a conversation, such as avoiding repetition, asking questions, refraining from giving unrelated responses, and so on, all of which control aspects of the conversation such as fluency, interestingness, inquisitiveness, or empathy. Commonsense knowledge is thus necessary to model the nature and flow of the dialogue and the emotional dynamics of the participants. In Figure 1, we illustrate one such scenario where commonsense knowledge is utilized to infer emotions of the utterances in a dialogue.
Natural language is often indicative of one's emotion. Hence, emotion recognition has been enjoying popularity in the field of NLP (Kratzwald et al., 2018; Colnerič and Demšar, 2018), due to its widespread applications in opinion mining, recommender systems, healthcare, and so on. Only in the past few years has emotion recognition in conversation (ERC) gained attention from the NLP community (Yeh et al., 2019; Chen et al., 2018; Zhou et al., 2018) due to the growing availability of public conversational data. ERC can be used to analyze conversations that take place on social media. It can also aid in analyzing conversations in real time, which can be instrumental in legal trials, interviews, e-health services, and more. Unlike vanilla emotion recognition of sentences/utterances, ERC ideally requires context modeling of the individual utterances. This context can be attributed to the preceding utterances, and relies on the temporal sequence of utterances. In contrast to the recently published works on ERC (Chen et al., 2018; Zhou et al., 2018; Qin et al., 2020; Zhong et al., 2019; Zhang et al., 2019), both lexicon-based (Wu et al., 2006; Mohammad and Turney, 2010; Shaheen et al., 2014) and modern deep learning-based (Kratzwald et al., 2018; Colnerič and Demšar, 2018) vanilla emotion recognition approaches fail to work well on ERC datasets, because they ignore conversation-specific factors such as the presence of contextual cues, the temporality in speakers' turns, or speaker-specific information.
In this paper, we introduce COSMIC, a commonsense-guided framework for emotion identification in conversations. By building upon a very large commonsense knowledge base, our proposed framework captures some of the complex interactions between personality, events, mental states, intents, and emotions, leading towards a better understanding of the emotional dynamics and other aspects of conversation. Through extensive evaluations on four different conversation datasets and comparisons with several baselines and state-of-the-art models, we show the effectiveness of a model that explicitly accounts for commonsense. Moreover, feature ablation experiments highlight the role that such knowledge plays in identifying emotion in conversations.

Related Work
Emotion recognition has been an active area of research for many years and has been explored across interdisciplinary fields such as machine learning, signal processing, and social and cognitive psychology (Picard, 2010). The seminal work of Ekman (1993) presented findings on facial expressions, methods to measure facial expression, and their relation to human emotion. Acoustic information and visual cues were later used for emotion recognition by Datcu and Rothkrantz (2014).
However, emotion recognition in conversations has gained popularity only recently due to the emergence of publicly available conversational datasets collected from social media platforms and scripted situations such as movies and TV shows (Poria et al., 2019a; Zahiri and Choi, 2018). The main approach towards conversational emotion recognition is to perform contextual modeling in either a textual or a multimodal setting with deep learning-based algorithms. Poria et al. (2017) used recurrent neural networks for multimodal emotion recognition followed by , where party and global states were used for modeling the emotional dynamics. An external knowledge base was used by Zhong et al. (2019) with transformer networks to perform emotion recognition. Some of the other important works include (Hazarika et al., 2018a,b; Chen et al., 2017; Zadeh et al., 2018a).

Task definition
Given the transcript of a conversation along with speaker information for each constituent utterance, the ERC task aims to identify the emotion of each utterance from a set of pre-defined emotions. Figure 1 illustrates one such conversation between two people, where each utterance is labeled by the underlying emotion. Formally, given an input sequence of $N$ utterances $[(u_1, p_1), (u_2, p_2), \ldots, (u_N, p_N)]$, where each utterance $u_i = [u_{i,1}, u_{i,2}, \ldots, u_{i,T}]$ consists of $T$ words $u_{i,j}$ spoken by party $p_i$, the task is to predict the emotion label $e_i$ of each utterance $u_i$.
In conversational emotion recognition, the task is to classify each of the constituent utterances into its appropriate emotion category. In the literature, the main approach towards this problem has been to first produce context independent representations and then perform contextual modeling. We identify these two distinct modeling phases and aim to improve both of them through the proposed COSMIC framework. Our framework consists of three main stages:
1. Context independent feature extraction from pretrained transformer language models.
2. Commonsense feature extraction from a commonsense knowledge graph.
3. Incorporating commonsense knowledge to design better contextual representations and using it for the final emotion classification.
The overall architecture of the COSMIC framework is illustrated in Figure 2.
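To make the input and output of the task definition concrete, the following minimal sketch represents a conversation as a list of (utterance, party) pairs with one emotion label per utterance. The utterances, labels, and the helper `check_alignment` are invented for illustration and are not taken from any of the datasets or from the released code.

```python
from typing import List, Tuple

# One conversation: (utterance, speaker) pairs in temporal order.
# The utterances and labels below are invented purely for illustration.
Conversation = List[Tuple[str, str]]

conversation: Conversation = [
    ("I am tired of arguing about this.", "A"),
    ("Fine. What do you want me to do?", "B"),
    ("Just listen to me for once.", "A"),
]

# Target: one emotion label e_i per utterance u_i.
# The label set depends on the dataset.
labels: List[str] = ["frustrated", "angry", "angry"]


def check_alignment(conv: Conversation, emotions: List[str]) -> int:
    """Return N after checking that every utterance has exactly one label."""
    if len(conv) != len(emotions):
        raise ValueError("each utterance needs exactly one emotion label")
    return len(conv)


n_utterances = check_alignment(conversation, labels)
```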

Context Independent Feature Extraction
We employ the RoBERTa model (Liu et al., 2019) to extract context independent utterance level feature vectors. We first fine-tune the RoBERTa Large model for emotion label prediction from the transcript of the utterances. RoBERTa Large follows the original BERT Large (Devlin et al., 2018) architecture having 24 layers, 16 self-attention heads in each block and a hidden dimension of 1024, resulting in a total of 355M parameters. Let an utterance $x$ consist of a sequence of BPE tokenized tokens $x_1, x_2, \ldots, x_N$, with emotion label $E_x$. In this setting, the fine-tuning of the pretrained RoBERTa model is realized through a sentence classification task. A special token [CLS] is added at the beginning of the utterance to create the input sequence for the model: $[\mathrm{CLS}], x_1, x_2, \ldots, x_N$. This sequence is passed through the model, and the activation from the last layer corresponding to the [CLS] token is then used in a small feedforward network to classify it into its emotion class $E_x$.
Once the model has been fine-tuned for emotion label classification, we pass the [CLS]-prefixed, BPE tokenized utterances to it and extract the activations from the final four layers corresponding to the [CLS] token. These four vectors are then averaged to obtain the context independent utterance feature vector with a dimension of 1024.
Table 1: Functional notations of commonsense knowledge used in COMET. The functions take as input the utterance u and return the feature indicated in the leftmost column. Intent and effect on speaker and listeners can be categorized into mental states, whereas their reactions are events. Intent is also a causal variable whereas the rest are effects.
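The averaging of the final four [CLS] activations described in this section can be sketched as follows. This is a minimal illustration on plain Python lists, assuming the per-layer hidden states have already been obtained from the fine-tuned model; the function name `cls_feature` and the toy tensor shape are assumptions, not part of the released code.

```python
from typing import List

Vector = List[float]


def cls_feature(layer_outputs: List[List[Vector]], num_layers: int = 4) -> Vector:
    """Average the [CLS] activation (token position 0) over the last layers.

    layer_outputs[l][t] is the hidden vector of token t at layer l.
    """
    if len(layer_outputs) < num_layers:
        raise ValueError("not enough layers")
    cls_vectors = [layer[0] for layer in layer_outputs[-num_layers:]]
    dim = len(cls_vectors[0])
    return [sum(v[d] for v in cls_vectors) / num_layers for d in range(dim)]


# Toy input: 5 layers, 2 tokens, hidden size 3 (the paper's hidden size is 1024).
# Layer l holds the constant vector [l, l, l] at the [CLS] position.
toy = [[[float(l)] * 3, [0.0] * 3] for l in range(5)]
feature = cls_feature(toy)  # averages layers 1, 2, 3, 4
```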

Commonsense Feature Extraction
In this work, we use the commonsense transformer model COMET (Bosselut et al., 2019) to extract the commonsense features. COMET is trained on several commonsense knowledge graphs to perform automatic knowledge base construction. The model is given a triplet {s, r, o} from the graph and is trained to generate the object phrase o from concatenated subject phrase s and relation phrase r. COMET is an encoder-decoder model that uses the pretrained autoregressive language model GPT (Radford et al., 2018) as the base generative model.
To perform the task of generative commonsense knowledge construction, COMET is trained on ATOMIC (The Atlas of Machine Commonsense), a collection of everyday inferential if-then commonsense knowledge organized through textual descriptions. ATOMIC consists of nine different if-then relation types to distinguish agents vs themes, causes vs effects, voluntary vs non-voluntary events, and actions vs mental states. Given an event in which X participates, the nine relation types (r) are inferred as follows: i) intent of X, ii) need of X, iii) attribute of X, iv) effect on X, v) wanted by X, vi) reaction of X, vii) effect on others, viii) wanted by others, and ix) reaction of others. As an example, given an event or subject phrase (s): "Person X gives Person Y a compliment", the inference from COMET for relation phrase (r): intent of X and reaction of others would be "X wanted to be nice" and "Y will feel flattered" respectively.
COMET is a generative model and, as illustrated in the above example, it produces a discrete sequence of commonsense knowledge conditioned on the subject and relation phrase. In our model, however, we make use of continuous vectors of commonsense representations. For that, we take the COMET model pretrained on the ATOMIC knowledge graph and discard the phrase generating decoder module. We treat utterance $U$ as the subject phrase and concatenate it with the relation phrase $r$. Next, we pass the concatenated $\{U \oplus r\}$ through the encoder of COMET and extract the activations from the final time-step. In particular, we use the relations presented in Table 1: intent of X, effect on X, reaction of X, effect on others and reaction of others (where X is the speaker and others are listeners). Performing this feature extraction operation results in five different vectors (one for each of the five relations) for each utterance in the conversation. These vectors are 768-dimensional.
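The extraction procedure above (concatenate the utterance with a relation phrase, encode, keep the final time-step) can be sketched as below. This is only an illustration of the control flow: `toy_encoder` is a hypothetical stand-in that produces deterministic pseudo-activations and is not the COMET encoder, the relation strings follow the ATOMIC relation names, and how the relation is actually tokenized for COMET is an assumption.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]
Encoder = Callable[[List[str]], List[Vector]]  # tokens -> one vector per time-step

# ATOMIC relation names for the five relations used here (x = speaker, o = others).
RELATIONS = ["xIntent", "xEffect", "xReact", "oEffect", "oReact"]


def toy_encoder(tokens: List[str], dim: int = 8) -> List[Vector]:
    """Hypothetical stand-in for the COMET encoder (NOT the real model)."""
    states: List[Vector] = []
    prefix = ""
    for tok in tokens:
        prefix += " " + tok  # each state depends on all tokens seen so far
        digest = hashlib.sha256(prefix.encode("utf-8")).digest()
        states.append([b / 255.0 for b in digest[:dim]])
    return states


def commonsense_features(utterance: str, encode: Encoder = toy_encoder) -> Dict[str, Vector]:
    """For each relation r, encode U (+) r and keep the final time-step activation."""
    tokens = utterance.split()
    return {rel: encode(tokens + [rel])[-1] for rel in RELATIONS}


features = commonsense_features("I am tired of arguing with you")
```

With the real encoder, each of the five returned vectors would be 768-dimensional; the toy dimension of 8 is chosen only to keep the example small.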
The nature of the various relation types in ATOMIC allows us to extend it naturally to conversational frameworks. The relations enable the modeling of phenomena such as content (event, persona, mental states) and causal relations (cause, effect, stative), which are essential elements for understanding conversational context. These different relations are of key importance because generally there is a major interplay between virtually all of them throughout the course of a conversation. For instance, the relations i)-vi) are all intrinsically related to the speaker and vii)-ix) all pertain to the listener. On a more fine-grained level, the intent, effect and react components of the speaker and listener are all elemental for understanding the nature of the conversation. We surmise that adopting these relational variables in a unified framework would be highly useful to create enhanced representations of the conversation.

Commonsense Conversational Model
We first introduce our notations and present a high level view of the main architecture of our COSMIC model. A conversation consists of $N$ utterances $u_1, u_2, \ldots, u_N$, in which $M$ distinct speakers/participants $p_1, p_2, \ldots, p_M$ take part. Utterance $u_t$ is spoken by participant $p_{s(u_t)}$. For every $t \in \{1, 2, \ldots, N\}$, we denote context independent RoBERTa vectors by $x_t$. Commonsense vectors corresponding to intent of X, effect on X, reaction of X, effect on others and reaction of others are denoted by $IS_{cs}(u_t)$, $ES_{cs}(u_t)$, $RS_{cs}(u_t)$, $EL_{cs}(u_t)$, and $RL_{cs}(u_t)$ respectively. X is assumed to be the speaker and others are assumed to be the listeners.
Since conversations are highly sequential in nature and contextual information flows along a sequence, a context state $c_t$ and attention vector $a_t$ are formulated that model the sequential dependency between utterances. The context state and attention vector are always shared between all the participants of the conversation.
An internal state, external state and intent state are used to model different mental states, actions and events for the participants. These are represented by $q_{k,t}$, $r_{k,t}$ and $i_{k,t}$ for the participants $k \in [1, 2, \ldots, M]$. The internal state and the external state can be collectively considered as the speaker state. These states are necessary to capture the complex mental and emotional dynamics of the participants. The emotion state $e_t$ is then modelled from a combination of the three states and the immediately preceding emotion state. Finally, the appropriate emotion class for the utterance is inferred from the emotion state.
In our framework, context and commonsense modeling is performed using GRU cells (Chung et al.). A GRU cell takes as input $y_t$ and updates its hidden state from $h_{t-1}$ to $h_t$ using the transformation $h_t = \mathrm{GRU}(h_{t-1}, y_t)$. The new hidden state $h_t$ also serves as the output of the current step. The cell is parameterized by weights $W$ and biases $b$ of appropriate sizes depending upon the input $y_t$ and output $h_t$. We use five bidirectional GRU cells, $\mathrm{GRU}_C$, $\mathrm{GRU}_Q$, $\mathrm{GRU}_R$, $\mathrm{GRU}_I$, and $\mathrm{GRU}_E$, for modeling the context state, internal state, external state, intent state, and emotion state respectively. For ease of presentation, we formulate the different states with unidirectional GRU cells here.
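For reference, the transformation $h_t = \mathrm{GRU}(h_{t-1}, y_t)$ is not expanded in the text. One common formulation of the GRU update is given below as a reminder; the gate symbols $z_t$ (update gate) and $\rho_t$ (reset gate, named so as not to clash with the external state $r_{k,t}$) and the split of the weights into $W_\ast$, $U_\ast$, $b_\ast$ are notational assumptions, and some libraries swap the roles of $z_t$ and $1 - z_t$ in the last line.

```latex
\begin{aligned}
z_t &= \sigma(W_z y_t + U_z h_{t-1} + b_z) \\
\rho_t &= \sigma(W_\rho y_t + U_\rho h_{t-1} + b_\rho) \\
\tilde{h}_t &= \tanh\bigl(W_h y_t + U_h (\rho_t \odot h_{t-1}) + b_h\bigr) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```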
Context State: The context state stores and propagates the overall utterance-level information along the sequence of the conversation flow. This state is updated using context GRU cell $\mathrm{GRU}_C$ after each time-step $t$ when the utterance is uttered by some participant $p_{s(u_t)}$. RoBERTa feature vector $x_t$, internal state $q_{s(u_t),t-1}$, and external state $r_{s(u_t),t-1}$ of the speaker from the immediate previous time-step (just before uttering the utterance) are concatenated and serve as the input vector for $\mathrm{GRU}_C$.
$$c_t = \mathrm{GRU}_C(c_{t-1}, (x_t \oplus q_{s(u_t),t-1} \oplus r_{s(u_t),t-1})) \tag{1}$$
We also pool attention vector $a_t$ from the history of context $[c_1, c_2, \ldots, c_{t-1}]$ using soft-attention. This attention vector is later used to perform updates on internal and external states.
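The soft-attention pooling over the context history can be sketched as follows. The text does not spell out the scoring function, so the dot-product score against a query vector used here is an assumption, as are the function names and the zero-vector convention for an empty history; the sketch also assumes the query has the same dimension as the context states.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(history: List[Vector], query: Vector) -> Vector:
    """Pool a_t from [c_1, ..., c_{t-1}] with dot-product soft-attention."""
    if not history:
        # No context yet at t = 1 (assumed convention: return a zero vector).
        return [0.0] * len(query)
    scores = [sum(c_d * q_d for c_d, q_d in zip(c, query)) for c in history]
    weights = softmax(scores)
    dim = len(history[0])
    return [sum(w * c[d] for w, c in zip(weights, history)) for d in range(dim)]


history = [[1.0, 0.0], [0.0, 1.0]]
a_t = attend(history, query=[0.0, 0.0])  # equal scores give uniform weights
```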
Internal State: The internal state of the participants is conditioned on how the individual is feeling and on the effect perceived from other participants. This state may remain concealed, as participants may not always explicitly express their feeling or outlook through external stance or reactions. Apart from feelings, this state can also be considered to include aspects that the participant actively tries not to express or features that are considered common knowledge and don't require explicit communication. The effect on oneself is thus elemental to represent the internal state of the participants. We model the internal state of the participants using $\mathrm{GRU}_Q$. For time-step $t$, the internal state of the speaker $p_{s(u_t)}$ is updated by taking into account the attention vector $a_t$ and commonsense vector effect on speaker $ES_{cs}(u_t)$:
$$q_{s(u_t),t} = \mathrm{GRU}_Q(q_{s(u_t),t-1}, (a_t \oplus ES_{cs}(u_t))) \tag{3}$$
For all the other participants apart from the speaker, this update is performed using effect on listeners $EL_{cs}(u_t)$:
$$q_{j,t} = \mathrm{GRU}_Q(q_{j,t-1}, (a_t \oplus EL_{cs}(u_t))); \quad \forall j \neq s(u_t) \tag{4}$$
External State: Unlike the internal state, the external state of the participants is all about the expressions, reactions, and responses. Naturally, this state can be easily seen, felt, or understood by the other participants. For instance, the actual utterance, the manner of articulation, the speech and other acoustic features, the visual expression, gestures, and stance can all be loosely considered to fall under the regime of external state. $\mathrm{GRU}_R$ updates the external state of the speaker $p_{s(u_t)}$ by taking as input the concatenation of attention vector $a_t$, utterance vector $x_t$ and commonsense vector reaction of speaker $RS_{cs}(u_t)$:
$$r_{s(u_t),t} = \mathrm{GRU}_R(r_{s(u_t),t-1}, (a_t \oplus x_t \oplus RS_{cs}(u_t))) \tag{5}$$
For listeners, this update is performed using reaction of listeners $RL_{cs}(u_t)$:
$$r_{j,t} = \mathrm{GRU}_R(r_{j,t-1}, (a_t \oplus x_t \oplus RL_{cs}(u_t))); \quad \forall j \neq s(u_t) \tag{6}$$
Intent State: Intent is a mental state that represents the commitment to carry out a particular set of actions. The intent of the speaker always plays a crucial role in determining the emotional dynamics of a conversation. The intent of the speaker changes from $i_{s(u_t),t-1}$ to $i_{s(u_t),t}$ at time-step $t$. This change is invoked by the commonsense intent of speaker vector $IS_{cs}(u_t)$ and internal speaker state $q_{s(u_t),t}$ at that respective time-step $t$. The intent states are captured by GRU cell $\mathrm{GRU}_I$:
$$i_{s(u_t),t} = \mathrm{GRU}_I(i_{s(u_t),t-1}, (IS_{cs}(u_t) \oplus q_{s(u_t),t})) \tag{7}$$
The intent of the listener(s), however, is kept unchanged:
$$i_{j,t} = i_{j,t-1}; \quad \forall j \neq s(u_t) \tag{8}$$
This is because the intent of a participant who is silent should not change. The change should occur only when the particular participant speaks again.
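The bookkeeping of which commonsense vector drives which state, for the speaker versus the listeners, can be sketched as below. This is a minimal illustration, not the released implementation: `toy_cell` is a hypothetical stand-in for a trained GRU cell, the dictionary keys `IS`, `ES`, `RS`, `EL`, `RL` are shorthand introduced here, and including $x_t$ in the listeners' external-state input is assumed by analogy with the speaker update.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]
States = Dict[str, Vector]
Cell = Callable[[Vector, Vector], Vector]  # (h_{t-1}, input) -> h_t


def toy_cell(h: Vector, y: Vector) -> Vector:
    """Hypothetical stand-in for a trained GRU cell (NOT a real GRU)."""
    drive = sum(y) / len(y)
    return [math.tanh(0.5 * h_i + drive) for h_i in h]


def update_states(speaker: str, q: States, r: States, i: States,
                  a_t: Vector, x_t: Vector, cs: Dict[str, Vector],
                  cell: Cell = toy_cell) -> Tuple[States, States, States]:
    """One time-step of the internal (q), external (r) and intent (i) updates."""
    new_q: States = {}
    new_r: States = {}
    new_i: States = dict(i)  # listeners' intent is carried over unchanged
    for k in q:
        is_speaker = (k == speaker)
        effect = cs["ES"] if is_speaker else cs["EL"]
        react = cs["RS"] if is_speaker else cs["RL"]
        new_q[k] = cell(q[k], a_t + effect)       # list '+' is concatenation
        new_r[k] = cell(r[k], a_t + x_t + react)  # x_t for listeners: assumed
    # The speaker's intent uses the freshly updated internal state.
    new_i[speaker] = cell(i[speaker], cs["IS"] + new_q[speaker])
    return new_q, new_r, new_i


parties = ["A", "B"]
zeros: States = {k: [0.0, 0.0] for k in parties}
cs = {"IS": [0.1], "ES": [0.2], "RS": [0.3], "EL": [-0.2], "RL": [-0.3]}
q1, r1, i1 = update_states("A", zeros, zeros, zeros, a_t=[0.0], x_t=[0.5], cs=cs)
```

In this toy run participant A speaks, so only A's intent state changes, while both participants receive internal and external updates driven by the speaker-side or listener-side commonsense vectors.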
Emotion State: The emotional state determines the emotional mood of the speaker and the emotion class of the utterance. We posit that the emotional state depends upon the utterance and composite state of the speaker that takes into account the internal, external, and intent state. Naturally the current emotion state also depends on the previous emotion state of the speaker. $\mathrm{GRU}_E$ captures the emotion state by combining all of the factors as follows:
$$e_t = \mathrm{GRU}_E(e_{t-1}, (x_t \oplus q_{s(u_t),t} \oplus r_{s(u_t),t} \oplus i_{s(u_t),t})) \tag{9}$$
Emotion Classification: Finally, all the utterances in the conversation are classified with a fully connected network from $e_t$:
$$P_t = \mathrm{softmax}(W_{smax} e_t + b_{smax}); \quad \forall t \in [1, N]$$
We benchmark COSMIC on four different conversational emotion recognition datasets: i) IEMOCAP (Busso et al., 2008), ii) MELD (Poria et al., 2019a), iii) DailyDialog (Li et al., 2017), and iv) EmoryNLP (Zahiri and Choi, 2018). IEMOCAP and DailyDialog are two-party datasets, whereas MELD and EmoryNLP are multi-party datasets. We report experimental results for conversational emotion recognition from the textual information for all four datasets. Information about the datasets is shown in Table 3.
IEMOCAP (Busso et al., 2008) is a dataset of two-person conversations among ten unique speakers. The train set dialogues come from the first eight speakers, whereas the test set dialogues are from the last two. Each utterance is annotated with one of the following six emotions: happy, sad, neutral, angry, excited, and frustrated.
DailyDialog (Li et al., 2017) covers various topics about our daily life and follows the natural human communication approach. All utterances are labeled with both emotion categories and dialogue acts. The emotion can belong to one of the following seven labels: anger, disgust, fear, joy, neutral, sadness, and surprise. The dataset has over 83% neutral labels and these are excluded during Micro-F1 evaluation.
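Excluding the majority neutral class from Micro-F1 can be made concrete with the short sketch below. It follows one common convention (micro-averaging true positives, false positives, and false negatives over all non-neutral labels); whether this matches the exact evaluation script used for the reported numbers is an assumption, and the gold/predicted labels are invented.

```python
from typing import List


def micro_f1_excluding(gold: List[str], pred: List[str], excluded: str = "neutral") -> float:
    """Micro-averaged F1 over all labels except `excluded`."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    # Correct predictions of a non-excluded label.
    tp = sum(1 for g, p in pairs if g == p and g != excluded)
    # Wrong predictions of a non-excluded label.
    fp = sum(1 for g, p in pairs if p != excluded and p != g)
    # Missed gold instances of a non-excluded label.
    fn = sum(1 for g, p in pairs if g != excluded and p != g)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented toy labels: 2 true positives, 1 false positive, 1 false negative.
gold = ["neutral", "joy", "anger", "neutral", "joy"]
pred = ["neutral", "joy", "neutral", "joy", "joy"]
score = micro_f1_excluding(gold, pred)
```

Correctly predicted neutral utterances contribute nothing to the score under this convention, which keeps the dominant class from inflating the metric.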
MELD (Poria et al., 2019a) is a multimodal dataset extended from the EmotionLines dataset (Chen et al., 2018). MELD is collected from the TV show Friends and has more than 1400 dialogues and 13000 utterances. Utterances are labeled with emotion and sentiment classes. The emotion classes belong to anger, disgust, sadness, joy, surprise, fear, or neutral, and the sentiment classes belong to positive, negative or neutral.
EmoryNLP (Zahiri and Choi, 2018) is another dataset based on the TV show Friends. Utterances in this dataset are annotated at two granularities, with seven and with three emotion classes. The seven emotion classes are neutral, joyful, peaceful, powerful, scared, mad and sad. To create the three emotion classes, joyful, peaceful, and powerful are grouped together to form the positive class; scared, mad and sad are grouped together to form the negative class; and the neutral class is kept unchanged.
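The grouping of the seven EmoryNLP classes into three can be written as a simple lookup. The mapping itself is taken from the description above; the names `SEVEN_TO_THREE` and `to_three_class` are introduced here only for illustration.

```python
from typing import Dict, List

# Seven fine-grained EmoryNLP classes mapped to three coarse classes.
SEVEN_TO_THREE: Dict[str, str] = {
    "joyful": "positive",
    "peaceful": "positive",
    "powerful": "positive",
    "scared": "negative",
    "mad": "negative",
    "sad": "negative",
    "neutral": "neutral",
}


def to_three_class(labels: List[str]) -> List[str]:
    """Map a sequence of seven-class labels to the three-class setup."""
    return [SEVEN_TO_THREE[label] for label in labels]


coarse = to_three_class(["mad", "peaceful", "neutral"])
```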

Training Setup
For context independent feature extraction, the RoBERTa model is fine-tuned on the set of all utterances and their emotion labels in the training data. We fine-tune the RoBERTa model with a batch size of 32 utterances using the Adam optimizer with a learning rate of 1e-5. In the case of the MELD and EmoryNLP datasets, we use a residual connection between the first and the penultimate layer, which makes the training of the emotion recognition model more stable. The emotion recognition model is trained with the Adam optimizer with a learning rate of 1e-4.

Baseline and State-of-the-Art Methods
For a comprehensive evaluation of COSMIC, we compare it against the following methods:
CNN (Kim, 2014) is a convolutional neural network model trained on top of pretrained GloVe embeddings. Standard configurations of filter sizes are used. The model is trained at the utterance level to predict the emotion classes.
ICON (Hazarika et al., 2018b) uses two GRU networks to learn the utterance representations for dialogues between two participants. The output of the two speaker GRUs is then connected using another GRU that helps in performing explicit inter-speaker modeling. ICON is limited to conversations with only two participants.
KET (Zhong et al., 2019), or Knowledge Enriched Transformer, dynamically leverages external commonsense knowledge using hierarchical self-attention and context aware graph attention.
ConGCN (Zhang et al., 2019) considers utterances and participants of a conversation as nodes of a graph network and models both context and speaker sensitive dependence for emotion detection.
BERT DCR-Net (Qin et al., 2020) is a deep co-interactive relation network that uses BERT based features for joint dialogue act recognition and emotion (sentiment) classification. A relation layer learns to explicitly model the relation and interaction between these two tasks in a multi-task setting.
BERT+MTL is a multi-task learning framework where features extracted from BERT are used in a recurrent neural network for emotion recognition and speaker identification.
DialogueRNN models the emotion of utterances in a conversation with speaker, context and emotion information from neighbouring utterances. These factors are modeled using three separate GRU networks to keep track of the individual speaker states.
We report and compare the performance of COSMIC on test data in Table 4. State-of-the-art models use GloVe embeddings to extract context independent features. As features extracted from transformer based networks such as BERT and RoBERTa generally outperform traditional word embeddings such as word2vec and GloVe, we also report results of the models when used with BERT or RoBERTa features.
MELD and EmoryNLP: These two datasets have been annotated from the TV show Friends, and utterances are often very short. Although dialogues occasionally contain emotion specific words, this does not happen very often at the utterance level. Naturally, emotion dynamics are highly contextual in nature and almost always depend on surrounding utterances. It has been observed in previous work that emotion modeling in MELD is difficult because there are often many speakers in each conversation, each of whom contributes only a small number of utterances. Sophisticated models such as DialogueRNN do not bring as much improvement over CNN as they do on IEMOCAP. We observe that COSMIC brings a large improvement over other models on the fine-grained (seven class) classification setup for both datasets. On MELD and EmoryNLP respectively, it achieves new state-of-the-art weighted F1 scores of 73.20 and 56.51 on three class classification, and 65.21 and 38.11 on seven class classification.

The Role of Commonsense
In Table 4, we also report results of ablation studies by removing listener-specific and speaker-specific commonsense components. For speaker ablation, we discard $IS_{cs}(u_t)$, $ES_{cs}(u_t)$, and $RS_{cs}(u_t)$, and observe a sharp drop in performance in most cases. For listener ablation, we discard $EL_{cs}(u_t)$ and $RL_{cs}(u_t)$ and find that the performance also drops, but not as much as in the speaker ablation. In fact, listener ablation leads to a slight improvement in performance on EmoryNLP. The results suggest that speaker-specific commonsense has a greater impact on the overall performance of COSMIC, which is expected because we are predicting the emotion class of the speaker at each utterance. Finally, ablating both components at the same time naturally leads to a larger drop in overall performance.

Case Study
We illustrate a case study on a test conversation instance from the IEMOCAP dataset in Figure 3. The conversation begins with a couple of neutral utterances, but then the situation quickly escalates, and finally, it ends with a lot of angry and frustrated utterances from both speakers. State-of-the-art models like DialogueRNN often find this kind of scenario difficult, as there are a couple of sudden emotion shifts in between (neutral to frustrated and then neutral again). These models also tend to misclassify utterances that have subtle differences in emotion classes such as frustrated and angry. In COSMIC, the propagation of commonsense knowledge makes it easier for the model to handle the sudden transitions and to understand the subtle difference between closely related emotion classes. In Figure 3, for the first utterance, the commonsense model predicts that the reaction of speaker is annoyed, and propagation of this information helps in predicting that the speaker's next utterance actually belongs to the frustrated class. Similarly, for the rest of the illustrated utterances, the commonsense knowledge from effect on speaker and reaction of listener helps the model in distinguishing and predicting the angry and frustrated classes correctly.
Figure 3: Case study from the IEMOCAP dataset. Discrete commonsense sequences are shown for more interpretability. Commonsense knowledge helps in predicting emotion shifts and understanding the difference between closely related emotion classes such as angry and frustrated.

Strategies to Incorporate Commonsense
Apart from the five commonsense features that we use in COSMIC (Table 1), there are four other features that can be extracted from COMET: attribute of speaker, need of speaker, wanted by speaker, and wanted by listeners. We incorporated them using different strategies that add extra complexity to our framework but ultimately do not improve the performance by a significant margin. We experimented along the following directions:
• Attribute of speaker is loosely considered as a personality trait. This latent variable influences the internal, external and intent states. We find that the discrete attribute features from COMET are mostly single words such as 'stubborn', 'patient', 'argumentative', 'calm', etc., and they change quite abruptly for the same participant across consecutive utterances. Hence, we find that their vectorized representations do not help much.
• Need of speaker, wanted by speaker, and wanted by listeners are considered as output variables that are to be predicted from the input utterance and the five basic commonsense features (Table 1). We add auxiliary output functions and jointly optimize the emotion classification loss with mean-squared loss between predictions and reference commonsense vectors. This strategy also does not help much in improving the emotion classification performance.
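The joint objective of the second strategy can be written schematically as below. This is only a sketch under stated assumptions: the weighting factor $\lambda$, the set $\mathcal{R}_{aux}$ (need of speaker, wanted by speaker, wanted by listeners), and the symbols $\hat{v}_r(u_t)$ and $v_r(u_t)$ for the predicted and reference commonsense vectors are notation introduced here, and the text does not state how the two loss terms were weighted.

```latex
\mathcal{L} \;=\; \mathcal{L}_{\text{emotion}}
\;+\; \lambda \sum_{r \in \mathcal{R}_{aux}} \frac{1}{N} \sum_{t=1}^{N}
\bigl\lVert \hat{v}_r(u_t) - v_r(u_t) \bigr\rVert_2^2
```

Here $\mathcal{L}_{\text{emotion}}$ denotes the emotion classification loss computed from $P_t$, and the second term is the mean-squared loss between predictions and reference commonsense vectors.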
Although commonsense knowledge improves performance across the datasets, the improvement is not very substantial. In the future, we plan to identify better commonsense knowledge sources and develop models that can infuse this knowledge into deep learning models more efficiently.

Conclusion
In this work, we presented COSMIC, a framework that models various aspects of commonsense knowledge by considering mental states, events, actions, and cause-effect relations for emotion recognition in conversations. Using commonsense representations, our model alleviates issues such as difficulty in detecting emotion shifts and misclassification between related emotion classes that are often present in current RNN and GCN based methods. COSMIC achieves new state-of-the-art results for emotion recognition across several benchmark datasets.