MEISD: A Multimodal Multi-Label Emotion, Intensity and Sentiment Dialogue Dataset for Emotion Recognition and Sentiment Analysis in Conversations

Emotion and sentiment classification in dialogues is a challenging task that has gained popularity in recent times. Humans tend to express multiple emotions with varying intensities while conveying their thoughts and feelings. The emotions in an utterance of a dialogue can be either independent of or dependent on the previous utterances, which makes the task complex and interesting. Multi-label emotion detection in conversations is a significant task that enables a system to understand the various emotions of the users it interacts with. Sentiment analysis in dialogue, on the other hand, helps in understanding the perspective of the user with respect to the ongoing conversation. Along with text, additional information in the form of audio and video assists in identifying the correct emotions, with the appropriate intensity, and the sentiment of an utterance in a dialogue. Lately, quite a few datasets have been made available for dialogue emotion and sentiment classification, but these datasets are imbalanced in their representation of different emotions and label each utterance with only a single emotion. Hence, we first present a large-scale balanced Multimodal Multi-label Emotion, Intensity, and Sentiment Dialogue dataset (MEISD), collected from different TV series, that has textual, audio, and visual features, and then establish a baseline setup for further research.


Introduction
With the advancements in Artificial Intelligence (AI), the gap between Natural Language Processing (NLP) and Computer Vision (CV) has been bridged by extensive research in multimodal information analysis. The ability to use different modalities such as text, audio, and video for tasks such as emotion classification (Tripathi and Beigi, 2018; Hazarika et al., 2018a), sentiment analysis (Poria et al., 2017), and dialogue generation (Yoshino et al., 2019; Das et al., 2017) has helped in building robust systems. The ability to understand the correct emotion and sentiment in a conversation is crucial for developing strong human-machine interaction systems. Dialogue systems are of two types: goal-oriented systems (Asri et al., 2017) and open chit-chat systems (Serban et al., 2017). In both, understanding the user's emotions is crucial to maximizing user experience and satisfaction. Nowadays, there is a huge demand for social agents capable of having real conversations with humans. With the rapid growth in technology, personal assistants such as Amazon's Alexa, Apple's Siri, and Google Home have become human companions. Hence, these applications need to understand the correct emotional state of the user to increase user contentment, leading to user retention.
Emotions and sentiments are subjective qualities and are understood to share overlapping features; hence, the two terms are frequently used interchangeably. This is mainly because both sentiment and emotion refer to experiences resulting from a combination of biological, cognitive, and social influences. Although the two are often treated as the same, according to Munezero et al. (2014), a sentiment is formed and retained for a longer duration, whereas emotions are like episodes that are shorter in length. Moreover, sentiment is mostly target-centric, while emotions are not always directed at a target. Previously, sentiment and emotion have mostly been tackled separately, even though they are distinct but closely related.
Lately, emotion detection and sentiment analysis in multimodal systems using audio, video, and textual features have gained popularity. But both these tasks have not been explored in depth for conversations. The main reason for this is the unavailability of a large-scale multi-modal dialogue dataset labeled with emotions and sentiment to facilitate research in this direction. Also, identifying emotions and sentiments in conversations is a challenging task compared to tweets or sentences. This is mainly because the contextual information or past utterances may influence the emotions of the present utterance. Also, emotional state change among the speakers in a conversation makes it difficult to identify the emotions and sentiment of an utterance in a dialogue.
With the release of the Multimodal EmotionLines Dataset (MELD), research in emotion and sentiment identification in conversations has gained immense attention. This dataset comprises conversations taken from the Friends TV series, labeled with sentiment and emotion using text, audio, and video information, and thus provides multimodal information for classifying emotions and sentiments in dialogues. However, because the dataset is built from a single comedy TV series, its emotion distribution is imbalanced. Human emotions are extremely complex; therefore, it is highly probable that a speaker expresses multiple emotions in a single utterance. It is also likely that the multiple emotions expressed in an utterance are correlated. For example, a speaker may express "anger" and "disgust" together more often than in isolation. Also, the intensity of the different emotions in a given utterance may vary. For example, a speaker may in some cases express "anger" with higher intensity and "disgust" with lower intensity, or vice versa. The MELD dataset is labeled with a single emotion only, and therefore does not provide the complete emotional information of a given utterance.
For building robust emotion and sentiment classification systems, it is crucial to have a balanced dataset labeled with sentiment and multiple emotions, along with their corresponding intensities, to provide the complete affective information of a given utterance. Hence, in this work, we propose a large-scale balanced Multimodal Multi-label Emotion, Intensity, and Sentiment Dialogue (MEISD) dataset labeled with multiple emotions, intensity, and sentiments using textual, audio, and visual information, collected from 10 TV series belonging to different genres. Textual information alone is not enough for understanding emotions, as emotion is also expressed through facial expressions, gestures, pitch, and tone. For example, the utterance "Great, you are here" can exhibit different emotions, such as joy, anger, or surprise. It is therefore difficult to identify the correct emotion using only the textual information, and the sentiment label of such utterances is ambiguous as well. It is essential to simultaneously consider the audio and visual counterparts of these utterances to identify their correct emotion and sentiment labels. An example of a conversation from the MEISD dataset labeled with sentiment and multiple emotions, together with their corresponding intensities, is given in Figure 1. As is evident from the given example, visual information provides additional knowledge for determining the correct emotion and sentiment labels. To the best of our knowledge, this is the first dialogue dataset labeled with multiple emotions, intensity, and sentiment for identifying emotions and sentiments in conversations, and it will hopefully promote further research in this area. The major contributions of our present work are:
• We create a large-scale Multi-label Emotion, Intensity, and Sentiment Dialogue (MEISD) dataset for the task of multiple emotion, intensity, and sentiment classification in conversations.
• We provide strong baselines for the proposed MEISD dataset for all three tasks, viz. multi-label emotion classification, intensity prediction, and sentiment analysis on dialogues.
The rest of the paper is structured as follows. In Section 2, we present a brief survey of the related work. In Section 3, we describe the details of the dataset that we create. In Section 4, we explain the methodology. The experimental setup, along with the evaluation metrics, is reported in Section 5. In Section 6, we present the results along with the necessary analysis. Finally, we conclude in Section 7 with future work directions.

Related Work
Most of the early research on emotion classification and sentiment analysis was performed separately on textual datasets, mostly taken from Twitter (Agarwal et al., 2011; Socher et al., 2013; Colneriĉ and Demsar, 2018; Ghosal et al., 2018). In another work, the authors proposed an RNN framework capable of learning inter-modal interaction among the different modalities using an auto-encoder mechanism. As emotion and sentiment are two very closely related tasks, there is a recent trend of modeling both the sentiment and the emotion of an utterance simultaneously (Akhtar et al., 2019a; Akhtar et al., 2019b; Kumar et al., 2019; Akhtar et al., 2020). In (Akhtar et al., 2020), the authors employed multi-task learning for multimodal affect analysis and explored a contextual inter-modal attention framework that aimed at leveraging the association among neighboring utterances and their multimodal information. With the advancements in Artificial Intelligence (AI), emotion classification and sentiment analysis have become significant tasks due to their importance in many downstream tasks, such as customer behavior modeling, response generation for conversational agents, and multimodal interaction. Hence, to maximize user satisfaction and provide a better experience to the customer, it is important to understand the correct emotion and sentiment of the customer. Recently, multi-label emotion classification has been investigated for textual data in (Kim et al., 2018; He and Xia, 2018; Yu et al., 2018; Huang et al., 2019). Using multiple Convolutional Neural Networks (CNNs) along with self-attention, the authors in (Kim et al., 2018) performed multi-label emotion classification on Twitter data. Similarly, the authors in (Yu et al., 2018) improved the performance of multi-label emotion classification on Twitter data by using transfer learning. Lately, a sequence-to-sequence framework (Huang et al., 2019) has been employed for multi-label emotion classification.
Our present work differs from these single- and multi-label emotion and sentiment classification works, as we focus on classifying emotions and sentiments in dialogues, which requires contextual information from the previous utterances, thereby making the task more challenging and interesting. Every human-machine interaction is grounded in conversations driven by emotions. Hence, identifying the emotion in dialogue is essential for building robust systems capable of such interactions. Recently, emotion detection in conversations has attracted growing interest. The authors in (Chen et al., 2018) released a dataset taken from the Friends TV series for detecting emotions in dialogues. Similarly, in (Yeh et al., 2019), an attention framework was designed for identifying emotions in spoken dialogue systems. In (Hazarika et al., 2018b), memory networks were adopted to capture contextual information for emotion detection in conversations. To capture the contextual information in conversations, DialogueRNN employs three gated recurrent units (GRUs) to effectively model the past utterances of the speaker and the listener in dyadic conversations for emotion detection.
As conversation itself is multimodal, people involved in conversations use various facial expressions, gestures, pitches, and tones to express their feelings, making the conversation dependent on its audio and visual aspects as well. Hence, quite a few multimodal datasets have been employed to identify emotion using audio and visual information. In (Hazarika et al., 2018a), the authors proposed an interactive memory network that extracts multimodal features for emotion classification. The IEMOCAP dataset has been used in (Tripathi and Beigi, 2018) for emotion detection with a deep neural framework that uses the multimodal information at the final layer for emotion identification. Multimodal sentiment analysis has also been investigated for correct classification of sentiments (Poria et al., 2017; Majumder et al., 2018). The authors in (Majumder et al., 2018) proposed a novel hierarchical feature fusion strategy for integrating different modalities, such as audio, video, and text, to identify sentiments. The EmotionLines dataset was extended with audio and visual modalities, resulting in the MELD dataset, for correct identification of emotions and sentiments in conversations. The MELD dataset has been further used for building different neural frameworks that jointly identify emotion and sentiment in conversations (Zhang et al., 2019b; Zhang et al., 2019a). As opposed to these existing works on multimodal emotion and sentiment classification on dialogue data, our present work provides a balanced multimodal multi-label emotion, intensity, and sentiment dataset for the classification of multiple emotions and sentiment in a given utterance.

Multimodal Multi-label Emotion, Intensity and Sentiment Dialogue (MEISD) Dataset
We create the MEISD dataset from 10 famous TV shows belonging to different genres: (i) Comedy: Friends, The Big Bang Theory, How I Met Your Mother, The Office; (ii) Drama: House M.D., Grey's Anatomy, Castle, Game of Thrones, House of Cards, and Breaking Bad. The dataset consists of conversations with utterances from multiple speakers, making it a multi-party conversational dataset. It contains dialogues from most of the episodes across the different seasons of each TV series, giving us a wide variation in dialogues. In total, we have 1000 dialogues from all the TV series in our dataset. First, we obtain the start and end timestamps of every dialogue from the different episodes of the TV series. We extract all the subtitles and transcripts for every dialogue with their respective timestamps. Thereafter, we segment the dialogues into utterances following heuristics similar to prior work: (i) the timestamps of the utterances belonging to a dialogue should always be in increasing order; (ii) the utterances in a particular dialogue should be from the same episode only. Utterances were sometimes grouped together under the same timestamp in the subtitle files. Hence, we use the transcription alignment tool Gentle to extract accurate timestamp information for every utterance, as it automatically aligns the text with the audio by obtaining word-level timestamp information from the audio file. After extracting the timestamps of every utterance in a dialogue, we obtain the corresponding audio and visual clips from the source episodes and extract the audio and video files from these clips. The audio files are then formatted as 16-bit PCM WAV files for further processing. The video files are used to extract 2048-dimensional pooled features from the last convolution block of ResNet-101.
Our final MEISD dataset comprises textual, visual, and audio features, bringing the three important modalities together for effective multi-label emotion, intensity, and sentiment analysis.
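The two segmentation heuristics described above can be sketched as a simple validity check on a candidate dialogue. This is only an illustrative sketch: the `Utterance` class, its field names, and the extra positive-duration check are assumptions made for the example and are not part of the released dataset or its tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    """One utterance; the field names are hypothetical, not the dataset schema."""
    episode_id: str  # illustrative episode identifier
    start: float     # start time in seconds
    end: float       # end time in seconds
    text: str


def is_valid_dialogue(utterances: List[Utterance]) -> bool:
    """Check the two segmentation heuristics on a candidate dialogue."""
    if not utterances:
        return False
    # Heuristic (ii): all utterances come from the same episode.
    if len({u.episode_id for u in utterances}) != 1:
        return False
    # Assumption added for this sketch: every utterance has a positive duration.
    if any(u.end <= u.start for u in utterances):
        return False
    # Heuristic (i): start timestamps are in strictly increasing order.
    starts = [u.start for u in utterances]
    return all(a < b for a, b in zip(starts, starts[1:]))
```

A dialogue whose utterances are out of order or span two episodes is rejected, which mirrors how a segmentation pipeline might filter candidates before alignment.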

Annotation
The utterances in every dialogue of the MEISD dataset are annotated with the appropriate emotion categories and their corresponding intensities. For annotating the dataset, we consider Ekman's (Ekman, 1992) six universal emotions, namely joy, sadness, anger, fear, surprise, and disgust, as emotion labels for the utterances in a dialogue. The emotion annotation list has been extended to incorporate two more labels, namely acceptance and neutral. The "acceptance" emotion has been taken from Plutchik's (Plutchik, 1980) wheel of emotions for utterances expressing this emotion, while the "neutral" label is assigned to utterances expressing no emotion. Every emotion label is accompanied by an intensity value ranging from 1 to 3, with 1 indicating the lowest intensity and 3 the highest. Every utterance in a given dialogue is also labeled with a sentiment label (i.e., positive, negative, or neutral).
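The labeling scheme above can be pictured as a small record type. The following is a minimal sketch under stated assumptions: the class name `UtteranceAnnotation`, its fields, and the validation rules are hypothetical and only encode the label sets and the 1 to 3 intensity range given in the text; they do not describe the actual file format of the dataset.

```python
from dataclasses import dataclass
from typing import Dict

# Label sets as listed in the text.
EMOTIONS = frozenset({"anger", "disgust", "fear", "joy",
                      "acceptance", "neutral", "sadness", "surprise"})
SENTIMENTS = frozenset({"positive", "negative", "neutral"})


@dataclass(frozen=True)
class UtteranceAnnotation:
    """Hypothetical container for the labels of a single utterance."""
    emotions: Dict[str, int]  # emotion label -> intensity in {1, 2, 3}
    sentiment: str

    def __post_init__(self) -> None:
        if not self.emotions:
            raise ValueError("at least one emotion label is required")
        for emotion, intensity in self.emotions.items():
            if emotion not in EMOTIONS:
                raise ValueError(f"unknown emotion: {emotion}")
            if intensity not in (1, 2, 3):
                raise ValueError(f"intensity must be 1, 2, or 3, got {intensity}")
        if self.sentiment not in SENTIMENTS:
            raise ValueError(f"unknown sentiment: {self.sentiment}")
```

For instance, `UtteranceAnnotation({"anger": 3, "disgust": 1}, "negative")` would represent an utterance with strong anger and mild disgust.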
For annotating the utterances in our dataset, we employ four graduate students highly proficient in English comprehension. The annotation guidelines, along with some examples, were explained to the annotators before the annotation process began. As we create a multimodal dataset, the annotation was also carried out in a multimodal manner: the data was annotated not just by looking at the transcripts (textual information) but also by attending to the audio and visual clips of the corresponding utterance. Hence, for every utterance, the annotators were asked to watch the video clip and listen to the audio file along with reading the text before assigning the appropriate emotion and sentiment labels. The annotators were also given the contextual information (text, audio, and video) of a given utterance for reference so that they could provide correct emotion and sentiment labels. A majority voting scheme was used for selecting the final emotion and sentiment labels for each utterance. We achieve an overall Fleiss' (Fleiss, 1971) kappa score of 0.67 for emotions, 0.72 for intensity, and 0.75 for sentiment, which can be considered reliable. The use of audio and visual modalities during annotation helped in obtaining the correct emotion labels with the corresponding intensity for every utterance of a dialogue. The utterances for which the annotators could not reach an agreement on the emotion, intensity, or sentiment labels were removed from the dataset to avoid discrepancies in the data. In Table 1, we show the overall emotion and sentiment distribution of our dataset.
Table 3: Examples from the MEISD dataset showing contrasting emotion and sentiment labels for a given utterance.
As already mentioned, we annotate our dataset with eight emotion labels, i.e., anger, disgust, fear, joy, acceptance, neutral, sadness, and surprise, with an intensity range from 1 to 3, and three sentiment labels, i.e., positive, negative, and neutral.
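To make the agreement figures and the voting step concrete, the sketch below computes Fleiss' kappa and a strict-majority label for single-label ratings such as sentiment. It is an illustration under assumptions: the function names are hypothetical, the strict-majority rule (a 2-2 tie yields no label) is one possible reading of "majority voting", and the exact procedure used for multi-label emotions is not specified in the text.

```python
from collections import Counter
from typing import List, Optional, Sequence


def majority_label(votes: Sequence[str]) -> Optional[str]:
    """Return the label chosen by a strict majority of annotators, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


def fleiss_kappa(ratings: List[Sequence[str]]) -> float:
    """Fleiss' kappa for items that each received the same number of ratings."""
    n = len(ratings[0])  # raters per item
    if n < 2 or any(len(item) != n for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    categories = sorted({label for item in ratings for label in item})
    # counts[i][j] = number of raters assigning category j to item i
    counts = [[Counter(item)[c] for c in categories] for item in ratings]
    num_items = len(counts)
    # Per-item agreement P_i and mean observed agreement.
    p_items = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    p_bar = sum(p_items) / num_items
    # Chance agreement from the overall category proportions.
    p_cat = [sum(row[j] for row in counts) / (num_items * n)
             for j in range(len(categories))]
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:  # only one category was ever used
        return 1.0
    return (p_bar - p_exp) / (1.0 - p_exp)
```

With four annotators, an item rated `["joy", "joy", "joy", "anger"]` resolves to `"joy"`, while a 2-2 split returns `None` and could be treated as a disagreement.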
From the emotion distribution given in Figure 2b, it is evident that the emotion labels are balanced in comparison to the MELD dataset as we have extracted the dialogues from different TV series, hence providing diversity in dialogues. The sentiment labels of the utterances were also annotated along with emotion. The sentiment distribution of both the datasets is given in Figure 2a.
As already mentioned, the MELD dataset labels each utterance with a single emotion, thereby losing the information about other possible emotions present in the utterance. Also, sentiment labels in prior work were derived from the emotion labels: a positive sentiment label was given to utterances having joy as the emotion label, and a negative sentiment label was given to utterances having anger, disgust, sadness, or fear as emotion labels. Only utterances with the surprise emotion label were separately annotated with positive or negative sentiment, as this emotion can fall under either sentiment. In contrast, we ensure that the sentiment is annotated independently, without being biased by the emotion label. From the example given in Table 3, we can see that the sentiment and emotion labels are at times independent: a positive sentiment label can accompany a negative emotion and vice versa. Hence, in preparing our MEISD dataset, we have taken care of these details, as sentiment and emotion depend on the contextual information and the speaker of the utterance. In Table 2, we provide the important statistics of the MEISD dataset. The average duration of an utterance in our dataset is approximately 4 seconds. The average length of an utterance is almost the same across the training, validation, and test sets. The average dialogue length is 20 utterances, and it is the same across the training, validation, and test splits. Every dialogue contains five emotions on average, while the average number of emotions in a given utterance is 2. The presence of multiple speakers and the emotion shift of a speaker make the task of emotion and sentiment analysis very interesting as well as challenging. In Figure 3, we show the emotion shift of a speaker as the dialogue grows.
Figure 3: A dialogue from the MEISD dataset showcasing the emotion shift as the conversation grows. The text in blue represents the sentiment label while the text in red represents the emotion label of every utterance.

Comparison with Related Datasets
Most of the available datasets for multimodal emotion detection and sentiment classification are non-conversational. Examples of such datasets are MOUD (Pérez-Rosas et al., 2013), MOSI (Zadeh et al., 2016), and MOSEI, which have been deeply investigated by researchers for both tasks. Two dyadic conversational datasets, IEMOCAP (Busso et al., 2008) and SEMAINE (McKeown et al., 2011), have gained popularity for encouraging research on emotion detection in conversations. Recently, the MELD dataset was released to inspire research on multiparty conversations using information from different modalities.
IEMOCAP Dataset: The IEMOCAP (Interactive Emotional Dyadic Motion Capture Database) dataset (Busso et al., 2008) comprises videos of dyadic interactions between pairs of 10 speakers, spanning 10 hours and covering different dialogue situations. The utterances are extracted by segmenting the videos, and each utterance is then labeled with fine-grained emotion labels, such as anger, excitement, happiness, frustration, neutral, and sadness. The dataset also provides continuous attributes in the form of valence, activation, and dominance to facilitate better emotion detection. Our MEISD dataset differs substantially from this dataset, as ours is labeled with multiple emotions, intensity, and sentiment categories so that both sentiment and emotion tasks can be performed jointly.
SEMAINE Dataset: The SEMAINE dataset (McKeown et al., 2011) is an audiovisual database designed to engage a person in a continuous and emotional conversation. The conversations in the dataset are interactions between a human and an operator (who can be either a person or a person simulating a machine). In total, there are 150 participants in the dataset and 959 conversations, each lasting about 5 minutes. This dataset is different from our proposed dataset, as we provide multiparty conversations labeled with both sentiment and emotion labels.
MELD Dataset: The Multimodal EmotionLines Dataset (MELD) comprises multiparty conversations taken from the Friends TV series. The dataset has been annotated with 7 emotion labels, namely anger, fear, disgust, surprise, neutral, sadness, and joy. It has also been annotated with three sentiment labels, i.e., positive, negative, and neutral. The dataset comprises 13,000 utterances with textual, audio, and visual information, hence facilitating multimodal research on emotion and sentiment in multiparty conversations.
Our proposed dataset, though it also contains multiparty conversations with multimodal information, is different from the MELD dataset. The dataset that we present here is larger than MELD. The major difference is that we provide multi-label emotion information with the corresponding intensity for the utterances in a dialogue. Our emotion labels are balanced in comparison to the MELD dataset, since we have taken conversations from different TV series. By using TV series belonging to different genres, we provide diversity in our dataset. Hence, every emotion is depicted by various characters, which brings diversity to the way a particular emotion is expressed and makes the task exciting as well as challenging. Comparisons between the existing datasets and our proposed MEISD dataset are given in Table 4.

Experiments
The extraction of features along with the details of the baseline models to evaluate our proposed MEISD dataset is described in this section. We also discuss the metrics used to evaluate the models on the proposed dataset.

Feature Extraction
Textual Features: For textual features, we take the pre-trained 300-dimensional GloVe embeddings (Pennington et al., 2014) of every word as features.
Audio Features: We encode audio tracks with the pre-trained VGGish network (Hershey et al., 2017), which is trained on Audioset (Gemmeke et al., 2017) consisting of 100 million YouTube videos. It has been shown to improve the audio emotion and sentiment classification. We extract audio features of dimension 128 from the last fully connected layer.
Visual Features: Due to computational cost, we consider only the middle frame of the video to extract the visual feature V_k. We use 2048-dimensional pooled features from the last block of ResNet-101 (He et al., 2016) pre-trained on ImageNet (Russakovsky et al., 2014) as visual features.
The bimodal or the multimodal features are obtained by concatenating the respective audio, visual and textual features as needed in the model.
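The concatenation step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `fuse_features` is hypothetical, plain Python lists stand in for tensors, and the utterance-level text vector is assumed to come from whatever text encoder a model uses, so its dimension is left unconstrained.

```python
from typing import List, Optional, Sequence

AUDIO_DIM = 128    # VGGish feature size stated in the text
VISUAL_DIM = 2048  # ResNet-101 pooled feature size stated in the text


def fuse_features(text: Optional[Sequence[float]] = None,
                  audio: Optional[Sequence[float]] = None,
                  visual: Optional[Sequence[float]] = None) -> List[float]:
    """Concatenate whichever modality vectors are provided, in a fixed order."""
    if audio is not None and len(audio) != AUDIO_DIM:
        raise ValueError(f"audio vector must have {AUDIO_DIM} dimensions")
    if visual is not None and len(visual) != VISUAL_DIM:
        raise ValueError(f"visual vector must have {VISUAL_DIM} dimensions")
    fused: List[float] = []
    for vector in (text, audio, visual):  # order: text, audio, visual
        if vector is not None:
            fused.extend(vector)
    if not fused:
        raise ValueError("at least one modality is required")
    return fused
```

Passing only `audio` and `visual` gives a 2176-dimensional bimodal vector; adding a 300-dimensional text vector (for example, an averaged GloVe vector, which is an assumption of this example) gives 2476 dimensions.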

Baseline Models
To provide strong baselines for our MEISD dataset, we perform several experiments with different baseline models, extending existing baselines to multi-label emotion classification and intensity prediction. We model multi-label emotion recognition and sentiment analysis as classification tasks and intensity prediction as a regression task. All implementations use the PyTorch framework. Based on the validation set, we set a threshold value of 0.2 for the classification of multiple emotions in a given utterance. For all the baselines, the final output layer applies a softmax activation function for emotion and sentiment classification and a sigmoid activation function for intensity prediction.
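The thresholding step for multi-label emotion prediction can be illustrated with a short sketch. It assumes that every emotion whose softmax probability is at least 0.2 is selected, and that the most probable emotion is returned when none reaches the threshold; the function names, the label order, the use of `>=`, and the fallback are all assumptions of this example rather than details given in the text.

```python
import math
from typing import List, Sequence

# Label order is illustrative only.
EMOTIONS = ["anger", "disgust", "fear", "joy",
            "acceptance", "neutral", "sadness", "surprise"]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_emotions(logits: Sequence[float], threshold: float = 0.2) -> List[str]:
    """Select every emotion whose softmax probability reaches the threshold."""
    probs = softmax(logits)
    selected = [e for e, p in zip(EMOTIONS, probs) if p >= threshold]
    # Assumed fallback: if nothing passes, keep the single most probable label.
    return selected or [EMOTIONS[probs.index(max(probs))]]
```

Because the probabilities sum to one, a threshold of 0.2 allows at most five labels per utterance, and confident single-emotion predictions naturally yield one label.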
text-CNN: In this approach, we only use the textual information for identifying the emotion and sentiment of every utterance in a dialogue. In this framework, we use the word embeddings of the utterances as input to the convolutional neural network (CNN) (Kim, 2014) for obtaining the sentence representation. In this model, we do not use the contextual information or the additional information from the different modalities for identifying the emotion or sentiment of an utterance.
bcLSTM: This baseline, proposed by (Poria et al., 2017), employs a bi-directional RNN to capture contextual information. It uses a two-step hierarchical mechanism that first captures the unimodal context and then the bi-modal context features. In our setting, we extend it to capture information from all three modalities. A CNN-LSTM approach is used for the unimodal text features, taking the GloVe embeddings as input to the model. For audio representations, we employ an LSTM with every audio feature vector as input. Similarly, for video representations, we employ an LSTM that takes the visual feature vector as input. Finally, the unimodal representations are fed as input to the multimodal framework to identify the corresponding emotion, sentiment, and intensity of the utterance.
DialogueRNN: This baseline is one of the current state-of-the-art approaches for modeling emotions and sentiments in conversations. It is a powerful baseline that models context effectively by tracking individual speaker states throughout the dialogue for correct emotion and sentiment classification. Since DialogueRNN can handle multi-party interactions, it can be applied directly to our proposed MEISD dataset. It utilizes three gated recurrent units (GRUs) to model conversational context for correctly identifying the emotions, intensity, and sentiments in a dialogue.
DialogueRNN + BERT: We propose a stronger baseline built upon DialogueRNN for emotion and sentiment classification and for intensity prediction. We improve the performance of DialogueRNN by using BERT (Devlin et al., 2018) embeddings instead of GloVe embeddings to represent the textual features.

Result and Discussion
In this section, we provide the results for all three tasks, i.e., multi-label emotion classification, intensity prediction, and sentiment analysis, on our proposed MEISD dataset. In Table 5, we provide the results of all three tasks for the different baselines. From the results, it is evident that we achieve an overall weighted F1 score of 62.29% using our proposed baseline, which is built upon DialogueRNN. Using BERT representations as the textual features improves the performance of the model, increasing the F1 score for multi-label emotion classification. In terms of the Jaccard index, which is equivalent to multi-label accuracy, the proposed baseline also shows an improvement, reaching 53.7%. The lower Hamming loss of the proposed baseline likewise indicates better performance on the given task.
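For readers unfamiliar with the two multi-label metrics mentioned above, the following sketch shows how the sample-averaged Jaccard index and the Hamming loss are commonly computed over sets of predicted and gold labels. The function names and the toy inputs are illustrative, and the exact averaging used for the reported numbers is an assumption, since the text does not spell it out.

```python
from typing import List, Set


def jaccard_index(y_true: List[Set[str]], y_pred: List[Set[str]]) -> float:
    """Mean over utterances of |true AND pred| / |true OR pred|."""
    scores = []
    for true, pred in zip(y_true, y_pred):
        union = true | pred
        # Two empty label sets are treated as a perfect match.
        scores.append(len(true & pred) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


def hamming_loss(y_true: List[Set[str]], y_pred: List[Set[str]],
                 num_labels: int) -> float:
    """Fraction of (utterance, label) decisions that are wrong."""
    # The symmetric difference counts labels that are missed or wrongly added.
    errors = sum(len(true ^ pred) for true, pred in zip(y_true, y_pred))
    return errors / (len(y_true) * num_labels)
```

As a small worked example with eight labels, gold sets `[{"anger", "disgust"}, {"joy"}]` and predictions `[{"anger"}, {"joy"}]` give a Jaccard index of 0.75 and a Hamming loss of 1/16 = 0.0625: a higher Jaccard index and a lower Hamming loss both indicate better predictions.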
From the table, we can also infer that using only the audio and video features of an utterance lowers the performance of the model in identifying the correct emotions. Most of the information about the emotions comes from the textual features; hence, the models using only textual features perform far better than the models using only audio and video features as input. Using all the features together boosts the performance of the model. Hence, it can be concluded that the audio and visual counterparts of an utterance assist in identifying its correct emotions. Since our final baseline model improves only the textual representation, its performance on the audio and visual modalities is similar to that of the DialogueRNN baseline. For the intensity prediction task, we report the Pearson correlation coefficient, and the table shows that the final proposed baseline yields the highest score of 0.588 using information from all three modalities.
Table 5 also presents the results of sentiment classification on our dataset for the baselines described in the previous section. Overall, we achieve an F1 score of 69.25% with our DialogueRNN + BERT baseline model. For sentiment as well, BERT helps improve the performance on the individual sentiment labels, thereby enhancing the F1 score of the model. As the sentiment labels occur in almost equal proportion, the performance on each label is almost the same.
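The Pearson correlation coefficient used for intensity prediction can be computed as in the sketch below. The function name `pearson_r` and the toy inputs are illustrative; how predicted and gold intensities are paired across emotions and utterances for the reported score is not detailed in the text.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between predicted and gold values."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

The coefficient lies between -1 and 1, so a score such as 0.588 indicates a moderate positive linear relationship between predicted and annotated intensities.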

Conclusion and Future Work
In this paper, we have introduced MEISD, a large-scale multimodal multiparty conversational dataset for multi-label emotion classification, intensity prediction, and sentiment analysis in conversations. We have described the dataset in detail, along with the entire process of building it. MEISD is a multimodal dataset that has textual, audio, and visual features for every utterance of dialogues taken from 10 different TV series belonging to different genres. It therefore provides diversity with respect to utterances, scene information, characters, and emotional expressions, offering a wide variety of dialogues and making the task all the more challenging. We have evaluated our proposed MEISD dataset and reported the results using strong baselines for all three tasks of emotion recognition, intensity prediction, and sentiment classification. We believe that this dataset can be employed in the future for multi-label emotion, intensity, and sentiment detection in conversations.
In the future, this dataset can be employed for multi-task learning of all three tasks simultaneously in dialogues. As the three tasks are closely related, multi-task learning might improve their performance through shared information. This dataset can also be used for building emotion- and sentiment-aware conversational agents. Furthermore, the multimodal aspect of the dataset can be investigated in depth using different fusion techniques to achieve better multimodal interactions that can help in these tasks. Novel frameworks for capturing contextual information to improve classification on all the tasks can also be investigated in the future.