MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations

Emotion recognition in conversations is a challenging task that has recently gained popularity due to its potential applications. Until now, however, a large-scale multimodal multi-party emotional conversational database containing more than two speakers per dialogue was missing. Thus, we propose the Multimodal EmotionLines Dataset (MELD), an extension and enhancement of EmotionLines. MELD contains about 13,000 utterances from 1,433 dialogues from the TV series Friends. Each utterance is annotated with emotion and sentiment labels, and encompasses audio, visual, and textual modalities. We propose several strong multimodal baselines and show the importance of contextual and multimodal information for emotion recognition in conversations. The full dataset is available for use at http://affective-meld.github.io.


Introduction
Multimodal data analysis exploits information from multiple parallel data channels for decision making. With the rapid growth of AI, multimodal emotion recognition has gained major research interest, primarily due to its potential applications in many challenging tasks, such as dialogue generation, multimodal interaction, and conversational emotion recognition and generation.
Although significant work has been carried out on multimodal emotion recognition (Poria et al., 2017a; Zadeh et al., 2016a; Wollmer et al., 2013) using audio, visual, and text modalities, very few studies actually focus on understanding emotions in conversations, one main reason being the lack of a large multimodal conversational dataset. Recently, Hazarika et al. (2018) proposed a multimodal memory network that can recognize emotion in dyadic dialogues. However, their work is limited to dyadic conversation understanding and thus does not scale to emotion recognition in multi-party conversations with more than two participants.
In this paper, we extend, improve, and further develop the EmotionLines dataset (Chen et al., 2018) for the multimodal scenario. Our dataset, called the Multimodal EmotionLines Dataset (MELD), includes not only textual dialogues but also their corresponding visual and audio counterparts. MELD contains more than 13,000 utterances, which makes it nearly two times larger than existing multimodal conversational datasets such as IEMOCAP (Busso et al., 2008) and SEMAINE (McKeown et al., 2012). Moreover, while both IEMOCAP and SEMAINE are dyadic in nature, MELD contains multi-party conversations, which are often more challenging to classify and at the same time more suitable for developing multimodal affective dialogue systems.
The MELD dataset allows for the exploration of both contextual and multimodal emotion recognition.
Contextual Emotion Recognition. In a conversation, participants utter an utterance largely depending on the context of the conversation; hence, the emotion expressed by an utterance also depends on this context. In particular, we can think of conversational context as a set of parameters that influence a person to utter an utterance with a particular emotion. With the recent surge of research interest in dialogue systems, studies have approached context modeling with different techniques, such as memory networks and RNNs (Hazarika et al., 2018; Poria et al., 2017b; Serban et al., 2017).
As an example, we illustrate the role of context in Figure 1, where both speakers change their emotions as the conversation continues, depending on each other's utterances and expressed emotions. While modeling context in a conversation, complex inter-speaker relations are one of the major challenges (Hazarika et al., 2018). Hazarika et al. (2018) argue that it is not enough to simply use an LSTM, or any other network, that takes all the previous utterances as input and generates a vector to represent the context. According to them, a conversational model should know the speaker of each utterance, and they experimentally showed that this speaker awareness, realized through inter-speaker dependency modeling, produces better context representations for emotion recognition. Their model dynamically attends to the history of utterances by the same speaker or the other speaker for emotion recognition.
Multimodal Emotion Recognition. Conversation in its simplest and most natural form is multimodal: while participating in a conversation, we rely on the other participants' facial expressions, vocal tone, language, gestures, and so on, which help us better understand their stance. Multimodality thus plays a key role in emotion recognition in conversation. For example, when the language alone is too ambiguous to convey the expressed emotion, we often rely on the vocal tone and facial expression.
Several other challenges are involved in multimodal emotion recognition over sequential turns, and the classification of short utterances is one of them. Utterances like "yeah", "okay", and "no" can express different emotions depending on the context and discourse of the dialogue, and emotion changes across the sequence of turns can make context modeling difficult. Since this dataset provides access to multimodal data sources for each dialogue, we hypothesize that these sources will improve the context representation and supplement missing or misleading information in other modalities, thus benefiting overall emotion recognition performance. Short utterances such as those above typically do not express any explicit emotion by themselves, but the speaker's facial expression or intonation in speech could carry important clues for classifying them as non-neutral.
Hence, in order to create conversational AI for emotion recognition or other purposes, it is crucial to utilize both contextual and multimodal information. The publicly available datasets for multimodal emotion recognition in conversations, IEMOCAP and SEMAINE, have some limitations, primarily the relatively small number of utterances and dialogues they contain. The other publicly available multimodal emotion and sentiment recognition datasets are MOSEI (Zadeh et al., 2018b), MOSI (Zadeh et al., 2016b), and MOUD (Pérez-Rosas et al., 2013); however, none of these datasets is conversational. On the other hand, EmotionLines (Chen et al., 2018) is a dataset that contains dialogues from the Friends TV series where more than two speakers participate in a dialogue. EmotionLines can be used as a resource for emotion recognition in text only, as it does not include data from other modalities such as the visual and audio streams.
To test the role of context and multimodality in emotion recognition, this paper also introduces a strong baseline following the method of Poria et al. (2017b), which represents context using an RNN. The baseline results show that both context representation and multimodality help improve performance over non-contextual and unimodal systems.
The rest of the paper is organized as follows: Section 2 presents an overview of related datasets; Section 3 discusses the EmotionLines dataset; we present MELD in Section 3.1; the strong baseline and experiments are elaborated in Section 5; future directions and applications of MELD are covered in Sections 6 and 7, respectively; finally, Section 8 concludes the paper.

Related Datasets
Most of the available datasets in multimodal sentiment analysis and emotion recognition are non-conversational. MOSI (Hazarika et al., 2018; Zadeh et al., 2017), MOSEI (Zadeh et al., 2018a,b), and MOUD (Pérez-Rosas et al., 2013) are such non-conversational datasets that have drawn major research interest. On the other hand, IEMOCAP and SEMAINE are dyadic conversational datasets where each utterance in a dialogue is labeled with an emotion. As these two datasets are the most similar to MELD, we limit the scope of this section to IEMOCAP and SEMAINE.
The SEMAINE database was developed by McKeown et al. (2012). It is a large audiovisual database created for building agents that can engage a person in a sustained, emotional conversation using the Sensitive Artificial Listener (SAL) (Douglas-Cowie et al., 2008) paradigm. SAL is an interaction involving two parties: a 'human' and an 'operator' (either a machine or a person simulating a machine). The interaction is based on two qualities: low sensitivity to the preceding verbal context (the user's words do not dictate whether to continue the conversation) and conduciveness (responding to a phrase by continuing the conversation). There were 150 participants and 959 conversations, each lasting 5 minutes. Each clip was labeled by 6-8 annotators, who eventually traced 5 affective dimensions and 27 associated categories. For the recordings, the participants were asked to talk in turn to four emotionally stereotyped characters: Prudence, who is even-tempered and sensible; Poppy, who is happy and outgoing; Spike, who is angry and confrontational; and Obadiah, who is sad and depressive. Videos were recorded at 49.979 frames per second at a spatial resolution of 780 × 580 pixels and 8 bits per sample, while audio was recorded at 48 kHz with 24 bits per sample. To accommodate research in audio-visual fusion, the audio and video signals were synchronized with an accuracy of 25 microseconds.
The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database was developed by Busso et al. (2008). Ten actors were asked to record their facial expressions in front of cameras; facial markers and head and hand gesture trackers were placed in order to capture facial expressions and head and hand gestures. In total, the dataset contains 10 hours of recordings of dyadic sessions. Each recording expresses one of the emotions happiness, anger, sadness, frustration, or the neutral state. The recorded dyadic sessions were later manually segmented at the utterance level, where an utterance is defined as a continuous segment with one of the actors actively speaking. The acting was based on scripts, hence it was easy to segment the dialogues for utterance detection in the textual part of the recordings. Busso et al. (2008) used two well-known emotion taxonomies to manually label the dataset at the utterance level: discrete categorical annotations (i.e., labels such as happiness, anger, and sadness) and continuous attribute-based annotations (i.e., activation, valence, and dominance). To assess the emotion categories of the recordings, six human annotators were appointed, and the evaluation sessions were organized so that three different annotators assessed each utterance. Self-assessment manikins (SAMs) were also employed to evaluate the corpus in terms of the attributes valence [1-negative, 5-positive], activation [1-calm, 5-excited], and dominance [1-weak, 5-strong]; two more human annotators were asked to estimate the emotional content of the recordings using the SAM system. Having two different annotation schemes can provide complementary information in human-machine interaction systems.
These two types of emotional descriptors provide complementary insights into human emotional expression and the emotional communication between people, which can in turn help develop better human-machine interfaces that automatically recognize and synthesize the emotional cues expressed by humans.
MELD differs from these two datasets in both complexity and quantity. Both IEMOCAP (Busso et al., 2008) and SEMAINE (Schuller et al., 2012) contain only dyadic conversations, whereas the dialogues in MELD are multi-party, and multi-party conversations are more challenging than dyadic ones. MELD also has more than 13,000 emotion-labeled utterances, which is almost double the number of annotated utterances present in IEMOCAP and SEMAINE. In Table 10 we present a comparison among MELD, IEMOCAP, and SEMAINE; we discuss this comparison in more detail in the Comparison with the Related Datasets section.

EmotionLines Dataset
The EmotionLines dataset was developed by Chen et al. (2018). This dataset contains dialogues from the sitcom Friends, where each dialogue contains utterances from multiple speakers. Chen et al. (2018) crawled the dialogues from each episode and grouped them into four groups ([5, 9], [10, 14], [15, 19], and [20, 24]) based on the number of utterances in each dialogue. Finally, 250 dialogues were sampled randomly from each of these groups, resulting in a final dataset of 1,000 dialogues.

Annotation
The utterances in each dialogue were annotated with the most appropriate emotion category. Chen et al. (2018) considered Ekman's six emotions, i.e., Joy, Sadness, Fear, Anger, Surprise, and Disgust, as annotation labels, and extended this list with an additional label, Neutral. They used Amazon Mechanical Turk (AMT), with five workers annotating each utterance. A majority voting scheme was applied to select the final emotion label for each utterance. The overall kappa score of this annotation process was 0.34.
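The majority voting step can be sketched as follows. This is an illustrative reimplementation rather than the authors' actual annotation script, and the tie-handling detail (discarding utterances with no clear winner) is an assumption:

```python
from collections import Counter

def majority_label(annotations):
    """Return the majority emotion label among annotators, or None when
    the top labels tie (no majority agreement). Illustrative sketch of
    the voting scheme, not the authors' exact procedure."""
    counts = Counter(annotations).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no majority -> utterance would be discarded
    return counts[0][0]
```

For example, five votes of ["joy", "joy", "joy", "anger", "neutral"] resolve to "joy", while a tied ["joy", "joy", "anger", "anger", "neutral"] yields no majority.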

Multimodal EmotionLines Dataset (MELD)
We further expand the EmotionLines dataset into a multimodal dataset. The following steps were taken to construct it: 1. The first step is to find the timestamp of every utterance in each of the dialogues present in the EmotionLines dataset.
To accomplish this, we crawled the subtitle files of all the episodes, which contain the beginning and end timestamps of the utterances. This process enabled us to obtain the season ID, episode ID, and timestamp of each utterance in the episode. We imposed two constraints when obtaining the timestamps: (a) the timestamps of the utterances in a dialogue must be in increasing order, and (b) all the utterances in a dialogue must belong to the same episode and scene.
Applying these two constraints revealed that a few dialogues in EmotionLines consist of multiple natural dialogues, and we decided to filter out such instances. One such example from EmotionLines is shown in Table 2: the dialogue contains two natural dialogues, from episode 4 of season 6 and episode 20 of season 5, respectively. Because of this error-correction step, MELD has a different number of dialogues than EmotionLines.
2. We asked three annotators to label each utterance in a dialogue, and majority voting was applied to decide the final label of each utterance. A few utterances that lacked majority agreement among the annotators were removed; the dialogues containing them were also removed to maintain the flow of the dialogues. In total, there were 89 such utterances spanning 11 dialogues.
3. After obtaining the timestamp of each utterance, we extracted their corresponding audiovisual clips from the source episode. Separately, we also extracted the audio content from these video clips.
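The two constraints from step 1 can be sketched as a simple validation function. The dictionary keys (`start`, `episode`, `scene`) are hypothetical names chosen for illustration; the actual crawling code is not part of the released dataset:

```python
def valid_dialogue(utterances):
    """Check the two timestamp constraints from step 1:
    (a) utterance start times strictly increase within the dialogue, and
    (b) all utterances belong to the same episode and scene.
    Each utterance is a dict with hypothetical keys 'start' (seconds),
    'episode', and 'scene'."""
    starts = [u["start"] for u in utterances]
    increasing = all(a < b for a, b in zip(starts, starts[1:]))
    same_source = len({(u["episode"], u["scene"]) for u in utterances}) == 1
    return increasing and same_source
```

A dialogue that mixes utterances from two episodes, like the Table 2 example, fails check (b) and is filtered out.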

Dataset Re-annotation
The utterances in the original EmotionLines dataset were annotated by looking only at their textual content. However, as our focus is to develop a multimodal version of the EmotionLines dataset, we re-annotated all the utterances by asking three annotators to also look at the available video clip of each utterance. A majority voting technique was used to obtain the final label from the three annotations for each utterance. The Fleiss' kappa score of this annotation process was 0.43, which is higher than the kappa of the original EmotionLines annotation, suggesting that the additional modalities are useful during annotation.

Table 4 shows the label-wise comparison between the original EmotionLines and the re-annotated MELD dataset. For most of the utterances, the annotations in MELD match the original EmotionLines annotations. When asked, the annotators confirmed that the video clips helped them in the annotation. One such utterance is "This guy fell asleep!" (shown in Table 3). This utterance was labeled as non-neutral in EmotionLines but, thanks to the available video clip, as anger in MELD. Manually inspecting the video clip of this utterance reveals a very angry and frustrated facial expression along with a raised vocal tone, which are key to recognizing its correct emotion. We thus believe that the surrounding contextual utterances alone were not sufficient for the EmotionLines annotators to label this utterance correctly. This example illustrates that both context and multimodality are important for emotion recognition in conversation.

Dataset Pruning
Many utterances in the subtitle files are grouped under identical timestamps. In order to find an accurate timestamp for each utterance, we use the transcription alignment tool Gentle, which automatically aligns a transcript with the audio by extracting word-level timestamps from the audio (Table 5). In Table 6, we show the format of the MELD dataset.

Dataset Exploration
As mentioned before, we use seven emotions for the annotation, i.e., anger, disgust, fear, joy, neutral, sadness, and surprise. We present the emotion distribution in training, development, and test datasets in Table 7. It can be seen that the emotion distribution in the dataset is not uniform and the majority of utterances are labeled as neutral.
We have also converted these fine-grained emotion labels into more coarse-grained sentiment classes by considering anger, disgust, fear, and sadness as negative, joy as positive, and neutral as the neutral sentiment-bearing class. Surprise is an example of a complex emotion that can be expressed with either positive or negative sentiment; the three annotators who performed the utterance annotation were therefore further asked to annotate the surprise-bearing utterances as either positive or negative. The entire sentiment annotation task had a Fleiss' kappa score of 0.91. The distribution of the positive, negative, and neutral sentiment classes is given in Table 8. Table 9 presents several key statistics of the dataset. The average utterance length, in terms of the number of words per utterance, is almost the same across the training, development, and test sets. On average, three emotions are present per dialogue, and the average duration of an utterance is 3.59 seconds. Emotion shifts of a speaker within a dialogue make the emotion recognition task very challenging: such shifts between successive utterances of the same speaker are frequent in MELD, with 4,003, 427, and 1,003 occurrences in the training, development, and test sets, respectively. Figure 1 shows an example where a speaker's emotion changes over time in the dialogue.
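The emotion-to-sentiment conversion described above can be sketched as a simple mapping. The function name and the `surprise_vote` argument are illustrative; in the actual dataset the ambiguous surprise utterances were resolved by the annotators' votes:

```python
# Mapping from MELD's seven emotion labels to coarse sentiment classes.
EMOTION_TO_SENTIMENT = {
    "anger": "negative", "disgust": "negative", "fear": "negative",
    "sadness": "negative", "joy": "positive", "neutral": "neutral",
}

def to_sentiment(emotion, surprise_vote=None):
    """Convert a fine-grained emotion label to a sentiment class.
    'surprise' is ambiguous and is resolved by the per-utterance
    annotation vote ('positive' or 'negative')."""
    if emotion == "surprise":
        return surprise_vote
    return EMOTION_TO_SENTIMENT[emotion]
```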

Comparison with the Related Datasets
In this section, we compare our proposed MELD dataset with other databases. In particular, we select two datasets, IEMOCAP (Busso et al., 2008) and SEMAINE (Schuller et al., 2012), which are extensively used in this field of research and whose settings are aligned with those of MELD. Table 10 provides the number of available dialogues and their constituent utterances for all three datasets, i.e., IEMOCAP, SEMAINE, and MELD. As seen in the table, MELD contains the largest number of dialogues (and utterances), significantly more than the other two. Figure 2 also indicates this trend for the emotions common to IEMOCAP and MELD: except for sadness, MELD contains a higher number of instances in the respective emotion categories. The predominance of neutral utterances in MELD emulates real-life conversation, where the prevailing emotion is generally neutral. Another key difference is that MELD contains multi-party dialogues, whereas IEMOCAP and SEMAINE comprise dyadic interactions only. This provides a natural setting where multiple speakers can engage, and demands dialogue models that can handle multiple parties.

Strong Baseline

In this section, we discuss the method of feature extraction for the three different modalities: audio, video, and text. We follow the contextual multimodal sentiment analysis approach proposed by Poria et al. (2017b) to obtain the baseline results on MELD.

Textual Feature Extraction
The textual data is obtained from the transcripts of the videos. We apply a deep convolutional neural network (CNN) (Karpathy et al., 2014) to each utterance to extract textual features. Each utterance is represented as an array of pre-trained 300-dimensional GloVe vectors (Pennington et al., 2014), truncated or padded with null vectors to have exactly 50 words. Next, these arrays of vectors are passed through two convolutional layers: the first layer has two filters of sizes 3 and 4, respectively, with 50 feature maps each, and the second layer has a filter of size 2 with 100 feature maps. Each convolutional layer is followed by a max-pooling layer with a 2 × 2 window.
The output of the second max-pooling layer is fed to a fully-connected layer with 500 neurons and a rectified linear unit (ReLU) (Teh and Hinton, 2001) activation, followed by a softmax output. The output of the penultimate fully-connected layer is used as the textual feature. The translation of the convolution filters over the utterance makes the CNN learn abstract features, and with each subsequent layer the context of the features expands further.
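The textual pipeline above can be sketched in NumPy. This is a simplified single-branch version with random (untrained) weights, intended only to illustrate the tensor shapes; the exact two-branch structure, pooling windows, and learned weights of the authors' CNN are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
MAX_LEN, EMB = 50, 300  # utterances padded/truncated to 50 GloVe-300 tokens

def conv_relu(x, filters):
    """Valid 1-D convolution over time followed by ReLU.
    x: (T, D), filters: (F, k, D) -> output (T-k+1, F)."""
    F, k, _ = filters.shape
    T = x.shape[0]
    out = np.stack([
        np.array([np.sum(x[t:t + k] * filters[f]) for t in range(T - k + 1)])
        for f in range(F)
    ], axis=1)
    return np.maximum(out, 0.0)

def max_pool(x, w=2):
    """Non-overlapping max pooling along the time axis."""
    T = (x.shape[0] // w) * w
    return x[:T].reshape(-1, w, x.shape[1]).max(axis=1)

# hypothetical random weights, for shape illustration only
w1 = rng.normal(size=(50, 3, EMB))   # first layer: 50 maps, filter width 3
w2 = rng.normal(size=(100, 2, 50))   # second layer: 100 maps, filter width 2
w_fc = rng.normal(size=(100, 500))   # fully-connected layer, 500 neurons

utterance = rng.normal(size=(MAX_LEN, EMB))      # padded GloVe matrix
h = max_pool(conv_relu(utterance, w1))           # (24, 50)
h = max_pool(conv_relu(h, w2))                   # (11, 100)
feature = np.maximum(h.max(axis=0) @ w_fc, 0.0)  # 500-d textual feature
```

The 500-dimensional `feature` vector corresponds to the penultimate-layer output used as the utterance's textual representation.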

Audio Feature Extraction
The audio feature extraction is performed at a 30 Hz frame rate with a 100 ms sliding window.
We use openSMILE (Eyben et al., 2010), which supports automatic pitch and voice-intensity extraction, for audio feature extraction. Prior to feature extraction, the audio signals are processed with voice-intensity thresholding, in order to filter out audio segments without voice, and with voice normalization, for which we use z-standardization; openSMILE is used to perform both of these steps. Using openSMILE, we extract several low-level descriptors (LLDs), e.g., pitch and voice intensity, and various statistical functionals of them, e.g., amplitude mean, arithmetic mean, root quadratic mean, standard deviation, flatness, skewness, kurtosis, quartiles, inter-quartile ranges, and linear regression slope. The "IS13-ComParE" configuration file of openSMILE is used for this purpose. In total, we extract 6,373 features from each input audio segment.
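To make the notion of "statistical functionals over an LLD contour" concrete, the sketch below computes a handful of them with NumPy over a per-frame pitch-like track. This is illustrative only; it mirrors a few of the ComParE functionals but is not the IS13 configuration and not a substitute for openSMILE:

```python
import numpy as np

def functionals(lld):
    """A few statistical functionals over one low-level descriptor
    contour (e.g., a per-frame pitch track). Illustrative subset of
    the openSMILE ComParE functionals, not the full IS13 config."""
    q1, q2, q3 = np.percentile(lld, [25, 50, 75])
    mean = lld.mean()
    std = lld.std()
    centered = lld - mean
    return {
        "amean": mean,                           # arithmetic mean
        "stddev": std,
        "rqmean": np.sqrt(np.mean(lld ** 2)),    # root quadratic mean
        "skewness": np.mean(centered ** 3) / std ** 3,
        "kurtosis": np.mean(centered ** 4) / std ** 4,
        "quartile1": q1, "quartile2": q2, "quartile3": q3,
        "iqr1-3": q3 - q1,                       # inter-quartile range
        # linear regression slope of the contour over time
        "slope": np.polyfit(np.arange(len(lld)), lld, 1)[0],
    }
```

Applying such functionals to every LLD (and its deltas) is how configurations like IS13-ComParE arrive at thousands of features per segment.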

Context Modeling
Utterances in the videos are semantically dependent on each other; in other words, the complete meaning of an utterance may be determined by taking the preceding utterances into consideration. We call this the context of an utterance. Following Poria et al. (2017b), we use an RNN, specifically a GRU (an LSTM did not perform as well in our experiments), to model the semantic dependency among the utterances in a video. Let f_m = (f_{m,1}, ..., f_{m,N}), with f_{m,t} ∈ R^{d_m}, represent the unimodal features of modality m, where N is the maximum number of utterances in a video; we pad the shorter videos with dummy utterances represented by null vectors of the corresponding length. For each modality m, feeding f_m to GRU_m yields hidden outputs F_{m,t} as context-aware unimodal features. Thus, the context-aware unimodal features are defined as F_m = GRU_m(f_m), where F_m ∈ R^{N×D_m}.
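A minimal GRU that maps a sequence of utterance features to context-aware features can be sketched as follows. The weights here are random, so this only illustrates the recurrence and the (N, d_m) → (N, D_m) shape transformation; it is not the trained baseline model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GRU:
    """Minimal GRU for illustration: maps a (N, d_in) sequence of
    utterance features to (N, d_hid) context-aware features.
    Weights are random; a real model would learn them."""
    def __init__(self, d_in, d_hid, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(d_in + d_hid)
        # stacked weights for the update (z), reset (r), candidate gates
        self.W = rng.uniform(-s, s, (3, d_in, d_hid))
        self.U = rng.uniform(-s, s, (3, d_hid, d_hid))
        self.b = np.zeros((3, d_hid))
        self.d_hid = d_hid

    def __call__(self, seq):
        h = np.zeros(self.d_hid)
        out = []
        for x in seq:  # one utterance feature vector per step
            z = sigmoid(x @ self.W[0] + h @ self.U[0] + self.b[0])
            r = sigmoid(x @ self.W[1] + h @ self.U[1] + self.b[1])
            h_cand = np.tanh(x @ self.W[2] + (r * h) @ self.U[2] + self.b[2])
            h = (1 - z) * h + z * h_cand
            out.append(h)
        return np.stack(out)  # F_m, shape (N, d_hid)
```

Each hidden state F_{m,t} summarizes utterance t together with its preceding context, which is exactly what the baseline uses as the context-aware unimodal feature.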

Fusion
We then fuse F_A and F_T into a multimodal feature space. To obtain the fused representation f_AT of the two modalities, we simply concatenate F_A and F_T utterance-wise, following Poria et al. (2017b). More sophisticated fusion methods such as Tensor Fusion (Zadeh et al., 2017) are left for future work.
Finally, f_AT is fed to a contextual GRU, i.e., GRU_AT, which incorporates the contextual information contributed by the surrounding utterances.
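The concatenation-based fusion amounts to a single NumPy call. The shapes below are hypothetical, chosen only to show that the fused feature dimension is the sum of the unimodal dimensions:

```python
import numpy as np

# Hypothetical shapes: N utterances, context-aware audio features F_A
# of size D_a and textual features F_T of size D_t per utterance.
N, D_a, D_t = 6, 8, 5
rng = np.random.default_rng(1)
F_A = rng.normal(size=(N, D_a))
F_T = rng.normal(size=(N, D_t))

# utterance-wise concatenation: f_AT has shape (N, D_a + D_t)
f_AT = np.concatenate([F_A, F_T], axis=1)
```

The resulting `f_AT` sequence would then be passed through the contextual GRU_AT, e.g., the minimal GRU class above could be reused for illustration.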

Classification and Training
The training of this network is performed using categorical cross-entropy on each utterance's softmax output per dialogue:

loss = -(1 / Σ_{i=1}^{M} L_i) Σ_{i=1}^{M} Σ_{j=1}^{L_i} Σ_{c} y_{i,j,c} log ŷ_{i,j,c},

where M = total number of dialogues in the dataset, L_i = number of utterances in the i-th dialogue, y_{i,j,c} = original (one-hot) output for class c, and ŷ_{i,j,c} = predicted output for class c for the j-th utterance of the i-th dialogue.
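The loss can be sketched directly from its definition: sum the per-utterance cross-entropies over all dialogues and normalize by the total number of utterances. This is a reference computation, not the training code itself:

```python
import numpy as np

def dialogue_cross_entropy(y_true, y_pred):
    """Categorical cross-entropy averaged over all utterances of all
    dialogues. y_true and y_pred are lists (one entry per dialogue) of
    (L_i, C) arrays holding one-hot labels and softmax outputs."""
    total_loss, total_utts = 0.0, 0
    for yt, yp in zip(y_true, y_pred):
        total_loss += -np.sum(yt * np.log(yp))
        total_utts += yt.shape[0]
    return total_loss / total_utts
```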
As a regularization method, dropout is introduced between the GRU cell and the dense layer to avoid overfitting. We use Adam (Kingma and Ba, 2014) as the optimizer and tune the hyperparameters on the development data. Early stopping with a patience of 10 is used during training.

Baseline Results
In Tables 11 and 12, we show the baseline results obtained with the method explained in the Strong Baseline section. As can be seen, the multimodal system (66.68% F-score) outperforms both the text-only and audio-only systems. However, the improvement due to fusion is only 0.3% over the textual modality, which suggests the need for a better fusion mechanism; we leave that for future work. The textual modality outperforms the audio modality by more than 17%, which indicates the importance of spoken language in sentiment analysis. For the positive sentiment category, the audio modality performs poorly; it would be interesting to analyze which cues specific to positive sentiment-bearing utterances in MELD the audio modality fails to capture. In the future, we will use more advanced, state-of-the-art audio feature extractors to improve classification performance.

In the case of emotion classification, performance is poor on the disgust, fear, and sadness classes. We believe this is because the numbers of training instances for disgust, fear, and sadness are very low, as shown in Table 7. Nevertheless, this performance acts as a baseline, and future work should aim at outperforming it. We observed a high misclassification rate among the anger, disgust, and fear categories, since these emotions differ only subtly, making disambiguation harder. Overall, the emotion classification results are worse than the sentiment classification results, which is expected as emotion classification involves more classes. As in sentiment classification, the textual classifier (56.44% F-score) outperforms the audio classifier (39.08%) by more than 17% absolute. Multimodal fusion improves emotion recognition performance by 2.81%; however, the multimodal classifier performs worse than the textual classifier on the sadness class.
All unimodal and multimodal classifiers perform very poorly on the disgust and fear classes, which we suspect is due to the very limited training data available for these two emotions.
Role of Context. One of the main purposes of MELD is to support building AI that utilizes conversational context for emotion recognition; we discussed the importance of contextual information in the introduction. Tables 11 and 12 show that the improvement over non-contextual models, e.g., text-CNN, which only uses a CNN (see Section 5.1.1), is 1.4% to 2.5%. A significant improvement was also observed for the audio modality.

Future Directions
There are a number of interesting future directions of this work.
• First, the proposed baselines do not consider the presence of multiple speakers in a conversation. We believe that speaker-specific utterance encoding can enhance the quality of the context representation, which can in turn improve emotion recognition performance.
• Another future direction is the extraction of visual features. As part of the dataset, we have released the raw videos and audio, which will facilitate the feature extraction process. Notably, the baseline audio features did not help much to improve the baseline performance, so enhanced audio feature extraction is also a significant future research direction.
• We have only used concatenation for audio and textual feature fusion. As the experimental results show, the multimodal baseline outperforms the unimodal baselines by only 0.3-1%. This again justifies the need for superior fusion methods such as Tensor Fusion (Zadeh et al., 2017, 2018b).

Applications of this dataset
The use cases of this dataset are as follows: • As discussed before, this dataset is useful for training a conversational emotion recognition classifier, which can be plugged into any dialogue system to generate empathetic responses, similar to Zhou et al. (2017). For example, it can be used for emotion modeling of the users in the Twitter persona dataset (Li et al., 2016). As this dataset is multimodal, it is also possible to integrate it with a multimodal dialogue system.
• This dataset should not be used to train an end-to-end dialogue system because of its size (see Table 1): the training set contains only 9,989 utterances, which is not enough to train a well-performing dialogue system. However, the mechanism used to construct this dataset can easily be applied to develop a multimodal dialogue dataset based on Friends or any other TV series, such as Breaking Bad. We define a multimodal dialogue system as a platform where the system has access to the speaker's voice and facial expressions, which it exploits to generate responses. Multimodal dialogue systems can be very useful for real-time personal assistants such as Siri and Google Assistant, where users can use both voice and text to communicate with the assistant.

Conclusion
In this work, we propose a multimodal multi-party conversational emotion recognition dataset called MELD, developed by extending the original EmotionLines dataset, and we provide solid baseline results on it. The dataset contains the raw videos and audio, which will be useful for extracting new audio and visual features; along with these, we have also released the features used in our baseline experiments. We believe this dataset will also be useful as a training corpus for multimodal empathetic response generation, and MELD has strong potential to advance conversation understanding research. Future work on this dataset should focus on extracting new features and outperforming the baseline results presented in this paper. MELD is publicly available for research purposes at https://affective-meld.github.io/.