Exploiting Unsupervised Data for Emotion Recognition in Conversations

Emotion Recognition in Conversations (ERC) aims to predict the emotional state of speakers in conversations, which is essentially a text classification task. Unlike sentence-level text classification, the ERC task has limited supervised data available, which potentially prevents models from reaching their full potential. In this paper, we propose a novel approach that leverages unsupervised conversation data, which is far more accessible. Specifically, we propose the Conversation Completion (ConvCom) task, which attempts to select the correct answer from candidate answers to fill a masked utterance in a conversation. We then Pre-train a basic COntext-Dependent Encoder (PRE-CODE) on the ConvCom task. Finally, we fine-tune the PRE-CODE on the datasets of ERC. Experimental results demonstrate that pre-training on unsupervised data achieves significant performance improvement on the ERC datasets, particularly on the minority emotion classes.


Introduction
Emotion recognition in conversations (ERC) has garnered attention recently (Poria et al., 2019), due to its potential in developing practical chatting machines (Zhou et al., 2018a). Unlike traditional text classification, which handles context-free sentences, ERC aims to predict the emotional state of each utterance in a conversation (Figure 1). The inherent hierarchical structure of a conversation, i.e., words-to-utterance and utterances-to-conversation, determines that the ERC task is better addressed by context-dependent models (Poria et al., 2017; Hazarika et al., 2018b; Jiao et al., 2019, 2020).
Despite the remarkable success, context-dependent models suffer from the data scarcity issue. In the ERC task, annotators are required to recognize either obvious or subtle differences between emotions and tag each instance with a specific emotion label, such that supervised data with human annotations are very costly to collect. In addition, existing datasets for ERC (Busso et al., 2008; Zahiri and Choi, 2018; Zadeh et al., 2018) contain inadequate conversations, which prevents the context-dependent models from reaching their full potential.

[Figure 1: An example conversation in which each utterance is annotated with an emotion label (Angry, Surprised, Angry, Neutral), e.g., "You sprayed my front twice! You never turned?", "No! I barely even got to three Mississippi.", "Mississippi? I said count to five."]

1 The source code is available at https://github.com/wxjiao/Pre-CODE
In this paper, we aim to tackle the data scarcity issue of ERC by exploiting unsupervised data. Specifically, we propose the Conversation Completion (ConvCom) task based on unsupervised conversation data, which attempts to select the correct answer from candidate answers to fill a masked utterance in a conversation. Then, on the proposed ConvCom task, we Pre-train a basic COntext-Dependent Encoder (PRE-CODE). The hierarchical structure of the context-dependent encoder makes our work different from those that focus on universal sentence encoders (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019). Finally, we fine-tune the PRE-CODE on five datasets of the ERC task. Experimental results show that the fine-tuned PRE-CODE achieves significant performance improvement over the baselines, particularly on minority emotion classes, demonstrating the effectiveness of our approach. Our contributions in this work are as follows: (1) We propose the conversation completion task for the context-dependent encoder to learn from unsupervised conversation data. (2) We fine-tune the pre-trained context-dependent encoder on the datasets of ERC and achieve significant performance improvement over the baselines.
Pre-training Strategy

Approach

ConvCom Task. We exploit the self-supervision signal in conversations to construct our pre-training task. Formally, given a conversation U = {u_1, u_2, ..., u_L}, we mask a target utterance u_l as U\u_l = {..., u_{l-1}, [mask], u_{l+1}, ...} to create a question, and try to retrieve the correct utterance u_l from the whole training corpus. The choice of filling the mask involves countless possible utterances, making it infeasible to formulate the task as a multi-class classification problem with a softmax over all of them. We instead simplify the task into a response selection task (Tong et al., 2017) using negative sampling (Mikolov et al., 2013), which is a variant of noise-contrastive estimation (NCE, Gutmann and Hyvärinen, 2010). To do so, we sample N − 1 noise utterances from elsewhere, along with the target utterance, to form a set of N candidate answers. The goal is then to select the correct answer, i.e., u_l, from the candidate answers to fill the mask, conditioned on the context utterances. We term this task "Conversation Completion", abbreviated as ConvCom. Figure 2 shows an example, where the utterance u_4 is masked out from the original conversation and the candidate answers include u_4 and two noise utterances.
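The construction of a single ConvCom instance can be sketched in a few lines of Python. This is a toy illustration, not the released code: the function name, the toy conversation, and the candidate count are ours.

```python
import random

def make_convcom_instance(conversation, l, corpus_utterances, num_negatives=2, seed=0):
    """Build one ConvCom example: mask the target utterance u_l and pair it
    with negative-sampled noise utterances (all names here are illustrative)."""
    rng = random.Random(seed)
    question = conversation[:l] + ["[mask]"] + conversation[l + 1:]
    target = conversation[l]
    # Sample noise utterances from elsewhere in the corpus, excluding the target.
    pool = [u for u in corpus_utterances if u != target]
    candidates = [target] + rng.sample(pool, num_negatives)
    rng.shuffle(candidates)
    return question, candidates, candidates.index(target)

conv = ["Hey!", "You sprayed my front twice!", "You never turned?",
        "No! I barely even got to three Mississippi.", "Mississippi? I said count to five."]
corpus = conv + ["Totally unrelated line.", "Another distractor.", "Yet another one."]
question, candidates, answer_idx = make_convcom_instance(conv, 3, corpus)
```

The model is then trained to pick `candidates[answer_idx]` given `question` as context.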
Context-Dependent Encoder. The context-dependent encoder consists of two parts: an utterance encoder and a conversation encoder. Each utterance is represented by a sequence of word vectors X = {x_1, x_2, ..., x_T}, initialized by the 300-dimensional pre-trained GloVe word vectors 2 (Pennington et al., 2014). For the utterance encoder, we adopt a BiGRU to read the word vectors of an utterance and produce the hidden states:

h_t = BiGRU(x_t, h_{t-1}), t ∈ [1, T].

We apply max-pooling and mean-pooling on the hidden states of all words. The pooling results are summed up, followed by a fully-connected layer, to obtain the embedding of the utterance, termed u_l:

u_l = FC(MaxPool({h_t}) + MeanPool({h_t})), l ∈ [1, L],

where T denotes the length of the utterance and L is the number of utterances in the conversation. For the conversation encoder, since an utterance can express different meanings in different contexts, we adopt another BiGRU to model the utterance sequence of a conversation and capture the relationship between utterances. The produced hidden states are termed H_l = [→H_l; ←H_l], l ∈ [1, L].

Pre-training Objective. To train the context-dependent encoder on the proposed ConvCom task, we construct a contextual embedding for each masked utterance by combining its context from the history →H_{l-1} and the future ←H_{l+1} (see Figure 3):

û_l = FC([→H_{l-1}; ←H_{l+1}]).

Then, the contextual embedding û_l is matched against the candidate answers to find the most suitable one to fill the mask. To compute the matching score, we adopt the dot-product with a sigmoid function:

s(û_l, u_{a_n}) = σ(û_l · u_{a_n}),

where σ(x) = 1/(1 + exp(−x)) ∈ (0, 1) is the sigmoid function, and u_{a_n} is the embedding of the n-th candidate answer. The goal is to maximize the score of the target utterance and minimize the scores of the noise utterances. Thus the loss function becomes:

L = − Σ_l [ log σ(û_l · u_{a_1}) + Σ_{n=2}^{N} log(1 − σ(û_l · u_{a_n})) ],

where a_1 corresponds to the target utterance, and the summation over l goes over each utterance of all the conversations in the training set.
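The matching score and the negative-sampling loss described above can be illustrated numerically in plain Python. This is a minimal sketch with toy embeddings; the helper names are ours, and the real model computes the embeddings with the BiGRU encoders.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def match_score(context_emb, candidate_emb):
    # s = sigmoid(û_l · u_a): dot product squashed into (0, 1).
    return sigmoid(sum(c * a for c, a in zip(context_emb, candidate_emb)))

def convcom_loss(context_emb, target_emb, noise_embs):
    """Negative-sampling objective for one masked position: push the
    target's score toward 1 and the noise utterances' scores toward 0."""
    loss = -math.log(match_score(context_emb, target_emb))
    for noise in noise_embs:
        loss -= math.log(1.0 - match_score(context_emb, noise))
    return loss

ctx = [0.5, -0.2, 0.1]                  # toy contextual embedding û_l
target = [0.6, -0.1, 0.2]               # toy embedding of the target answer
noise = [[-0.4, 0.3, -0.5], [0.0, 0.9, -0.2]]  # toy noise embeddings
loss = convcom_loss(ctx, target, noise)
```

A well-trained encoder should give the target a higher score than the noise candidates, which drives the loss down.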

Experiment
Dataset. Our unsupervised conversation data comes from the open-source database OpenSubtitle 3 (Lison and Tiedemann, 2016), which contains a large number of subtitles of movies and TV shows. Specifically, we retrieve the English subtitles throughout the year of 2016, and collect 25,466 .html files. After pre-processing, we obtain 58,360, 3,186, and 3,297 conversations for the training, validation, and test sets, respectively.
Evaluation. To evaluate the pre-trained model, we adopt the following evaluation metric:

R_N@k = Σ_{i=1}^{k} y_i / Σ_{i=1}^{N} y_i,

which is the recall of the true positives among the k best-matched answers from N available candidates for the given contextual embedding û_l (Zhou et al., 2018b). The variable y_i represents the binary label for each candidate, i.e., 1 for the target one and 0 for the noise ones. Here, we report R_5@1, R_5@2, R_11@1, and R_11@2.

Results. Table 1 lists the results on the test set. The SMALL CODE is able to select the correct answer for 70.8% of instances with 5 candidate answers, and 56.2% with 11 candidates. This accuracy is considerably higher than random guessing, i.e., 1/5 and 1/11, respectively. By increasing the model capacity to MID and LARGE, we further improve the recalls by several points successively. These results demonstrate that CODE is indeed able to capture the structure of conversations and performs well on the proposed ConvCom task.
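The R_N@k metric above is simple to compute: rank the candidates by matching score and check whether the target lands in the top k. A minimal sketch (function name and toy scores are ours):

```python
def recall_at_k(scores, labels, k):
    """R_N@k: recall of the true positives among the k best-matched
    answers out of N candidates (here, one true positive per question)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(labels[i] for i in ranked[:k])
    return hits / sum(labels)

# One toy question with N = 5 candidates; candidate 2 is the target.
scores = [0.1, 0.3, 0.9, 0.2, 0.4]   # matching scores s(û_l, u_a)
labels = [0, 0, 1, 0, 0]             # 1 for the target, 0 for the noise ones
r5_at_1 = recall_at_k(scores, labels, 1)
```

Averaging this quantity over all questions in the test set gives the reported recalls.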

Fine-tuning Strategy

Experimental Setup

ERC Architecture. To transfer the pre-trained CODE models, termed PRE-CODE, to the ERC task, we only need to add a fully-connected (FC) layer followed by a softmax function to form the new architecture. Figure 4 shows the resulting architecture, in which we also concatenate the context-independent utterance embeddings with the contextual ones before feeding them to the FC layer.
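The added classification head is small. The sketch below shows the idea in plain Python with toy dimensions: concatenate the context-independent utterance embedding with its contextual embedding, then apply one FC layer and a softmax. All shapes and names here are illustrative, not the actual implementation.

```python
import math
import random

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def erc_head(utt_emb, ctx_emb, weight, bias):
    """Fine-tuning head: concatenated features -> FC layer -> softmax."""
    features = utt_emb + ctx_emb  # list concatenation = vector concat
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weight, bias)]
    return softmax(logits)

rng = random.Random(0)
utt, ctx = [0.2, -0.1], [0.4, 0.3]                       # toy embeddings
W = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(4)]  # 4 emotion classes
b = [0.0] * 4
probs = erc_head(utt, ctx, W, b)
```

The argmax of `probs` is the predicted emotion for the utterance.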
We adopt a weighted categorical cross-entropy loss function to optimize the model parameters:

L = − Σ ω(c) Σ_{j=1}^{|C|} o_j log ô_j,

where |C| is the number of emotion classes, o_j is the one-hot vector of the true label, and ô_j is the softmax output. The weight ω(c) is inversely proportional to the ratio of class c in the training set, with a power rate of 0.5.

ERC Datasets. We conduct experiments on five datasets for the ERC task, namely, IEMOCAP (Busso et al., 2008), Friends, EmotionPush, EmoryNLP (Zahiri and Choi, 2018), and MOSEI (Zadeh et al., 2018). For MOSEI, we pre-process it to adapt it to the ERC task and name the pre-processed dataset MOSEI* here. See Appendix A.3 for details of the ERC datasets.
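The class weights ω(c) can be computed directly from the label counts. A short sketch (the function name and the toy label distribution are ours):

```python
from collections import Counter

def class_weights(train_labels, power=0.5):
    """ω(c) is inversely proportional to the ratio of class c in the
    training set, raised to a power rate of 0.5 as described above."""
    counts = Counter(train_labels)
    total = len(train_labels)
    return {c: (n / total) ** (-power) for c, n in counts.items()}

# Toy imbalanced training distribution.
labels = ["neutral"] * 60 + ["joy"] * 25 + ["anger"] * 10 + ["sadness"] * 5
w = class_weights(labels)
```

The power rate of 0.5 softens the re-weighting so minority classes are boosted without overwhelming the majority classes.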
Evaluation. To evaluate the performance of our models, we report the macro-averaged F1-score (Zahiri and Choi, 2018) and the weighted accuracy (WA) of all emotion classes. The F1-score of each emotion class is also presented for discussion.
Results. We train the implemented baselines and fine-tune the PRE-CODE on the five datasets. Each result is the average of 5 repeated experiments. See Appendix A.3 for training details. We report the main results in Table 2 and Table 3. As seen, our PRE-CODE outperforms the compared methods on all datasets in terms of F1-score, with at least 2.0% absolute improvement. We also conduct significance tests using two-tailed paired t-tests over the F1-scores of PRE-CODE and CODE-MID. The p-values are 0.0107, 0.0038, 0.0011, 0.0003, and 0.0068 for IEMOCAP, EmoryNLP, MOSEI*, Friends, and EmotionPush, respectively. Therefore, the result on IEMOCAP is statistically significant at the 0.05 level, while the results on the other four datasets are significant at the 0.01 level. This demonstrates the effectiveness of transferring knowledge from unsupervised conversation data to the ERC task.
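For reference, the paired t-statistic underlying such tests compares per-run score differences against their variability. Below is a self-contained sketch with hypothetical F1-scores (the numbers are invented for illustration, not the paper's actual runs):

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """t-statistic of a paired t-test over matched repeated runs:
    t = mean(d) / (std(d) / sqrt(n)), with d the per-run differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical F1-scores from 5 repeated runs of each model.
pre_code = [0.61, 0.63, 0.60, 0.64, 0.61]
code_mid = [0.58, 0.59, 0.57, 0.60, 0.58]
t = paired_t_statistic(pre_code, code_mid)
```

With n = 5 runs (4 degrees of freedom), a two-tailed test at the 0.05 level requires |t| > 2.776.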
To inspect which aspects pre-training helps the most, we present the F1-score of each emotion class on IEMOCAP and EmoryNLP in Figure 5. As seen, our PRE-CODE particularly improves the performance on minority emotion classes, e.g., anger and sadness in IEMOCAP, and peaceful and sad in EmoryNLP. These results demonstrate that pre-training can ameliorate the issue of imbalanced performance on minority classes while maintaining good performance on majority classes.

Discussion
Model Capacity. We investigate how the model performance is affected by the number of parameters, as seen in Table 4. We find that: (1) PRE-CODE consistently outperforms CODE in all cases, suggesting that pre-training is an effective way to boost the model performance of ERC regardless of the model capacity. (2) PRE-CODE performs better at the SMALL and MID capacities; we speculate that the datasets for ERC are so scarce that they are incapable of adapting the pre-trained parameters of the LARGE PRE-CODE to optimal ones for ERC.
Layer Effect. We study how different pre-trained layers affect the model performance, as seen in Table 5. CODE+Pre-U denotes that only the parameters of the utterance encoder are initialized from PRE-CODE. Comparing CODE to CODE+Pre-U and then to PRE-CODE, we conclude that pre-training yields better utterance embeddings and helps the model capture the utterance-level context more effectively. In addition, PRE-CODE+Re-W denotes that we re-train PRE-CODE for 10 more epochs to adjust the originally fixed word embeddings. The results suggest that pre-training the word embeddings does not necessarily improve the model performance and may even corrupt the learned utterance and conversation encoders.
Qualitative Study. In Table 6, we provide two examples comparing CODE and PRE-CODE. The first example is from Friends, with consecutive utterances from Joey. It shows that CODE tends to recognize utterances with exclamation marks ("!") as Angry, and those with periods (".") as Neutral. The problem also appears with PRE-CODE for short utterances, e.g., "Push!", which carry little and potentially misleading information. This issue might be alleviated by adding other

Conclusion
In this work, we propose a novel approach that leverages unsupervised conversation data to benefit the ERC task. The proposed conversation completion task is effective for pre-training the context-dependent model, which is then fine-tuned to significantly boost the performance of ERC. Future directions include exploring advanced models (e.g., TRANSFORMER) for pre-training, conducting domain matching for the unsupervised data, and applying multi-task learning to alleviate the possible catastrophic forgetting issue in transfer learning.

A.1 Related Work
Pre-training on unsupervised data has been an active area of research for decades. Mikolov et al. (2013) and Pennington et al. (2014) led the trend of learning dense word embeddings over raw text for downstream tasks. Melamud et al. (2016) propose to learn word embeddings in context with an LSTM, which helps eliminate word-sense ambiguity. More recently, ELMo (Peters et al., 2018) extracts context-sensitive features through a language model and integrates the features into task-specific architectures, achieving state-of-the-art results on several major NLP tasks. Unlike these feature-based approaches, another trend is to pre-train an architecture through a language model objective and then fine-tune it on supervised downstream tasks (Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2019). With trainable parameters, this kind of approach is more flexible, attaining better performance than its feature-based counterparts. However, the idea of pre-training a context-dependent encoder on unsupervised conversation data for the ERC task has never been explored. On one hand, existing works on ERC focus on modeling the speakers, context, and emotion evolution (Poria et al., 2017; Hazarika et al., 2018a,b; Jiao et al., 2019, 2020); no prior work has tried to solve the issue of data scarcity. On the other hand, existing works on transfer learning focus on pre-training universal sentence encoders, e.g., ELMo, GPT, and BERT, whereas our PRE-CODE, beyond the sentence level, is dedicated to sentence sequences from conversations or speeches. As a result, the pre-training task needs to be customized, for which we propose the ConvCom task. Partially inspired by Word2vec (Mikolov et al., 2013) and the response selection task (Tong et al., 2017), our ConvCom task differs in that it must model the order of the context while both historical and future context are provided.
In contrast, Word2vec neglects the order of context words, and the response selection task usually provides only historical context.

A.2 Pre-training Strategy

Dataset. Our unsupervised conversation data comes from the open-source database OpenSubtitle (Lison and Tiedemann, 2016), which contains a large number of subtitles of movies and TV shows. Specifically, we retrieve the English subtitles throughout the year of 2016, including 25,466 .html files. We extract the text subtitles from all the .html files and pre-process them as below:

• For each episode, we remove the first and the last ten utterances, in case they are instructions rather than conversations, especially in TV shows;

• We split the conversations in each episode randomly into shorter ones with five to one hundred utterances, following a uniform distribution;

• A short conversation is removed if over half of its utterances contain fewer than eight words each. This is done to force the conversation to carry more information;

• All the short conversations are randomly split into a training set, a validation set, and a test set, following the ratio of 90:5:5.

Table 7 lists the statistics of the resulting sets, where #Conversation denotes the number of conversations in a set, Avg. #Utterance is the average number of utterances in a conversation, and Avg. #Word is the average number of tokens in an utterance. In total, there are over 2 million utterances in over 60k conversations, which is at least 100 times more than the datasets for ERC (see Table 8).
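The pre-processing steps above can be sketched as a small filter pipeline. This is an illustrative reconstruction, not the released script: the function name, parameter names, and the toy episode are ours.

```python
import random

def preprocess_episode(utterances, rng, min_len=5, max_len=100, min_words=8):
    """Sketch of the pre-processing above: drop the first/last ten
    utterances, split the rest into random-length chunks, and discard
    chunks in which over half the utterances are short."""
    body = utterances[10:-10]  # remove likely instructions, not conversation
    conversations = []
    i = 0
    while i < len(body):
        length = rng.randint(min_len, max_len)  # uniform split length
        chunk = body[i:i + length]
        i += length
        if len(chunk) < min_len:
            break  # leftover tail too short to keep
        short = sum(1 for u in chunk if len(u.split()) < min_words)
        if short * 2 <= len(chunk):  # keep only if at most half are short
            conversations.append(chunk)
    return conversations

rng = random.Random(0)
episode = ["word " * 9 + "end"] * 120  # 120 ten-token utterances
convs = preprocess_episode(episode, rng)
```

The resulting conversations would then be pooled and split 90:5:5 into training, validation, and test sets.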
Noise Utterances. We randomly sample ten noise utterances for each utterance in the training, validation, and test sets. Within each set, a conversation shares the ten noise utterances sampled from elsewhere in the same set. During training, we can either use the pre-selected noise utterances or sample an arbitrary number of noise utterances dynamically. We use the validation set to choose model parameters, and evaluate the model performance on the test set.

Training Details. We choose Adam (Kingma and Ba, 2015) as the optimizer with an initial learning rate of 2 × 10^-4, which is decayed with a rate of 0.75 once the validation recall R_11@1 stops increasing. We use a dropout rate of 0.5 for both the utterance encoder and the conversation encoder. Gradient clipping with a norm of 5 is also applied to avoid gradient explosion. Each conversation in the training set is regarded as a batch, in which each utterance plays the role of target utterance in turn. We randomly sample 10 noise utterances for each conversation during training and validate the model every epoch. The CODE is pre-trained for at most 20 epochs, and early stopping with a patience of 3 is adopted to choose the optimal parameters. Note that we fix the word embedding layer during pre-training to focus on the utterance encoder and the conversation encoder.
A.3 Fine-tuning Strategy

ERC Datasets. Our PRE-CODE and the implemented baselines are fine-tuned on five ERC datasets, namely, IEMOCAP 5 (Busso et al., 2008), Friends 6, EmotionPush 7, EmoryNLP 8 (Zahiri and Choi, 2018), and MOSEI 9 (Zadeh et al., 2018). For MOSEI, we pre-process it to adapt to the ERC task and name the pre-processed dataset MOSEI* here. Specifically, we utilize the raw transcripts of MOSEI, in which over 14k utterances are not annotated and the others are labeled with one or more emotion labels. We simply remove the unlabeled utterances from the dataset. For each utterance with more than one emotion label, we determine its primary emotion by majority vote, or by the highest emotion-intensity sum if there is more than one majority vote. For the utterances that obtain zero votes for all emotion classes, we annotate them as other.

5 https://sail.usc.edu/iemocap/
6 http://doraemon.iis.sinica.edu.tw/emotionlines
7 http://doraemon.iis.sinica.edu.tw/emotionlines
8 https://github.com/emorynlp/emotion-detection/
9 http://immortal.multicomp.cs.cmu.edu/raw_datasets/
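The label-resolution rule for MOSEI* can be sketched as follows. This is our illustrative reconstruction of the rule described above; the function name and the toy annotations are ours, not MOSEI's format.

```python
def primary_emotion(votes, intensities):
    """Resolve a multi-label utterance to one primary emotion:
    majority vote first, then the highest emotion-intensity sum to
    break ties; zero votes everywhere maps to 'other'."""
    if not any(votes.values()):
        return "other"
    top = max(votes.values())
    tied = [e for e, v in votes.items() if v == top]
    if len(tied) == 1:
        return tied[0]
    return max(tied, key=lambda e: intensities.get(e, 0.0))

# Toy annotation: joy and sadness tie on votes; sadness wins on intensity.
votes = {"joy": 2, "sadness": 2, "anger": 1}
intensities = {"joy": 1.2, "sadness": 2.5, "anger": 0.4}
```

For example, `primary_emotion(votes, intensities)` resolves this utterance to sadness.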
For the first three datasets, we follow previous work (Poria et al., 2017) and consider only four emotion classes, i.e., anger, joy, sadness, and neutral. We consider all the emotion classes for EmoryNLP, as in Zahiri and Choi (2018), and six emotion classes (without neutral) for MOSEI*. All the datasets contain training, validation, and test sets, except for IEMOCAP. Thus, we follow Poria et al. (2017) and use the first four sessions of transcripts as the training set and the last one as the test set. The validation set is extracted from the randomly shuffled training set with a ratio of 80:20. We present the statistics of the datasets in Table 8.
Training Details. We still choose Adam as the optimizer and tune the learning rate for the implemented baselines. Generally, a learning rate of 2 × 10^-4 works well for all the datasets except MOSEI*, on which we find 5 × 10^-5 works better. For the fine-tuning of PRE-CODE, we use the learning rate of the baselines or half of it, and report the better results here. We monitor the macro-averaged F1-score on the validation set and decay the learning rate once the F1-score stops increasing. The decay rate and the patience of early stopping are 0.75 and 6 for all the datasets except IEMOCAP. Since IEMOCAP has much fewer conversations, we change the decay rate and the patience of early stopping to 0.95 and 10, respectively.
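The decay-on-plateau schedule with early stopping can be replayed on a list of validation scores. This is a minimal sketch under one assumption of ours: the learning rate is decayed once per non-improving epoch (the paper does not spell out the exact trigger granularity).

```python
def fine_tune_schedule(val_f1_history, lr=2e-4, decay=0.75, patience=6):
    """Replay the schedule described above: decay the learning rate
    when the validation F1 stops increasing, and stop early once it
    has failed to improve `patience` consecutive times."""
    best, bad_epochs = float("-inf"), 0
    for f1 in val_f1_history:
        if f1 > best:
            best, bad_epochs = f1, 0
        else:
            bad_epochs += 1
            lr *= decay
            if bad_epochs >= patience:
                break  # early stopping
    return lr, best

# Toy validation curve: improves, plateaus, then drifts down.
history = [0.50, 0.55, 0.54, 0.56, 0.56, 0.55, 0.55]
final_lr, best_f1 = fine_tune_schedule(history)
```

On this toy curve, four non-improving epochs occur, so the learning rate is decayed four times and the best F1 of 0.56 is kept.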