Seq2Emo: A Sequence to Multi-Label Emotion Classification Model

Multi-label emotion classification is an important task in NLP and is essential to many applications. In this work, we propose a sequence-to-emotion (Seq2Emo) approach, which implicitly models emotion correlations in a bi-directional decoder. Experiments on SemEval’18 and GoEmotions datasets show that our approach outperforms state-of-the-art methods (without using external data). In particular, Seq2Emo outperforms the binary relevance (BR) and classifier chain (CC) approaches in a fair setting.

Early work treats this task as multi-class classification (Scherer and Wallbott, 1994; Mohammad, 2012), where each data instance (e.g., a sentence) is assumed to be labeled with one and only one emotion. More recently, researchers have relaxed this assumption and treat emotion analysis as multi-label classification (MLC, Demszky et al., 2020). In this case, each data instance may have one or multiple emotion labels. This is a more appropriate setting for emotion analysis, because an utterance may exhibit multiple emotions (e.g., "angry" and "sad", "surprise" and "joy").
The binary relevance approach (BR, Godbole and Sarawagi, 2004) is widely applied to multi-label emotion classification. BR predicts a binary indicator for each emotion individually, assuming that the emotions are independent given the input sentence. However, evidence in psychotherapy suggests strong correlation among different emotions (Plutchik, 1980). For example, "hate" may co-occur more often with "disgust" than with "joy." An alternative approach to multi-label emotion classification is the classifier chain (CC, Read et al., 2009). CC predicts the label(s) of an input in an autoregressive manner, for example, by a sequence-to-sequence (Seq2Seq) model (Yang et al., 2018). However, Seq2Seq models are known to have the problem of exposure bias (Bengio et al., 2015), i.e., an error at early steps may affect future predictions.
In this work, we propose a sequence-to-emotion (Seq2Emo) approach, where we consider emotion correlations implicitly. Similar to CC, we also build a Seq2Seq-like model, but predict a binary indicator of an emotion at each decoding step of Seq2Seq. We do not feed predicted emotions back to the decoder; thus, our model does not suffer from the exposure bias problem. Compared with BR, our Seq2Emo model implicitly considers the correlation of emotions in the hidden states of the decoder, and with an attention mechanism, Seq2Emo is able to focus on different words in the input sentence that are relevant to the current emotion. We evaluate our model for multi-label emotion classification on the SemEval'18 and GoEmotions (Demszky et al., 2020) benchmark datasets. Experiments show that Seq2Emo achieves state-of-the-art results on both datasets (without using external data). In particular, Seq2Emo outperforms both BR and CC in a fair, controlled comparison.

Related work
Emotion classification is an active research area in NLP. It classifies text instances into a set of emotion categories, e.g., angry, sad, happy, and surprise. Well-accepted emotion categorizations include the six basic emotions in Ekman (1984) and the eight primary emotions in Plutchik's wheel of emotions (1980). Early work uses manually constructed emotion lexicons for the emotion classification task (Tokuhisa et al., 2008; Wen and Wan, 2014). Such lexicon resources include WordNet-Affect (Strapparava and Valitutti, 2004), EmoSenticNet (Poria et al., 2014), and the NRC Emotion Intensity Lexicon (Mohammad, 2018).
Distant supervision (Mintz et al., 2009) has been applied to emotion classification, as researchers find that existing labeled datasets are too small for training an emotion classifier. For example, Mohammad (2012) finds that social media users often use hashtags to express emotions, and thus certain hashtags can be directly regarded as the noisy label of an utterance. Likewise, Felbo et al. (2017) use emojis as noisy labels for emotion classification. Such distant supervision can also be applied to pretrain emotion-specific embeddings and language models (Tang et al., 2014; Ghosh et al., 2017).
In addition, Yu et al. (2018) apply multi-task learning to combine polarity sentiment analysis and multi-label emotion classification with dual attention.
Different from the above studies that use extra emotional resources, our work focuses on modeling the correlations among emotions. This improves multi-label emotion classification without using additional data. The work most similar to ours is the Sequence Generation Model (SGM, Yang et al., 2018). SGM accomplishes multi-label classification by an autoregressive Seq2Seq model, and is an adaptation of classifier chains (Read et al., 2009) in the neural network regime. Our approach models emotion correlation implicitly through decoder hidden states and does not suffer from the drawbacks of autoregressive models.

Methodology
Consider a multi-label emotion classification problem. Suppose we have $K$ predefined candidate emotions, and an utterance or a sentence $x$ can be assigned one or more emotions. We represent the target labels as $y = (y_1, \cdots, y_K) \in \{0, 1\}^K$, with $y_i = 1$ representing that the $i$th emotion is on.
Our Seq2Emo is a Seq2Seq-like framework, as shown in Figure 1. It encodes $x$ with an LSTM, and iteratively performs binary classifications over $y_i$ with another LSTM as the decoder.
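As a small illustration of this label representation, the following sketch maps a set of emotion names to a multi-hot vector and back. The emotion inventory `EMOTIONS` and the helper names are hypothetical and chosen only for this example; the actual label sets are defined by the datasets.

```python
# Hypothetical emotion inventory (an assumption for illustration only);
# the real label sets come from the datasets.
EMOTIONS = ["anger", "joy", "sadness", "surprise"]


def to_multi_hot(labels, emotions=EMOTIONS):
    """Map a set of emotion names to y in {0, 1}^K, following a fixed order."""
    unknown = set(labels) - set(emotions)
    if unknown:
        raise ValueError(f"unknown emotions: {sorted(unknown)}")
    return [1 if e in labels else 0 for e in emotions]


def from_multi_hot(y, emotions=EMOTIONS):
    """Recover the emotion names whose indicator is on."""
    return [e for e, y_i in zip(emotions, y) if y_i == 1]


y = to_multi_hot({"anger", "sadness"})
print(y)                  # [1, 0, 1, 0]
print(from_multi_hot(y))  # ['anger', 'sadness']
```

An instance with several emotions simply has several entries set to 1, which is what distinguishes this setting from multi-class classification.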
Encoder. We use a two-layer bi-directional LSTM to encode an utterance. Specifically, we use both token-level and contextual pretrained embeddings to represent a word in the sentence.
Formally, let a sentence be $x = (x_1, \cdots, x_M)$. We first encode each word $x_i$ with GloVe embeddings (Pennington et al., 2014), denoted by $\text{GloVe}(x_i)$. We further use the ELMo contextual embeddings (Peters et al., 2018), which process the entire sentence $x$ with a pretrained LSTM. The corresponding hidden state is used as the embedding representation of a word $x_i$ in its context. This is denoted by $\text{ELMo}(x)_i$.
We use a two-layer bi-directional LSTM on the above two embeddings. The forward LSTM, for example, has the form
$$h^{\overrightarrow{E}}_i = \text{LSTM}^{\overrightarrow{E}}\big([\text{GloVe}(x_i); \text{ELMo}(x)_i],\; h^{\overrightarrow{E}}_{i-1}\big).$$
Other pretrained models, such as the Transformer-based BERT (Devlin et al., 2019), may also be adopted. This, however, falls outside the scope of our paper, as we mainly focus on multi-label emotion classification. Empirical results on the GoEmotions dataset show that, by properly addressing multi-label classification, our model outperforms a Transformer-based model (Table 2).
Decoder. In Seq2Emo, an LSTM-based decoder is used to make sequential predictions on every candidate emotion. Suppose a predefined order of emotions is given, e.g., "angry," "joy," and "sad." The decoder will perform a binary classification over these emotions in sequence. The order, in fact, does not affect our model much, as it is the same for all training samples and can be easily learned. In addition, we feed a learnable emotion embedding as input at each step of the decoder. This enhances the decoder by explicitly indicating which emotion is being predicted at a step.
Different from a traditional Seq2Seq decoder, we do not feed previous predictions back as input, so as to avoid exposure bias. This also allows Seq2Emo to use a bi-directional LSTM as the decoder, which implicitly models the correlation among different emotions.
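Without loss of generality, we explain the forward direction of the decoder LSTM, denoted by $\text{LSTM}^{\overrightarrow{D}}$. The hidden state at step $j$ is given by
$$h^{\overrightarrow{D}}_j = \text{LSTM}^{\overrightarrow{D}}\big([e_j; \tilde{h}^{\overrightarrow{D}}_{j-1}],\; h^{\overrightarrow{D}}_{j-1}\big),$$
where $e_j$ is the embedding for the $j$th emotion, and $\tilde{h}^{\overrightarrow{D}}_{j-1}$ is calculated by the attention mechanism in Luong et al. (2015).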
Here, the attention mechanism dynamically aligns source words when predicting the specific target emotion at a decoding step. Let $\alpha^{\rightarrow}_{j,i}$ be the attention probability of the $j$th decoder step over the $i$th encoder step, computed by
$$\alpha^{\rightarrow}_{j,i} = \frac{\exp(s^{\rightarrow}_{j,i})}{\sum_{i'=1}^{M} \exp(s^{\rightarrow}_{j,i'})}, \qquad s^{\rightarrow}_{j,i} = \big(h^{\overrightarrow{D}}_j\big)^\top W^{\rightarrow}_a h^E_i,$$
where $M$ is the number of encoder steps, and $s^{\rightarrow}_{j,i}$ computes an unnormalized score for each pair of $h^{\overrightarrow{D}}_j$ and $h^E_i$ with a learnable parameter matrix $W^{\rightarrow}_a$. Then, we compute an attention-weighted sum of encoder hidden states as the context vector $c^{\rightarrow}_j$:
$$c^{\rightarrow}_j = \sum_{i=1}^{M} \alpha^{\rightarrow}_{j,i}\, h^E_i.$$
The context vector is concatenated with the LSTM hidden state as
$$\tilde{h}^{\overrightarrow{D}}_j = [c^{\rightarrow}_j; h^{\overrightarrow{D}}_j].$$
The backward direction of the decoder yields $\tilde{h}^{\overleftarrow{D}}_j$ in the same way. The two are further concatenated for predicting the emotion in question:
$$P(y_j = 1 \mid x) = \sigma\big(w_j^\top [\tilde{h}^{\overrightarrow{D}}_j; \tilde{h}^{\overleftarrow{D}}_j] + b_j\big),$$
where $\sigma$ is the sigmoid function; $w_j$ and $b_j$ are the parameters for predicting the $j$th emotion. Notice that $w_j$ and $b_j$ differ across decoding steps, because we are predicting different emotions. This treatment is similar to the binary relevance approach (BR, Godbole and Sarawagi, 2004).
Our Seq2Emo implicitly models the correlations among emotions through the decoder's bi-directional LSTM hidden states, which is more suited to multi-label classification than BR's individual predictions. Our Seq2Emo also differs from the classifier chain approach (CC, Read et al., 2009), which uses softmax to predict the next plausible emotion from all candidates. Thus, CC has to feed the previous predictions as input, and suffers from the exposure bias problem. By contrast, we predict the presence of all the emotions in sequence. Hence, feeding back previous predictions is not necessary, and this prevents exposure bias. In this sense, our model combines the merits of both BR and CC.
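The attention and prediction steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the released implementation: the bilinear form of the score, the order of concatenation, and all function names (`attend`, `predict_emotion`) are assumptions for this example, and the decoder hidden states are passed in as given vectors rather than produced by an LSTM.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(W, v):
    return [dot(row, v) for row in W]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attend(h_dec, enc_states, W_a):
    """Assumed bilinear score s_i = h_dec^T W_a h_enc_i, then softmax and sum."""
    scores = [dot(h_dec, matvec(W_a, h_enc)) for h_enc in enc_states]
    alphas = softmax(scores)
    context = [sum(a * h[d] for a, h in zip(alphas, enc_states))
               for d in range(len(enc_states[0]))]
    return alphas, context


def predict_emotion(h_fwd, h_bwd, enc_states, W_a_fwd, W_a_bwd, w_j, b_j):
    """Sigmoid over the concatenated forward and backward attentional states."""
    _, c_fwd = attend(h_fwd, enc_states, W_a_fwd)
    _, c_bwd = attend(h_bwd, enc_states, W_a_bwd)
    features = c_fwd + h_fwd + c_bwd + h_bwd  # list concatenation
    return 1.0 / (1.0 + math.exp(-(dot(w_j, features) + b_j)))


# Toy inputs: three encoder steps, two-dimensional states, identity W_a.
enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
h_fwd, h_bwd = [1.0, 0.0], [0.0, 1.0]

alphas, context = attend(h_fwd, enc_states, identity)
p = predict_emotion(h_fwd, h_bwd, enc_states, identity, identity, [0.0] * 8, 0.0)
print(alphas, context, p)
```

Because each emotion has its own `w_j` and `b_j` and nothing predicted at earlier steps enters the computation, the per-step output is a BR-style binary decision, while the hidden states passed in are where correlations among emotions can be captured.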

Experimental Setup
Datasets. We conduct experiments on two multi-label emotion datasets: SemEval'18 (Affect in Tweets: Task E-c) and GoEmotions (Demszky et al., 2020). Compared with GoEmotions, SemEval'18 has fewer emotion categories, and is smaller in size. Both datasets come with standard train-dev-test splits. Appendix A shows the statistics of these datasets.
Baselines. On SemEval'18, we compare our system with the top submissions from the SemEval-2018 competition and with recent developments. NTUA-SLP (Baziotis et al., 2018) uses a large amount of external emotion-related data to pretrain an LSTM-based model. TCS Research's system (Meisheri and Dey, 2018) uses a support vector machine with manually engineered features: output from LSTM models, emotion lexicons (Mohammad and Kiritchenko, 2015), and SentiNeural (Radford et al., 2017). PlusEmo2Vec (Park et al., 2018) combines neural network models, which are pretrained by using emojis as labels (Felbo et al., 2017). Apart from the competition, Yu et al. (2018) propose DATN, which introduces sentiment information through dual attention. These aforementioned systems are based on the BR approach. SGM (Yang et al., 2018), however, is a CC-based model for multi-label classification. We include it as a baseline by using its publicly released code. Since the GoEmotions dataset is fairly recent, we only include the results originally reported by Demszky et al. (2020).
Settings. For the encoder, we set the two-layer bi-directional LSTM's dimension to 1200. Given the small number of emotions to embed, we set the dimension of the decoder LSTM to 400. The GloVe embedding is 300-dimensional, and the ELMo embedding is 1024-dimensional. We use the Adam optimizer (Kingma and Ba, 2015), where the learning rate is set to 5e-4 initially and decayed with cosine annealing. The batch size is set to 16 for SemEval'18, and to 32 for GoEmotions for efficiency.
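For reference, a cosine annealing schedule can be written as a short function. The sketch below starts from the stated initial learning rate of 5e-4; the minimum learning rate, the total number of steps, and the absence of warm restarts are assumptions for illustration, as the text does not specify them.

```python
import math


def cosine_annealed_lr(step, total_steps, lr_init=5e-4, lr_min=0.0):
    """Cosine annealing from lr_init down to lr_min over total_steps.

    lr_min and total_steps are assumed values, not taken from the paper.
    """
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * progress))


# The rate starts at lr_init, passes the midpoint value halfway through,
# and reaches lr_min at the final step.
for step in (0, 25, 50, 75, 100):
    print(step, cosine_annealed_lr(step, total_steps=100))
```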
We perform 5-fold cross-validation on the combined train-dev split for each experiment. Within each fold, we apply early stopping to prevent overfitting and return the best model based on Jaccard accuracy for testing. We then merge the predicted results over the test set by majority voting. Additionally, we repeat each 5-fold experiment 5 times to further reduce noise.
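The two ingredients of this protocol, Jaccard accuracy and majority voting over fold predictions, can be sketched as follows. This is an illustrative sketch: voting independently per label and scoring an instance with no gold and no predicted labels as 1 are assumptions, and the predictions below are made-up toy values.

```python
def jaccard_accuracy(gold, pred):
    """Mean over instances of |G & P| / |G | P| for multi-hot label vectors."""
    total = 0.0
    for g, p in zip(gold, pred):
        inter = sum(1 for a, b in zip(g, p) if a == 1 and b == 1)
        union = sum(1 for a, b in zip(g, p) if a == 1 or b == 1)
        # Assumption: an instance with no gold and no predicted labels scores 1.
        total += 1.0 if union == 0 else inter / union
    return total / len(gold)


def majority_vote(fold_predictions):
    """Merge per-fold multi-hot predictions label by label (assumed scheme)."""
    n_folds = len(fold_predictions)
    merged = []
    for instance_votes in zip(*fold_predictions):
        votes = [sum(column) for column in zip(*instance_votes)]
        merged.append([1 if 2 * v > n_folds else 0 for v in votes])
    return merged


# Toy predictions from 5 folds for 2 test instances and 3 emotions.
folds = [
    [[1, 0, 1], [0, 1, 0]],
    [[1, 0, 0], [0, 1, 0]],
    [[1, 1, 1], [0, 0, 0]],
    [[0, 0, 1], [0, 1, 1]],
    [[1, 0, 0], [1, 1, 0]],
]
gold = [[1, 0, 1], [0, 1, 1]]

merged = majority_vote(folds)
print(merged)                          # [[1, 0, 1], [0, 1, 0]]
print(jaccard_accuracy(gold, merged))  # 0.75
```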

Results
Overall performance. Table 1 presents the results on the SemEval'18 dataset. The proposed Seq2Emo outperforms the top submissions of the SemEval-2018 shared task in general. Seq2Emo outperforms the median submission by over 10% in Jaccard accuracy. Admittedly, Seq2Emo performs slightly worse than (but comparably to) NTUA-SLP and DATN, both of which introduce extra emotion/sentiment information through transfer learning. Our work, however, focuses on modeling the multi-label classification problem for emotion analysis and achieves high performance.
While both NTUA-SLP and DATN are based on the BR approach, we implement additional baselines for fair comparison. In particular, we implement BR and BR-att variants, where the latter uses an attention mechanism when predicting the emotions, similar to our Seq2Emo. In the same spirit, we also implement a CC-based baseline, which is a Seq2Seq model predicting the next emotion among all candidates. For fair comparison, all of the BR, BR-att, and CC variants are trained with the same setting as our Seq2Emo. In this controlled setting, we observe that the proposed Seq2Emo consistently outperforms BR, BR-att, and CC on the SemEval'18 dataset in all metrics.
For the GoEmotions dataset, we show the results in Table 2. Since it is a very new dataset, we can only find previously reported results from Demszky et al. (2020). In addition, we include BR, BR-att, and CC for fair comparison. Results show that Seq2Emo outperforms other models on most of the metrics, except that Seq2Emo is worse than CC on Jaccard accuracy. This is understandable, as we report several metrics on different datasets, and a single model is unlikely to be the best on every one of them.
It is worth noting that the model of Demszky et al. (2020) is based on BERT (Devlin et al., 2019). We replicate their approach to obtain all the evaluation metrics. We observe that our replication achieves a similar Macro-F1 to Demszky et al. (2020), which suggests that our replication is faithful. The results show that our Seq2Emo achieves comparable or higher performance than the BERT-based model.
We run one-sided t-tests to compare Seq2Emo with the best competing model that does not use additional data, as shown in Tables 1 and 2. We verify that most of the comparisons are statistically significant (although some are more significant than others). The two experiments provide consistent evidence of the effectiveness of our Seq2Emo.
Seq2Emo with a uni-directional decoder. One of the virtues of Seq2Emo is that it can use a bi-directional LSTM decoder. To show its effectiveness, we perform experiments on Seq2Emo with a uni-directional decoder, denoted as "Seq2Emo (uni)." We show the results in Tables 1 and 2 for the SemEval'18 and GoEmotions datasets, respectively. We first observe that Seq2Emo performs better than Seq2Emo (uni), which in turn is better than BR-att, which predicts emotions individually. This confirms that our Seq2Emo is able to implicitly model the correlation of different emotions, and that a bi-directional decoder is better than a uni-directional one.
Order of emotions. Both Seq2Emo and the classifier chain (CC) predict emotions sequentially. The difference is that our Seq2Emo predicts the presence (or not) of an emotion in a predefined order. CC, by contrast, predicts the next salient emotion autoregressively and learns the emotion order from the training data. We try different orders, including the original order in the dataset and the ascending/descending order based on emotion frequency. We also try an order where the emotion frequency first increases and then decreases (concave-down), and vice versa (concave-up). We perform experiments on SemEval'18 and report the Jaccard accuracy and the standard deviations in Table 3.
The results show that Seq2Emo is the least affected by the order of the emotions, whereas the performance of CC varies considerably. This verifies that the emotion order does not affect Seq2Emo much, as it can be easily learned. CC is more sensitive to emotion order and has a larger variance, as it suffers from the exposure bias problem.
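The frequency-based orders can be generated mechanically from label counts. The sketch below shows one possible construction; the interleaving used for the concave-down and concave-up orders, the tie-breaking by name, and the toy counts are assumptions for illustration, since the text only describes the shape of each order.

```python
def frequency_orders(freq):
    """Build candidate emotion orders from a {emotion: count} mapping."""
    # Ties are broken alphabetically so that the result is deterministic.
    ascending = sorted(freq, key=lambda e: (freq[e], e))
    descending = ascending[::-1]
    # Concave-down: frequency rises to a peak, then falls. Take every other
    # emotion going up, then the remaining ones going down.
    concave_down = ascending[0::2] + ascending[1::2][::-1]
    # Concave-up: frequency falls to a trough, then rises.
    concave_up = descending[0::2] + descending[1::2][::-1]
    return {
        "ascending": ascending,
        "descending": descending,
        "concave_down": concave_down,
        "concave_up": concave_up,
    }


# Hypothetical label counts, not taken from either dataset.
counts = {"anger": 40, "joy": 50, "sadness": 30, "surprise": 10, "fear": 20}
orders = frequency_orders(counts)
for name, order in orders.items():
    print(name, order)
```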
Case study. We conduct case studies in Ap-

Conclusion
In this work, we propose Seq2Emo for multi-label emotion classification. Our approach implicitly models the relationship of different emotions in its bi-directional decoder, and is shown to be better than an individual binary relevance (BR) classifier. Our model does not suffer from the exposure bias problem and also outperforms the classifier chain (CC). In general, we achieve state-of-the-art performance for multi-emotion classification on the SemEval'18 and GoEmotions datasets (without using additional emotion labels).

B Case Study
In Figure 2, we visualize the attention layer of Seq2Emo by plotting a heat map of the attention scores. The emotions shown in each example are the ground-truth labels of the corresponding utterance. We observe that Seq2Emo is able to focus on relevant words when predicting the emotion of interest. In Case 3, for example, the attention patterns for the emotions joy and love highly resemble each other, both focusing on the word "laughter." On the other hand, the decoder of Seq2Emo can focus on entirely different words if the emotions are different. In Case 1, we see that the emotion anticipation mainly focuses on "see free," whereas the emotion optimism mainly focuses on "is lining up volunteers."