Enhancing Cognitive Models of Emotions with Representation Learning

We present a novel deep learning-based framework that generates embedding representations of fine-grained emotions, which can be used to computationally describe psychological models of emotions. Our framework integrates a contextualized embedding encoder with a multi-head probing model that enables us to interpret dynamically learned representations optimized for an emotion classification task. Our model is evaluated on the Empathetic Dialogue dataset and achieves the state-of-the-art result for classifying 32 emotions. Our layer analysis derives an emotion graph depicting hierarchical relations among the emotions. Our emotion representations can be used to generate an emotion wheel directly comparable to the one from Plutchik's model, and to augment the values of missing emotions in the PAD emotional state model.


Introduction
Emotion classification has been extensively studied by many disciplines for decades (Spencer, 1895; Lazarus and Lazarus, 1994; Ekman, 1999). Two main streams have been developed for this research: one is the discrete theory that tries to explain emotions with basic and complex categories (Plutchik, 1980; Ekman, 1992; Colombetti, 2009), and the other is the dimensional theory that aims to conceptualize emotions into a continuous vector space (Russell and Mehrabian, 1977; Watson and Tellegen, 1985; Bradley et al., 1992). The illustration of human emotion, however, is often subjective and obscure in nature, leading to a long-standing debate among researchers about the "correct" way of representing emotions (Gendron and Feldman Barrett, 2009).
Representation learning has made remarkable progress recently by building neural language models on large corpora, which have substantially improved the performance on many downstream tasks (Peters et al., 2018; Devlin et al., 2019; Yang et al., 2019; Liu et al., 2019; Joshi et al., 2020). Encouraged by this rapid progress, along with an increasing interest in the interpretability of deep learning models, several studies have attempted to capture various kinds of knowledge encoded in language (Adi et al., 2017; Peters et al., 2018; Hewitt and Manning, 2019), and have shown that it is possible to learn computational representations of abstract concepts through distributional semantics. Inspired by these prior studies, we build a deep learning-based framework to generate emotion embeddings from text and assess its ability to enhance cognitive models of emotions. Our contributions are summarized as follows:

• We develop a deep probing model that allows us to interpret the process of representation learning on emotion classification (Section 3).

• We achieve the state-of-the-art result on the Empathetic Dialogue dataset for the classification of 32 emotions (Section 4).

• We generate emotion representations that can derive an emotion graph and an emotion wheel, as well as fill the gap for unexplored emotions in existing emotion theories (Section 5).

Related Work
Probing models are designed to construct a probe that detects knowledge in embedding representations. Peters et al. (2018) used linear probes to examine phrasal information in representations learned by deep neural models on multiple NLP tasks. Tenney et al. (2019) proposed an edge probing model using span pooling to analyze syntactic and semantic relations among words through word embeddings. Hewitt and Manning (2019) constructed a structural probe that detects correlations among word pairs to predict their latent distances in dependency trees. As far as we can tell, our work is the first to generate embeddings of fine-grained emotions from text and apply them to well-established emotion theories.

Figure 1: The overview of our deep learning-based multi-head probing model.

NLP researchers have produced several corpora for emotion detection, including FriendsED (Zahiri and Choi, 2018), EmoInt (Mohammad et al., 2017), EmoBank (Buechel and Hahn, 2017), and DailyDialogs (Li et al., 2017), all of which are based on coarse-grained emotions with at most 7 categories. For a more comprehensive analysis, we adopt the Empathetic Dialogue dataset based on fine-grained emotions with 32 categories (Rashkin et al., 2019).

Multi-head Probing Model
We present a multi-head probing model that allows us to interpret how emotion embeddings are learned in deep learning models. Figure 1 shows an overview of our probing model. Let W = {w_1, ..., w_n} be an input document, where w_i is the i'th token in the document. W is first fed into a contextualized embedding encoder that generates the embedding e_0 ∈ R^{d_0} representing the entire document. The document embedding e_0 is then fed into multiple probing heads, PH_{11}, ..., PH_{1k}, that generate the vectors e_{1j} ∈ R^{d_1} comprising features useful for emotion classification (j ∈ [1, k]). The probing heads in this layer are expected to capture abstract concepts (e.g., positive/negative, intense/mild).
Each vector e_{1j} is fed into a sequence of probing heads, where the probing head PH_{ij} generates e_{ij} ∈ R^{d_i} from e_{(i-1)j}. The feature vectors e_* from the final probing layer are expected to learn more fine-grained concepts (e.g., ashamed/embarrassed, hopeful/anticipating). The vectors e_* are concatenated and normalized to g ∈ R^{d·k}, then fed into a linear layer that generates the output vector o ∈ R^m, where m is the total number of emotions in the training data. It is worth mentioning that every probing sequence finds its own feature combinations. Thus, each of the vectors e_* potentially represents different concepts in emotions, which allows us to analyze concept compositions of these emotions empirically derived by this model.
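To make the data flow concrete, the following is a minimal framework-agnostic sketch of the forward pass in plain numpy, with random untrained parameters. The dimensions d_0 = 768 (BERT-base), the head configuration 128:64:32, k = 4 sequences, and the ReLU between probing heads are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(dim_in, dim_out):
    """Random parameters for one linear probing head (illustrative only)."""
    return rng.standard_normal((dim_out, dim_in)) * 0.1, np.zeros(dim_out)

def probing_forward(e0, heads, W_out, b_out):
    """Forward pass: k independent sequences of linear probing heads whose
    final outputs are concatenated, L2-normalized, and classified."""
    finals = []
    for seq in heads:                     # one sequence per probing head j
        e = e0
        for W, b in seq:                  # PH_1j -> PH_2j -> ... (ReLU assumed)
            e = np.maximum(W @ e + b, 0)
        finals.append(e)
    g = np.concatenate(finals)
    g = g / (np.linalg.norm(g) + 1e-8)    # normalized feature vector g
    return W_out @ g + b_out              # logits o over m emotions

d0, dims, k, m = 768, [128, 64, 32], 4, 32
heads = [[linear(a, b) for a, b in zip([d0] + dims[:-1], dims)]
         for _ in range(k)]
W_out, b_out = linear(dims[-1] * k, m)
logits = probing_forward(rng.standard_normal(d0), heads, W_out, b_out)
print(logits.shape)  # (32,)
```

Because each of the k sequences owns its parameters, the final vectors can specialize to different feature combinations, which is what the layer-wise analysis later exploits.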

Contextualized Embedding Encoder
For all experiments, BERT (Devlin et al., 2019) is used as the contextualized embedding encoder for our multi-head probing model in Section 3. BERT prepends the special token CLS to the input document W such that W' = {CLS} ⊕ W is fed into the ENCODER in Figure 1 instead, which generates the document embedding e_0 by applying several layers of multi-head attention to CLS along with the other tokens in W' (Vaswani et al., 2017).

Dataset
Although several datasets are available for various types of emotion detection tasks (Section 2), most of them are annotated with coarse-grained labels that are not suitable for a comprehensive analysis of the emotions learned by deep learning models.

        TRN            DEV            TST            ALL
C    19,533         2,770          2,547          24,850
L    18.2 (±10.4)   19.6 (±11.4)   23.0 (±12.5)   18.9 (±10.8)

Table 1: Statistics of the dataset splits (C: number of documents; L: average number of tokens per document, with standard deviation).

To demonstrate the impact of our probing model, the Empathetic Dialogue dataset is selected, which is labeled with 32 emotions on ≈25K conversations related to daily life. Each conversation comes with an emotion label, a situation described in text that reflects the emotion (e.g., Proud → "I finally got that promotion at work!"), and a short two-party dialogue generated through MTurk that simulates a conversation about the situation (Rashkin et al., 2019). For our experiments, only the situation parts are used as input documents.

Results
Several multi-head probing models are developed by varying the number of probing layers and the dimension of the feature vectors to find the most effective model for interpretation. For all models, a linear layer is used for every probing head.

Layer-wise Analysis
To analyze which emotional concepts are embedded in each probing layer (Section 3), we train a logistic regression model on the concatenated vector (e_{i1} ⊕ · · · ⊕ e_{ik}) for each layer i with the same configuration used for the 3-layer model, 128:64:32 (Table 2), and test it on the development set. For each pair of adjacent layers (i, j) where j = i+1 and 1 ≤ i ≤ 2, we measure the likelihood H_{ij}(s, t) of those layers classifying each emotion s as every other emotion t as follows:

H_{ij}(s, t) = L(s, t) - L(t, s), where L(s, t) = ℓ_j(s, t) - ℓ_i(s, t)

and ℓ_*(e_g, e_p) is the proportion of the documents whose gold labels are e_g but are predicted as e_p by the model trained on layer *. If L(s, t) > 0, the higher layer j tends to predict s as t more than the lower layer i does. L(t, s) > 0 implies the opposite, and is used as a penalty term to obtain a more reliable measurement of how much more the higher layer confuses s with t than the lower layer does.
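The confusion likelihood can be sketched as follows in numpy, assuming the reconstruction H(s, t) = L(s, t) - L(t, s) with L(s, t) = ℓ_j(s, t) - ℓ_i(s, t); the toy labels below are hypothetical.

```python
import numpy as np

def confusion_proportions(gold, pred, m):
    """l[g, p] = proportion of documents with gold label g predicted as p."""
    counts = np.zeros((m, m))
    for g, p in zip(gold, pred):
        counts[g, p] += 1
    totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts),
                     where=totals > 0)

def confusion_likelihood(l_lower, l_upper):
    """H[s, t] = L(s, t) - L(t, s), where L = l_upper - l_lower."""
    L = l_upper - l_lower
    return L - L.T

# toy example with 3 emotions: the upper layer confuses 0 with 1 more often
gold      = [0, 0, 0, 0, 1, 1, 2, 2]
pred_low  = [0, 0, 0, 1, 1, 1, 2, 2]
pred_high = [1, 1, 0, 0, 1, 1, 2, 2]
m = 3
H = confusion_likelihood(confusion_proportions(gold, pred_low, m),
                         confusion_proportions(gold, pred_high, m))
print(H[0, 1] > 0)  # True: the upper layer predicts 0 as 1 more often
```

Note that H is antisymmetric by construction (H[s, t] = -H[t, s]), so the penalty term simply cancels symmetric confusion between the two layers.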
The results are illustrated in Figure 2, where an arrow pointing from one emotion s to another emotion t indicates H_{ij}(s, t) ≥ 2. The dashed arrows and thin solid arrows correspond to the confusion likelihoods H_{12}(s, t) and H_{23}(s, t) respectively, and the thick solid arrows reflect the likelihoods in both metrics. Most emotion pairs point from coarse-grained emotions to fine-grained emotions (e.g., angry → furious, sentimental → nostalgic), with a few exceptions (e.g., excited → anticipating), implying that higher probing layers tend to learn finer-grained emotions than lower layers.

Plutchik (1980) introduced the emotion wheel by selecting a reference emotion and arranging the others on a circle whose angles are determined by manually assessed similarities between emotion pairs. Inspired by this work, we derive an emotion wheel by creating emotion embeddings and representing each complex emotion as a weighted sum of two basic emotions. Given an emotion e and the set of documents D_e whose gold labels are e in the DEV set, the embedding r_e of e is derived as follows:

r_e = (1 / |D_e|) Σ_{d ∈ D_e} g_d    (1)

where g_d is the normalized vector in Section 3 for document d.
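Averaging the normalized document vectors per emotion, as in Eq (1), is a one-liner; the explicit re-normalization below is a defensive assumption, since g_d is already normalized by the model in Section 3.

```python
import numpy as np

def emotion_embedding(doc_vectors):
    """Eq (1): the embedding of an emotion is the mean of the normalized
    feature vectors g_d of all DEV documents labeled with that emotion."""
    g = np.asarray(doc_vectors, dtype=float)
    g = g / np.linalg.norm(g, axis=1, keepdims=True)  # normalize each g_d
    return g.mean(axis=0)

# toy example: two documents labeled with the same emotion
r = emotion_embedding([[1.0, 0.0], [0.0, 1.0]])
print(r)  # [0.5 0.5]
```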

Generation of Emotion Wheel
For each complex emotion c, its combinatory basic emotion pair (b_i, b_j) and the weight w ∈ [0.1, 0.9] are found as follows (r_* is the embedding of b_*):

(b_i, b_j, w) = argmax [cosine_sim(r_{i,j,w}, r_c)], where r_{i,j,w} = w · r_i + (1 - w) · r_j    (2)

Figure 3 depicts the emotion wheel auto-generated by our framework; the 8 basic emotions are displayed on the outer circle and the complex emotions are displayed on the edges between those basic emotions, where the dot scales are proportional to the cosine similarities in Eq (2). Although the only manual part of this wheel is the selection of the basic emotions from Plutchik (1980), it is compatible with the original emotion wheel in Section A.2 and finds even more relations such as Excited = Anticipating + Joyful, Lonely = Sad + Afraid, and Grateful = Trusting + Joyful.
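The search in Eq (2) can be sketched as an exhaustive loop over basic emotion pairs and a weight grid; the 2D embeddings and emotion names below are toy values for illustration, not learned vectors.

```python
import numpy as np
from itertools import combinations

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def decompose(c_vec, basic):
    """Eq (2): search over basic emotion pairs (b_i, b_j) and weights
    w in [0.1, 0.9] for the weighted sum closest to the complex emotion."""
    best = (None, None, None, -1.0)
    for (ni, ri), (nj, rj) in combinations(basic.items(), 2):
        for w in np.arange(0.1, 0.95, 0.1):
            sim = cos(w * ri + (1 - w) * rj, c_vec)
            if sim > best[3]:
                best = (ni, nj, round(float(w), 1), sim)
    return best

# toy 2D embeddings: "excited" lies between "anticipating" and "joyful"
basic = {"anticipating": np.array([1.0, 0.0]),
         "joyful":       np.array([0.0, 1.0]),
         "afraid":       np.array([-1.0, 0.0])}
bi, bj, w, sim = decompose(np.array([1.0, 1.0]), basic)
print(bi, bj, w)  # anticipating joyful 0.5
```

With 8 basic emotions and 9 weight values, the search space is only 28 × 9 = 252 candidates per complex emotion, so brute force is entirely practical here.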

Augmentation of PAD Model
Russell and Mehrabian (1977) presented the PAD model, suggesting that emotions can be denoted by 3 dimensions of pleasure, arousal, and dominance. To verify whether our representations capture emotional concepts similar to the PAD model, we train a regression model per dimension that takes the emotion embeddings from Eq (1) and learns the corresponding PAD values in Section A.3, manually assessed by Russell and Mehrabian (1977). Note that 3 complex emotions whose cosine similarity scores are less than 0.1 are omitted from Figure 3: guilty, jealous, and nostalgic.
Note that the original PAD model provides the PAD values for only 22 emotions. Given the 3 regression models trained on those 22 emotions, we are able to predict the PAD values for the other 10 emotions missing from the original model. Figure 4 shows the 2D plot of the PA values predicted by our regression models for Pleasure and Arousal, where the 10 emotions whose PAD values are newly discovered by our models are indicated with red labels. It is exciting to see that the newly discovered emotions blend well into this plot (e.g., anticipating in between anxious and excited). Similar emotions are closer in this space (e.g., sentimental / nostalgic, trusting / faithful / confident), implying the robustness of the predicted values. Notice that the P value of nostalgic is predicted as positive, which is understandable because nostalgic relates to memories with happy personal associations; thus, it is found to be positive by distributional semantics.

Conclusion
This paper presents a multi-head probing model to derive emotion embeddings from neural model interpretation. Our model is applied to an emotion detection task and achieves the state-of-the-art result. These emotion embeddings can derive an emotion graph, depicting how abstract concepts are learned in neural models, as well as an emotion wheel and PAD values, verifying their potential to augment cognitive models with more diverse groups of emotions that have not been explored by cognitive theories.

A.1 Experimental Settings
The BERT model used in our experiments is BERT-base, and Table 3 shows the hyperparameters used to develop the models in Table 2.

A.2 Plutchik's Emotion Wheel
The emotion wheel described in Section 5.2 is inspired by Plutchik (1980), which proposed eight basic emotions that can constitute other complex emotions through various combinations, as shown by the emotion wheel in Figure 5, where the emotions displayed on the edges are compositions of the two adjacent basic emotions. As can be seen, our derived emotion wheel shares several emotion relations with Plutchik's emotion wheel, such as Hope = Anticipation + Trust, Anxiety = Anticipation + Fear, and Sentimentality = Trust + Sadness. This suggests the robustness of the emotion wheel derived by the proposed method in Section 5.2.

A.3 Russell and Mehrabian's PAD Model
All regression models in Section 5.3 are 2-layer multilayer perceptrons using the mean squared error (MSE) loss, consisting of a hidden layer with the ReLU activation and an output layer with the Tanh activation. The hidden layer dimension is 128, the dropout rate is 0.3, and early stopping is applied to avoid overfitting. The MSE losses of the three regression models predicting the Pleasure (P), Arousal (A), and Dominance (D) values are 0.028, 0.019, and 0.016, respectively. Table 4 describes the original PAD values of the 22 emotions from Russell and Mehrabian (1977), and Figure 6 shows the 2D plot from the PAD values of those 22 emotions. Table 5 lists the PAD values predicted by our regression models. Comparing Table 4 and Table 5, most of the predicted values are close to their gold values. Also, we can observe that the predicted values of some newly discovered emotions are consistent with our perception of emotions. For example, Anticipating is very close to Hope in terms of pleasure but with higher intensity.
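A minimal sketch of one per-dimension regressor follows, in plain numpy with hand-written gradients; the toy data, the learning rate, and the epoch count are illustrative assumptions, and dropout plus early stopping are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

class PADRegressor:
    """2-layer MLP with a 128-unit ReLU hidden layer, a Tanh output
    (PAD values lie in [-1, 1]), and the MSE loss."""

    def __init__(self, dim_in, hidden=128, lr=0.01):
        self.W1 = rng.standard_normal((hidden, dim_in)) * 0.1
        self.b1 = np.zeros(hidden)
        self.w2 = rng.standard_normal(hidden) * 0.1
        self.b2 = 0.0
        self.lr = lr

    def forward(self, x):
        self.h = np.maximum(self.W1 @ x + self.b1, 0.0)  # ReLU hidden layer
        return np.tanh(self.w2 @ self.h + self.b2)       # Tanh output

    def step(self, x, y):
        y_hat = self.forward(x)
        dz = 2.0 * (y_hat - y) * (1.0 - y_hat ** 2)      # d(MSE)/dz via Tanh
        dh = dz * self.w2 * (self.h > 0)                 # backprop via ReLU
        self.w2 -= self.lr * dz * self.h
        self.b2 -= self.lr * dz
        self.W1 -= self.lr * np.outer(dh, x)
        self.b1 -= self.lr * dh

# toy stand-in for the 22 gold emotions: 8-dim "embeddings" -> one PAD value
X = rng.standard_normal((22, 8))
y = np.tanh(X @ rng.standard_normal(8))

model = PADRegressor(dim_in=8)
mse = lambda: np.mean([(model.forward(x) - t) ** 2 for x, t in zip(X, y)])
mse_before = mse()
for _ in range(300):
    for x, t in zip(X, y):
        model.step(x, t)
mse_after = mse()
print(mse_after < mse_before)  # True: training reduces the MSE
```

Once the three regressors are fit on the 22 gold emotions, predicting the missing 10 is just a forward pass over their Eq (1) embeddings.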

A.4 Combinatory Emotions Details
In Section 5.2, we propose a framework to find the combinatory basic emotion pair for each complex emotion by calculating a weighted sum vector of two basic emotion embeddings. Table 6 lists the basic emotion pairs, weights, and cosine similarities for the 24 complex emotions derived by our framework.
The weight indicates how much each basic emotion in the pair contributes to the complex emotion and can be interpreted in a proportional manner. For example, Annoyed can be composed of 90% Angry and 10% Anticipating.