MIME: MIMicking Emotions for Empathetic Response Generation

Current approaches to empathetic response generation view the set of emotions expressed in the input text as a flat structure, where all emotions are treated uniformly. We argue that empathetic responses often mimic the emotion of the user to a varying degree, depending on its positivity or negativity and content. We show that considering these polarity-based emotion clusters and emotional mimicry improves the empathy and contextual relevance of the response compared to the state of the art. We also introduce stochasticity into the emotion mixture, which yields emotionally more varied empathetic responses than previous work. We demonstrate the importance of these factors for empathetic response generation using both automatic and human-based evaluations. The implementation of MIME is publicly available at https://github.com/declare-lab/MIME.


Introduction
Empathy is a fundamental human trait that reflects our ability to understand and reflect the thoughts and feelings of the people we interact with. In the social sciences, research on empathy has evolved into an entire field of study, addressing the social underpinnings of empathy (Singer and Lamm, 2009), the cognitive and emotional aspects of empathy (Smith, 2006), and its connection to personal and demographic traits (Dymond, 1950; Eisenberg et al., 2014; Krebs, 1975). The study of empathy has found a wide range of applications in healthcare, including psychotherapy (Bohart and Greenberg, 1997) and, more broadly, as a mechanism to improve the quality of care (Mercer and Reynolds, 2002).
* signifies equal contribution

Figure 1: An instance where a positive context is responded to with ambivalence. (User, joyful: "I am so excited because I am finally going to visit my parents next month! I did not see them for 3 years." Gold empathetic response: "3 years is a long time. How come?")

Computational models of empathy have been proposed only in recent years, partly because of
the complexity of this behavior, which makes it difficult to emulate with computational approaches. In natural language processing, the methods proposed to date address the tasks of understanding expressions of empathy in newswire (Buechel et al., 2018) and counseling conversations (Pérez-Rosas et al., 2017), and of generating empathy in dialogue (Shen et al., 2020; Lin et al., 2019). Work has also been done on the construction of empathy lexicons (Sedoc et al., 2020) and of large empathy dialogue datasets (Rashkin et al., 2019). In this paper, we address the task of generating empathetic responses that mimic the emotion of the speaker while accounting for its affective charge (positive or negative). We adopt the idea of an emotion mixture, as in the state-of-the-art MoEL (Lin et al., 2019), to achieve the appropriate balance of emotions in the positive and negative emotion groups. However, inspired by Serban et al. (2017), we introduce stochasticity into the mixture at the emotion-group level to obtain varied responses. This becomes particularly important in cases where the input utterance can be responded to with ambivalent, yet befitting, utterances. Fig. 1 shows one such example, where the response to a positive utterance is ambivalent.
The paper makes two important contributions. First, it introduces a new approach for empathetic response generation that encodes context and emotions, and uses stochastic emotion sampling and emotion mimicry to generate responses that are appropriate and empathetic to both positive and negative statements. We show that this approach leads to performance exceeding the state of the art when trained and evaluated on a large empathy dialogue dataset. Second, through extensive feature-ablation experiments, we shed light on the roles played by emotion mimicry and emotion grouping in the task of empathetic response generation.

Related Work
Open-domain conversational models have made good progress in recent years (Vinyals and Le, 2015; Wolf et al., 2019). Many of them can generate persona-consistent and diverse (Cai et al., 2018) responses, but those responses are not necessarily empathetic.
Producing empathetic responses requires apt handling of emotions and sentiments (Winata et al., 2017; Bertero et al., 2016). Prior work models psychological concepts as memory states in an LSTM (Hochreiter and Schmidhuber, 1997) and employs emotion-category embeddings in the decoding process. Wang and Wan (2018) present a GAN-based (Goodfellow et al., 2014) framework with emotion-specific generators. On a larger scale, Zhou and Wang (2018) use the emojis in Twitter posts as emotion labels and introduce an attention-based (Luong et al., 2015) sequence-to-sequence (Sutskever et al., 2014) model with a Conditional Variational Autoencoder (Sohn et al., 2015) for emotional response generation. However, these models only produce affective responses conditioned on a user-provided emotion, which is not necessarily empathetic towards the speaker. Wu and Wu (2019) introduce a dual-decoder network to generate responses with a given sentiment (positive or negative). Shin et al. (2020) formulate a reinforcement learning problem to maximize the user's sentimental feeling towards the generated response. Lin et al. (2019) present an encoder-decoder model in which each emotion has a dedicated decoder.

Methodology
Our model MIME is based on the assumption that empathetic responses often mimic the emotion of the speaker (Carr et al., 2003), in our case, the human subject or user. For example, positively-charged utterances are usually responded to with positive emotions, although the response can also be ambivalent, as illustrated in Fig. 1. On the other hand, responding to negatively-charged utterances often requires composite emotions that agree with the user's emotion but also try to comfort them with some positivity, such as hopefulness or a silver lining. As such, we strive to balance the mimicry of the context/user emotion during empathetic response generation.
To this end, we first obtain a context representation using a transformer encoder (Vaswani et al., 2017). Similar to the state-of-the-art (SOTA) model MoEL (Lin et al., 2019), we enforce emotion understanding in the context representation by classifying the user emotion during training. For the response emotion, we first group the 32 emotions into two groups containing positive and negative emotions (Section 3.3). Next, a probability distribution over the emotions of each group is sampled, corresponding to the emotion of the response. Positive and negative response-emotion representations are then formed from these distributions and the emotion embeddings. The two representations are appropriately combined, balancing the two kinds of emotions, to form the emotional representation that drives the emotional state during response generation with a transformer decoder (Vaswani et al., 2017). Fig. 2 shows the architecture of our model.

Task Definition
Given the context utterances [u_0, u_1, ..., u_{n-1}], where each utterance u_i = [w^i_0, w^i_1, ..., w^i_{m-1}] consists of at most m words, the task is to generate an empathetic response to the last utterance u_{n-1}, which always comes from the target speaker or user. All even-numbered utterances (u_0, u_2, ...) belong to the user and all odd-numbered utterances (u_1, u_3, ...) to the empathetic agent. Optionally, the context/user emotion e can be predicted for emotion understanding. The emotions are listed in Table 1.
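The turn-taking structure above can be sketched as follows (a minimal illustration with made-up utterances; `split_speakers` is a hypothetical helper for exposition, not part of the model):

```python
def split_speakers(context):
    """Partition context utterances by turn parity: even turns are the
    user's, odd turns are the empathetic agent's."""
    user_turns = [u for i, u in enumerate(context) if i % 2 == 0]
    agent_turns = [u for i, u in enumerate(context) if i % 2 == 1]
    return user_turns, agent_turns

context = [
    "I am finally visiting my parents next month!",  # u_0: user
    "That is wonderful news.",                       # u_1: agent
    "I have not seen them for 3 years.",             # u_2: user
]
user_turns, agent_turns = split_speakers(context)

# The response is always generated for the last utterance u_{n-1},
# which by construction belongs to the user.
assert context[-1] == user_turns[-1]
```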

Context Encoding
Following the MoEL system (Lin et al., 2019), the contextual utterances are first sequentially concatenated into a single sequence C of k (≤ mn) words.

Figure 2: Architecture of our model (MIME).
As in MoEL, each word w^i_j is represented as the sum of three embeddings (E_C): a semantic word embedding (E_W), a positional embedding (E_P), and a speaker embedding (E_S). Also as in MoEL, a transformer encoder (Vaswani et al., 2017) is used for context propagation within the utterances and words in C. Moreover, inspired by BERT (Devlin et al., 2019), one additional token CTX is prepended to the context sequence C to encode the entirety of the context: H = TR^ctx_Enc(E_C([CTX; C])) (Eq. (2)), where TR^ctx_Enc is the transformer encoder with output size D_h, and H ∈ R^{(k+1)×D_h} contains the context-enriched representations of the contextual words. The context-enriched representation of the CTX token, c = H_0, is taken as the overall context representation (Eq. (3)).

Emotion Embedding and Classification. As in MoEL, and also as in Rashkin et al. (2018), to explicitly infuse emotion into the context representation c, we train an emotion classifier on c. We train emotion embeddings E_E ∈ R^{n_emo×D_h} (n_emo = 32 is the number of emotion classes) to represent each emotion. We maximize the similarity between c and the user-emotion representation E_E(e), e being the user-emotion label, using a cross-entropy loss L_cls, where W_E ∈ R^{D_h×D_h} is a projection matrix and s, P ∈ R^{n_emo} are the similarity scores and the predicted emotion distribution, respectively.
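The classification step can be sketched in NumPy as follows. This is a simplified stand-in with random parameters, not the trained weights, and the exact score function (projected context scored against each emotion embedding) is our reading of the garbled equations and should be treated as an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
D_h, n_emo = 300, 32  # hidden size and number of emotion classes

c = rng.standard_normal(D_h)                  # context representation c (CTX token)
W_E = rng.standard_normal((D_h, D_h)) * 0.05  # projection matrix (stand-in)
E_E = rng.standard_normal((n_emo, D_h))       # trainable emotion embeddings (stand-in)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

s = E_E @ (W_E @ c)    # similarity of the projected context to each emotion
P = softmax(s)         # predicted emotion distribution over the 32 classes
e = 7                  # gold user-emotion index (illustrative)
L_cls = -np.log(P[e])  # cross-entropy: maximizes similarity between c and E_E(e)

assert P.shape == (n_emo,)
```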

Response Generation (Decoder)
Our primary assumption behind this model is that the empathetic agent mimics the user's emotion to some degree when responding. Specifically, positive emotion is often responded to with a similarly positive response. Negative emotion, however, is likely to be responded to with negativity mixed with some positivity to uphold the user's morale.
Emotion Grouping. We split the 32 emotion types into two groups containing 13 positive and 19 negative emotions, as listed in Table 1. This split is guided by our intuition.
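As an illustration, the grouping can be expressed as a simple lookup. Only a subset of the 32 EmpatheticDialogues emotions is shown here; the full split is given in Table 1, and note that 'surprised' is placed in the positive group, a choice revisited in the error analysis:

```python
# Illustrative subset of the polarity-based emotion grouping (Table 1 has the full split).
POSITIVE = {"joyful", "excited", "proud", "grateful", "hopeful", "surprised"}
NEGATIVE = {"sad", "afraid", "angry", "lonely", "ashamed", "disappointed"}

def emotion_group(emotion):
    """Map an emotion label to its polarity group ('pos' or 'neg')."""
    if emotion in POSITIVE:
        return "pos"
    if emotion in NEGATIVE:
        return "neg"
    raise KeyError(f"unknown emotion label: {emotion}")

assert emotion_group("joyful") == "pos"
assert emotion_group("afraid") == "neg"
```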
Stochastic Emotion Sampling. We introduce stochasticity in the response-emotion determination, which results in emotionally more varied responses. In Table 7, we present responses generated by MIME with and without stochasticity. To this end, we sample response-emotion distributions d_pos and d_neg from the context C (specifically, c in Eq. (3)) for the positive and negative emotion groups, respectively. Hence, we sample an unnormalized distribution z_g (g ∈ {pos, neg}) from the distribution P_θ(z_g|C). This z_g is passed to a fully-connected layer (FC_{d_g}) with softmax activation to obtain the normalized distribution d_g ∈ R^{n_g} (n_pos = 13 and n_neg = 19). The emotion representation for each emotion group, e_g ∈ R^{D_h}, is obtained by pooling the corresponding emotion embeddings with the respective distribution d_g: e_g = d_g E_{E_g} (Eq. (10)), where E_{E_g} ∈ R^{n_g×D_h} are the emotion embeddings of emotion group g, as defined in Table 1. Sampling from the distribution P_θ(z_g|C) is reparameterized as z_g = μ_g(C) + σ_g(C) ⊙ ε, with ε ~ N(0, I), where the FC_* producing μ_g and σ_g are fully-connected layers with output size D_h. Following Serban et al. (2017), P_θ(z_g|C) is learned by maximizing the evidence lower bound (−L^g_ELBO), where Q_ψ(z_g|e_g, C) is the approximate posterior distribution, defined as Q_ψ(z_g|e_g, C) = N(μ^posterior_g(e_g, C), σ^posterior_g(e_g, C)) (Eq. (17)), which is reparameterized similarly to P_θ(z_g|C) for sampling during training only, except that e_g is concatenated to c.
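The group-level sampling can be sketched as follows (NumPy, with random stand-in weights; the exact parameterization of the prior network is an assumption based on the standard reparameterization trick):

```python
import numpy as np

rng = np.random.default_rng(1)
D_h = 300
n_g = {"pos": 13, "neg": 19}  # group sizes, as in Table 1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

c = rng.standard_normal(D_h)  # context representation c (Eq. (3))

# Stand-in fully-connected layers (random weights, small scale for illustration).
W_mu  = {g: rng.standard_normal((D_h, D_h)) * 0.05 for g in n_g}
W_sig = {g: rng.standard_normal((D_h, D_h)) * 0.05 for g in n_g}
FC_d  = {g: rng.standard_normal((n, D_h)) * 0.05 for g, n in n_g.items()}
E_Eg  = {g: rng.standard_normal((n, D_h)) for g, n in n_g.items()}

e_g = {}
for g in n_g:
    # Reparameterized sample z_g ~ P_theta(z_g | C): z = mu + sigma * eps.
    mu, log_sigma = W_mu[g] @ c, W_sig[g] @ c
    eps = rng.standard_normal(D_h)
    z_g = mu + np.exp(log_sigma) * eps
    # Normalized response-emotion distribution d_g over the group's emotions.
    d_g = softmax(FC_d[g] @ z_g)
    # Pool the group's emotion embeddings with d_g to obtain e_g.
    e_g[g] = d_g @ E_Eg[g]

assert e_g["pos"].shape == (D_h,) and e_g["neg"].shape == (D_h,)
```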
Emotion Mimicry. Following Carr et al. (2003), it is reasonable to assume that an empathetic response to an emotional utterance will mimic the emotion of the user to some degree. Responding empathetically to positive utterances usually requires positivity, occasionally including ambivalence (Fig. 1). On the other hand, responses to negative utterances should contain some empathetic negativity, but mixed with some positivity to soothe the user's negativity. Thus, we generate two distinct response-emotion-refined context representations (mimicking and non-mimicking) that are appropriately merged to obtain the response-decoder input. Naturally, the mimicking and non-mimicking emotion representations, m and m̃, are defined by assigning e_pos and e_neg according to the polarity of the user emotion (Eqs. (18) and (19)). Firstly, the response-emotion representations m and m̃ are concatenated to the context-enriched word representations in H_{1:k} (Eq. (2)) to provide the context C with cues on the response emotion, yielding H_resp, H̃_resp ∈ R^{k×2D_h}. These are fed to a transformer encoder (TR^resp_Enc) to obtain the mimicking and non-mimicking response-emotion-refined context representations M and M̃, respectively, where M, M̃ ∈ R^{k×D_h}.
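The selection of the mimicking and non-mimicking representations can be sketched as a polarity-driven swap (`mimic_pair` is a hypothetical helper; `e_pos`/`e_neg` stand in for the group-emotion representations from the sampling step):

```python
import numpy as np

rng = np.random.default_rng(2)
D_h = 300
e_pos = rng.standard_normal(D_h)  # positive group-emotion representation
e_neg = rng.standard_normal(D_h)  # negative group-emotion representation

def mimic_pair(context_is_positive, e_pos, e_neg):
    """Return (m, m_tilde): the mimicking representation matches the
    predicted polarity of the context; the other one is non-mimicking."""
    if context_is_positive:
        return e_pos, e_neg
    return e_neg, e_pos

# Positive context: mimic the user's positivity.
m, m_tilde = mimic_pair(True, e_pos, e_neg)
assert np.array_equal(m, e_pos) and np.array_equal(m_tilde, e_neg)

# Negative context: mimic the negativity; the non-mimicking positive
# component remains available to soothe the user.
m, m_tilde = mimic_pair(False, e_pos, e_neg)
assert np.array_equal(m, e_neg) and np.array_equal(m_tilde, e_pos)
```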

Response-Emotion-Refined Context Fusion.
Enabling a mixture of positive and negative emotions can lead to more diverse response generation than considering exclusively positive or negative emotions. To achieve this mixture, we concatenate M and M̃ at the word level, as opposed to the sequence level, obtaining M_cat ∈ R^{k×2D_h}. Then, M_cat is fed to a gate consisting of a fully-connected layer (FC_contrib) with sigmoid activation, resulting in M_contrib, which determines the contributions of the positive and negative response-emotion-refined contexts to the response to be generated. Subsequently, M_cat is multiplied with the gate output, yielding the refined context M_adjust, which is fed to another fully-connected layer FC_fused to obtain the fused response-emotion-refined context M_fused ∈ R^{k×D_h}.

Response Decoding. For the final response generation from the response-emotion-refined context M_fused, following MoEL, a transformer decoder (TR^resp_Dec) is employed, with M_fused as key and value: O ∈ R^{t×D_h}, where t is the number of tokens in the response R (R_0 is the <start> token), FC_decode is a fully-connected layer with output size |V| (the vocabulary size), P_resp ∈ R^{t×|V|}, and p(R_i|C, R_0:i−1) is the probability distribution over each response token. Finally, categorical cross-entropy quantifies the generation loss with respect to the gold response R_gold.
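The fusion gate described above can be sketched as follows (NumPy, with random stand-in weights and assumed shapes; biases are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(3)
k, D_h = 10, 300                         # context length and hidden size
M       = rng.standard_normal((k, D_h))  # mimicking refined context
M_tilde = rng.standard_normal((k, D_h))  # non-mimicking refined context

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Stand-in fully-connected layers.
W_contrib = rng.standard_normal((2 * D_h, 2 * D_h)) * 0.01
W_fused   = rng.standard_normal((2 * D_h, D_h)) * 0.01

M_cat = np.concatenate([M, M_tilde], axis=-1)  # word-level concat: k x 2D_h
M_contrib = sigmoid(M_cat @ W_contrib)         # gate in (0, 1), per dimension
M_adjust = M_cat * M_contrib                   # gated positive/negative mixture
M_fused = M_adjust @ W_fused                   # fused context: k x D_h

assert M_fused.shape == (k, D_h)
```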

Training
Naturally, we combine all the losses for model training: the classification loss L_cls, the two ELBO losses L^pos_ELBO and L^neg_ELBO, and the generation loss, weighted by α, β, and γ, respectively. The total loss L is optimized using the Adam (Kingma and Ba, 2015) optimizer with a learning rate of 0.0001, patience of 2, and batch size of 16.
Loss weights α, β, and γ are set to 1. For comparability with the SOTA, the semantic word embeddings (E_W) are initialized with GloVe (Pennington et al., 2014) embeddings. All hyperparameters are optimized using grid search on the validation set, resulting in D_h = 300 and a beam size of 5.

Experimental Settings
During inference, we use the emotion classifier (Eq. (5)) with emotion grouping (Table 1) to determine the positivity or negativity of the context, which is necessary for forming the mimicking and non-mimicking emotion representations.

Dataset
We evaluate our method on EMPATHETICDIALOGUES (Rashkin et al., 2018), a dataset that contains 24,850 open-domain dyadic conversations between two users, where one responds empathetically to the other. For our experiments, we use the 8:1:1 train/validation/test split defined by the authors of this dataset. Each sample consists of a context (an excerpt of a full conversation, together with the emotion of the user) and the empathetic response to the last utterance in that context. There are a total of 32 different emotion categories, roughly uniformly distributed across the dataset.

Baselines and State of the Art
We do not compare MIME with affective response generation models, as they require the response emotion to be explicitly provided, and the resulting response is not necessarily empathetic. As such, MIME is compared against the following models: Multitask-Transformer Network (Multi-TR).

Mixture of Empathetic Listeners (MoEL).
This state-of-the-art method (Lin et al., 2019) performs user-emotion classification, as Multi-TR does. However, in contrast to our method, it employs emotion-specific decoders whose outputs are aggregated and fed to a final decoder to generate the empathetic response.

Evaluation
Although BLEU (Papineni et al., 2002) has long been used to compare a system-generated response against the human gold response, Liu et al. (2016) argue against its efficacy in open-domain settings, where the gold response is not necessarily the only correct response. Therefore, like MoEL, we report BLEU mostly for reference. Following MoEL and Rashkin et al. (2018), we rely on human-evaluated metrics.

Human Ratings. Firstly, we randomly sample four instances of each of the 32 emotion labels from the test set, resulting in a total of 128 instances, compared to the 100 instances used for the evaluation of MoEL. Next, we ask three human annotators to rate each sub-sampled model response on a scale from 1 to 5 (best score) on three distinct attributes: empathy, relevance, and fluency.

Following Table 2, responses from MIME show improved empathy over MoEL and Multi-TR. We surmise this was achieved by modeling our primary intuition of appropriately mimicking the user's emotion in the context through stochasticity and positive/negative grouping. Moreover, the use of trained emotion embeddings (E_E), shared between the emotion classifier and the response decoder, seems to encode refined context-invariant emotional and emotion-specific linguistic cues that may lead to empathetically improved response generation. The SOTA model, MoEL, does train a similar emotion embedding, but it is set up as the key of a key-value memory (Miller et al., 2016), which leads to a weaker connection with the decoder, resulting in less emotional-context flow. We believe this embedding sharing further leads to the improved relevance rating for MIME, since contextual information flow is now shared between the emotion embeddings and the encoder output (Eq. (2)). This sharing intuitively leads to a refinement in context flow.
However, we also observe that the responses from our model have worse fluency than those of the other models, including MoEL. This might be attributed to the very structure of the decoder, which seems to refine the emotional context well. This may have coerced the final transformer decoder to focus more on emotionally apt tokens of the response than on appropriate stop-words that have no intrinsic emotional content but lead to grammatical clarity.
Human A/B Test. Based on the results in Table 3, we note that the responses from MIME are preferred by humans more often than the responses from MoEL and Multi-TR. This correlates with the results in Table 2, which indicate better empathy and contextual relevance for MIME. Further, the annotators prefer the responses from MIME with stochasticity (STC) over those without. Table 7 shows the impact of stochasticity on the responses.
Table 4: Model performance on positive and negative contexts; Pos and Neg stand for positive and negative emotions, respectively; the best score for each metric-polarity combination is highlighted in bold.

Performance on Positive and Negative User Emotions. We observe (Table 4) that the responses generated by MIME for both positive and negative user emotions are generally better in terms
of empathy and fluency. Interestingly, MoEL seems to perform better when responding to negative emotions than to positive emotions in terms of empathy and fluency. We posit this stems from the abundance of negative samples in the dataset as compared to positive samples (13 positive vs. 19 negative emotions, roughly uniformly distributed). This may suggest that MoEL is more sensitive than MIME and Multi-TR to the positive/negative context imbalance in the dataset.

Effect of Emotion Mimicry. To assess the contribution of user-emotion mimicry, we disable it by passing e_g (Eq. (10)) directly to Eqs. (20) and (21). This results in a substantial drop in empathy, by 0.2, as per Table 5. We delve deeper by plotting the emotion embeddings produced with and without emotion mimicry in Fig. 3a and Fig. 3b, respectively. It is evident that the separation of the positive and negative emotion clusters is much clearer with emotion mimicry than without, suggesting better emotion understanding in the former case through emotion disentanglement. On the other hand, we observe a slight increase in relevance, by 0.03. We surmise this is caused by the absence of the confounding effect of swapping the values of m and m̃, in Eqs. (18) and (19), depending on the user-emotion type. This may coerce the same set of parameters to learn to process both positive and negative emotions.

Ablation Study
Effect of Emotion Grouping. Looking at Table 5, we observe a performance drop in both empathy and relevance, by 0.73 and 0.02, respectively, in the absence of emotion grouping. This indicates the importance of treating positive and negative emotions separately, rather than lumping them into a single distribution. We posit that the latter causes all the emotions to compete for importance, which may lead to emotion uniformity in some cases, or to one emotion type overwhelming the others in other cases. This in turn may lead to emotionally mundane and generic responses.

Case Study
Context Capturing. Based on the comparative results for relevance shown in Table 2, MIME appears to generate responses that are a closer fit to the context than those of MoEL.

Figure 4: A test sample where MIME responds with key information from the context. (Gold empathetic response: "That is really good you feel pretty secure about it then?")

Fig. 4 shows a test instance where MIME pulls key information from the context (the word 'interview') to generate an empathetic and relevant response. The response from MoEL is also empathetic, but somewhat more generic. We surmise that this can be attributed to the two-way context flow through the emotion embedding sharing and the encoder output, as discussed in Section 5.1.
Similarly, Fig. 5 shows a conversation with an apprehensive user who shares a frightening story with a positive outcome. Here, MoEL fixates on the initial negative emotion of the user and replies with an unwarranted negatively empathetic response. MIME, however, responds with the appropriate positivity hinted at in the last utterance. Moreover, it correctly interprets the described events as a 'beautiful memory', which is truly empathetic and relevant. Again, the strong mixture of context and emotion, facilitated by the emotion embedding sharing, is likely responsible for this. We show more examples generated by both MoEL and MIME in Table 6.

Figure 5: A test sample where MIME responds with subtle information from the context. (Gold empathetic response: "It sounds like a great experience!")

Error Analysis
Low Fluency. As evidenced by Table 2, MIME falters in fluency as compared to MoEL. Fig. 6 shows an instance where MoEL generates an empathetic, yet somewhat generic, fluent response. In contrast, the first response utterance from MIME, "I would have been to the police", does make contextual sense. However, the second utterance, "I would be a little better", reads as incoherent and semantically unclear. Perhaps the model meant something like 'I would have felt a little safer'. We repeatedly observed such errors, leading to poor fluency. Given the empathy- and relevance-focused structure of our model, we think MIME focused on learning empathy and relevance at the cost of fluency. We believe this issue could be mitigated with additional training samples.

Figure 6: A test sample where MIME responds with a malformed utterance. (Gold empathetic response: "That is always number one goal.")
Response to Surprised User Context. In our experiments, we assumed the emotion surprised to be positive (Table 1), and thus MIME responds with positivity to most test instances with surprised as the user emotion. However, this is not always accurate, as one can be both positively and negatively surprised: "I recently found out that a person I [...] admired did not feel the same way. I was pretty surprised" vs. "This mother's day was amazing!". We posit that re-annotating the instances with a negatively-surprised user with a new negative emotion, namely shocked, should significantly alleviate this issue.
Emotion Classification. The {top-1, top-2, top-5} emotion-classification accuracies for MoEL are {38%, 63%, 74%}, as compared to {34%, 58%, 77%} for MIME. Since the emotion embeddings are shared between the encoder and decoder in MIME, they supposedly also encode some generation-specific information in addition to pure emotion, as discussed in Section 5.1, thereby hindering the overall emotion-classification performance. Notably, MIME nonetheless outperforms MoEL on top-5 classification. This is likely due to MIME's ability to discern positive and negative emotion types, as indicated by Fig. 3a, which comes into prominence as more likely labels are considered by raising k in top-k classification.

Conclusion
This paper introduced a novel empathetic response generation strategy that relies on two key ideas: emotion grouping and emotion mimicry. In addition, stochasticity was applied to the emotion mixture for varied response generation. We have shown through several human evaluations and ablation studies that our model is better equipped for empathetic response generation than existing models. However, there remains much room for improvement, particularly in terms of fluency, where our model falters. Moreover, emotions like 'surprise' and 'anticipation' may need to be dealt with explicitly due to their ambiguous polarity.