EmpDG: Multi-resolution Interactive Empathetic Dialogue Generation

A humanized dialogue system is expected to generate empathetic replies, which should be sensitive to the users’ expressed emotion. The task of empathetic dialogue generation is proposed to address this problem. The essential challenges lie in accurately capturing the nuances of human emotion and considering the potential of user feedback, which are overlooked by the majority of existing work. In response to this problem, we propose a multi-resolution adversarial model – EmpDG, to generate more empathetic responses. EmpDG exploits both the coarse-grained dialogue-level and fine-grained token-level emotions, the latter of which helps to better capture the nuances of user emotion. In addition, we introduce an interactive adversarial learning framework which exploits the user feedback, to identify whether the generated responses evoke emotion perceptivity in dialogues. Experimental results show that the proposed approach significantly outperforms the state-of-the-art baselines in both content quality and emotion perceptivity.


Introduction
Studies on social psychology suggest that "empathy" is a crucial step towards a more humanized human-machine conversation, which improves the emotional perceptivity in emotion-bonding social activities (Zech and Rimé, 2005). To design an intelligent automatic dialogue system, it is important to make the chatbot empathetic within dialogue interactions (Prendinger and Ishizuka, 2005). Therefore, in this paper, we focus on the task of empathetic dialogue generation (Rashkin et al., 2019), which automatically tracks and understands the user emotion information in a multi-turn dialogue scenario.
Despite the achieved successes (Rashkin et al., 2019; Lin et al., 2019), obstacles to establishing an empathetic conversational system are still far beyond current progress: (1) It is still difficult to accurately capture the nuances of human emotion in dialogue generation (Ghosal et al., 2019). (2) Merely relying on the dialogue history while overlooking the potential of user feedback for the generated responses further aggravates the aforementioned deficiency and causes undesirable responses (Zhang et al., 2018a). In Figure 1, we give an example from the benchmark dataset EMPATHETICDIALOGUES (Rashkin et al., 2019). Note that the emotional words in the sequence of utterances (in Figure 1, they are "new" and "job" in utterance 1, "amazing" and "excited" in utterance 2, and "excited" in utterance 3) have fine-grained emotional connections. Without considering fine-grained emotional words, the responses generated by existing methods are trivial and uninformed, even though they express the appropriate emotions. Therefore, explicitly modelling the fine-grained emotional factor is necessary.
In this paper, we propose a multi-resolution adversarial empathetic dialogue generation model, named EmpDG, to address the above challenges by generating more appropriate and empathetic responses. To capture the nuances of user emotions sufficiently, EmpDG generates responses by jointly taking coarse-grained dialogue-level emotions and fine-grained token-level emotions into account. The multi-resolution emotions are the prerequisites for the response generator to perceive the user's nuanced emotion states. Furthermore, we propose an interactive adversarial learning framework to augment the user's feedback thoughtfully, where two interactive discriminators are designed to identify whether the generated responses evoke emotion perceptivity regarding both the dialogue history and the user's emotions. Extensive experiments conducted on the benchmark dataset EMPATHETICDIALOGUES verify the effectiveness of EmpDG in terms of both content quality and empathy quality. We also find that EmpDG significantly outperforms state-of-the-art baselines.
Figure 1: An empathetic dialogue example from the dataset EMPATHETICDIALOGUES. The emotion-related words are highlighted in blue. "Proud" is the coarse-grained emotional label. The emotional words are labelled by an external emotion lexicon (Mohammad and Turney, 2013).
Emotion label: Proud
Utterance 1: My sister called me this morning and told me she scored a new job with Microsoft!
Utterance 2: That's amazing! She must be excited.
Utterance 3 (Feedback): She is so excited. She's even making more than I am.
In general, our main contributions are summarized as follows:
• We propose a multi-resolution adversarial neural network, which considers multi-granularity emotion factors in the dialogue context.
• To guide response generation with users' feedback, we propose an interactive adversarial learning network with two interactive discriminators.
• Experiments show that EmpDG significantly outperforms state-of-the-art baselines in terms of both content quality and empathy quality in empathetic dialogue generation.

Related Work
Our research is in line with empathetic conversation generation through human-computer interactions, which avoids the additional step of assigning emotion to responses during conversations (Skowron et al., 2013). Several works (Lubis et al., 2018; Rashkin et al., 2018; Zhong et al., 2019; Wei et al., 2019; Chatterjee et al., 2019; Rashkin et al., 2019; Santhanam and Shaikh, 2019; Lin et al., 2019; Zhong et al., 2020; Lin et al., 2020) have attempted to make dialogue models more empathetic. Rashkin et al. (2019) combine existing models in different ways to produce empathetic responses. Lin et al. (2019) softly combine the possible emotional responses from several separate experts to generate the final empathetic response. Lin et al. (2020) fine-tune a large-scale pre-trained language model with multiple objectives. However, all existing approaches only consider mono-granular emotions in the dialogue context. In comparison, EmpDG jointly considers highly correlated coarse-grained emotional labels and fine-grained emotional terms. Moreover, we explicitly consider the effect of user feedback via a novel interactive adversarial mechanism, so that EmpDG is capable of evoking more emotion perceptivity in dialogues.
Besides the advancements in empathetic dialogue models, the emergence of new emotion-labelled dialogue corpora, e.g., DAILYDIALOG (Li et al., 2017b) and EMOTIONLINES (Hsu et al., 2018), has also contributed to this research field. However, only 5% of the utterances in DAILYDIALOG and 16.68% of the utterances in EMOTIONLINES have diverse emotional labels, and the others are labelled either "none" or "happy". Because of this extremely imbalanced data distribution, they are not suitable as benchmarks for empathetic dialogue generation. Rashkin et al. (2019) consider a richer and evenly distributed set of emotions and release the dataset EMPATHETICDIALOGUES, where a listener responds empathetically to a speaker who is in an emotional situation.
Furthermore, several emotion lexicons (Mohammad and Turney, 2013; Sedoc et al., 2020) have also been shown to be effective in tracking emotions in texts. In our work, we focus on the task of empathetic dialogue generation using the EMPATHETICDIALOGUES dataset and the emotion lexicon of Mohammad and Turney (2013).
The second line of our related work is emotional dialogue generation, which has received an increasing amount of attention for addressing emotion factors (Colombo et al., 2019). Prior studies on emotion-related conversational systems mainly focused on rule-based systems, which rely heavily on hand-crafted features (Prendinger and Ishizuka, 2005). Recently, many neural emotional dialogue generation approaches (Ghosh et al., 2017; Zhou and Wang, 2018; Colombo et al., 2019) have been explored to control the emotional expression in the target response. However, as prior work reveals, conventional emotional conversation systems aim to produce more emotion-rich responses according to a given specific user-input emotion, which inevitably leads to an emotional inconsistency problem.
Our research also aligns with recent advances in open-domain dialogue generation models (Vinyals and Le, 2015; Li et al., 2016b; Zhang et al., 2018b; Hancock et al., 2019; Li et al., 2020; Song et al., 2020). These dialogue models usually adopt the sequence-to-sequence (Seq2Seq) (Sutskever et al., 2014) fashion. Adversarial learning achieves considerable success in generating higher-quality responses (Goodfellow et al., 2014; Li et al., 2017a) but often leads to gradient vanishing as the discriminator saturates (Gulrajani et al., 2017). To tackle this problem, Gao et al. (2019) utilize Wasserstein GAN to enhance response consistency with external facts. Romanov et al. (2019) propose an adversarial decomposition method for fine-grained text representation. Unlike the previous work, we investigate an adversarial approach to improve the empathy quality of neural dialogue models.

Problem Formulation
Before detailing our method, we introduce our key notations and concepts. A multi-turn dialogue context consists of $M$ utterances between two interlocutors. We assume that both a semantic context and an emotional context exist in such a dialogue. The semantic context $U$ refers to the sequence of utterances, i.e., $U = [U_1, \ldots, U_M]$. Following Lin et al. (2019), we flatten $U$ into a long token sequence and insert a CLS token at the start of the token sequence, i.e., $U = [\mathrm{CLS}, x_1, \ldots, x_m]$. The emotional context $E$ considers emotions with different granularities, i.e., $E = [\mathrm{LAB}, w_1, \ldots, w_e]$, where $w_i$ is an emotional word in the semantic context $U$ and LAB is a special emotion token which is used to derive the emotional state of the dialogue context. We extract emotional words using an external emotional vocabulary $V_E$ (Mohammad and Turney, 2013). In the following, we use $x_0$ to denote CLS in $U$ and $w_0$ to denote LAB in $E$.
Given $U$ and $E$, our model aims to generate an $n$-length response $Y = \{y_1, \ldots, y_n\}$ by maximizing the probability $P(Y \mid U, E) = \prod_{t=1}^{n} P(y_t \mid y_{<t}, U, E)$.
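To make the two inputs concrete, the following minimal Python sketch builds a flattened semantic context and an emotional context from the first two utterances of Figure 1. The surface forms `[CLS]` and `[LAB]`, the whitespace tokenisation, and the four-word `toy_lexicon` are illustrative assumptions; the toy lexicon merely stands in for the external vocabulary V_E and is not the actual lexicon.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical surface forms for the special tokens CLS and LAB.
CLS, LAB = "[CLS]", "[LAB]"


def build_contexts(utterances: Iterable[str], lexicon: Set[str]) -> Tuple[List[str], List[str]]:
    """Flatten the dialogue into U and collect lexicon hits, in order, into E."""
    tokens: List[str] = []
    for utt in utterances:
        # Naive whitespace tokenisation with edge punctuation stripped (an assumption).
        tokens.extend(w.strip(".,!?'\"").lower() for w in utt.split())
    tokens = [t for t in tokens if t]
    semantic = [CLS] + tokens                             # U = [CLS, x_1, ..., x_m]
    emotional = [LAB] + [t for t in tokens if t in lexicon]  # E = [LAB, w_1, ..., w_e]
    return semantic, emotional


# Toy stand-in for the emotional vocabulary V_E.
toy_lexicon = {"new", "job", "amazing", "excited"}
U, E = build_contexts(
    [
        "My sister called me this morning and told me she scored a new job with Microsoft!",
        "That's amazing! She must be excited.",
    ],
    toy_lexicon,
)
print(E)
```

The emotional context keeps the order in which the words occur in the dialogue, so that its tokens can be aligned with the semantic context.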

EmpDG
In this section, we detail our proposed multi-resolution adversarial model, abbreviated as EmpDG. The overview of EmpDG is illustrated in Figure 2. There are two main components in EmpDG: the empathetic generator and the interactive discriminators.
To summarize, the empathetic generator is based on an encoder-decoder architecture, both parts of which are implemented with Transformers (Vaswani et al., 2017). During encoding, the semantic context and the multi-resolution emotional context are encoded, whereas the decoder fuses the semantic context and the emotional context to generate responses.
To enhance the empathy of the generator, we design two CNN-based (Kalchbrenner et al., 2014a) discriminators (i.e., the semantic discriminator and the emotional discriminator). During training, the two discriminators additionally interact with the user feedback (the next utterance and the corresponding emotional words of the next utterance). We optimize the discriminators by minimizing the Wasserstein-1 distance and use the sum of their classification results as a training signal to encourage the response generator to evoke more emotion perceptivity.
Figure 2: Overview of EmpDG. We divide EmpDG into two parts: (1) The empathetic generator generates empathetic responses based on semantic understanding and multi-resolution emotion perception. (2) The interactive discriminators distinguish whether the generated responses are emotion-appropriate and context-consistent.

Empathetic Generator
We start by proposing an empathetic generator to generate a response $Y$. To better model the multi-granularity emotions, we divide our encoding-decoding process into three phases: semantic understanding, multi-resolution emotion perception, and empathetic response generation.
Semantic Understanding. We first use a word embedding layer and a positional embedding layer (Vaswani et al., 2017) to convert each token of the semantic context $U$ into vector representations $e^{W}_{x_i} \in \mathbb{R}^{d}$ and $e^{P}_{x_i} \in \mathbb{R}^{d}$, where $d$ is the dimensionality of the embeddings. In the multi-turn dialogue setting, distinguishing the utterances of the speaker from those of the listener is helpful. Therefore, we incorporate a dialogue state embedding $e^{D}_{x_i}$ for each token. Our semantic context embedding $[x_0, \ldots, x_m]$ is the composition of the three types of embeddings, where $x_i$ is computed as follows:
$$x_i = e^{W}_{x_i} + e^{P}_{x_i} + e^{D}_{x_i}.$$
Then we use the local Transformer layer (Vaswani et al., 2017) to encode the semantic context information for each token $x_i$:
$$h^{l}_i = \mathrm{LayerNorm}\big(x^{l-1}_i + \mathrm{MHAtt}(x^{l-1}_i)\big), \qquad x^{l}_i = \mathrm{LayerNorm}\big(h^{l}_i + \mathrm{FFN}(h^{l}_i)\big),$$
where $x^{0}_i = x_i$; MHAtt is the multi-head self-attention sub-layer; LayerNorm is the layer normalization proposed by Ba et al. (2016); FFN is a two-layer feed-forward network with ReLU as the hidden activation function. The Transformer layers are stacked $L$ times. The obtained final context representations are denoted as $C_u = [\bar{x}_0, \ldots, \bar{x}_m]$.
Multi-resolution Emotion Perception. We use another Transformer encoder with a different set of parameters to encode the multi-resolution emotional context $E$. Each token of $E$ is embedded as
$$w_i = e^{W}_{w_i} + e^{P}_{w_i} + e^{E}_{w_i},$$
where $e^{E}_{w_i}$ is the emotion state embedding that distinguishes the emotional context from the semantic context. Then the multi-resolution emotional context is represented as $C_e = [\bar{w}_0, \ldots, \bar{w}_e]$.
To perceive the emotional information in the dialogue context, a linear layer with a softmax operation projects the concatenation of $\bar{w}_0$ and $\bar{x}_0$ into an emotion category distribution $P_e$ over the coarse-grained emotional label $e$ to identify the emotion signal that the user expressed:
$$e_p = [\bar{w}_0; \bar{x}_0]\, W_e, \qquad P_e(e \mid E) = \mathrm{softmax}(e_p),$$
where $W_e \in \mathbb{R}^{2d \times q}$, $q$ is the number of emotion categories, and $[\,;\,]$ is the concatenation operation. During training, given the ground-truth emotional label $e^{*}$ of the dialogue context, we employ the negative log-likelihood as the loss function for parameter learning: $\mathcal{L}_{emo} = -\log P_e(e^{*} \mid E)$.
In addition, the obtained intermediate emotional representation $e_p$ is fed into the decoder as a crucial emotional feature to guide the empathetic response generation. The final dialogue context representation $C$ is the concatenation of the semantic context vectors $C_u$ and the emotional context vectors $C_e$, i.e., $C = [C_u; C_e]$.
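As a small numerical illustration of the emotion classification step, the sketch below applies a softmax to a vector of emotion logits and computes the negative log-likelihood of the gold label. It uses plain Python, toy logits with q = 4 categories, and hypothetical function names; it only illustrates the loss and is not the model implementation.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def emotion_nll(logits: List[float], gold: int) -> float:
    """Negative log-likelihood of the gold emotion index under softmax(logits)."""
    return -math.log(softmax(logits)[gold])


# Toy emotion logits (playing the role of e_p) for q = 4 emotion categories.
e_p = [2.0, 0.5, -1.0, 0.0]
probs = softmax(e_p)
loss = emotion_nll(e_p, gold=0)
print(probs, loss)
```

The loss shrinks as the logit of the gold label grows relative to the others, which is what pushes the LAB and CLS representations to encode the dialogue-level emotion.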
Empathetic Response Generation. The predicted emotion signal $e_p \in \mathbb{R}^{1 \times q}$ is first transformed by a linear transformation into $e'_p \in \mathbb{R}^{1 \times d}$. Then we concatenate $e'_p$ with the embeddings of the decoder input tokens $[y_1, \ldots, y_{j-1}]$ into representations $E_Y = [y_0, \ldots, y_{j-1}]$, where $y_0 = e'_p$. We feed $E_Y$ into the response generator.
The generator is built on Transformer layers as well. In each Transformer decoder layer, the decoder inputs $E_Y$ are first updated into new vector representations $Y$. Then a multi-head cross-attention mechanism MH-CAtt (Vaswani et al., 2017) derives a context vector from the dialogue context $C$, which yields the decoder outputs $\hat{Y} = [\hat{y}_1, \ldots, \hat{y}_j]$. The response generator yields the distribution over the vocabulary for the $j$-th token: $p(y_j \mid y_{<j}, C) = \mathrm{softmax}(W_o \hat{y}_j)$.
As in most dialogue generation tasks, we use the standard maximum likelihood estimator (MLE) as the optimization objective: $\mathcal{L}_{gen} = -\log p(y_j \mid y_{<j}, C)$.
Finally, considering all the aforementioned components, we define a joint loss function as follows:
$$\mathcal{L}_g = \gamma_1 \mathcal{L}_{gen} + \gamma_2 \mathcal{L}_{emo},$$
where $\gamma_1$ and $\gamma_2$ are hyper-parameters that control the weights of the two losses. We set $\gamma_1 = \gamma_2 = 1$. All the parameters are jointly trained in an end-to-end paradigm.

Interactive Discriminators
To evaluate whether the response is generated in an empathetic and context-consistent way, we introduce two discriminators to provide additional training signals for the empathetic generator. A semantic discriminator measures the semantic distance from the generated response to the gold response. An emotional discriminator specifies whether the generated responses are empathetic enough. In particular, the utterance that follows the response can serve as the user's implicit feedback and provide semantic and emotional guidance for the target response (Zhang et al., 2018a). Therefore, we regard the next utterance after the response as user semantic feedback and the emotional words it contains as user emotional feedback. During training, we utilize the user semantic feedback and the user emotional feedback to optimize the content and the empathy ability of the empathetic generator, respectively. Both the semantic discriminator and the emotional discriminator are built on a convolutional neural network (CNN) based classifier, so we detail the semantic discriminator first for convenience.
Semantic Discriminator. First, we apply an LSTM encoder (Hochreiter and Schmidhuber, 1997) to encode the generated response and the gold response into hidden representations, i.e., $d^{N}_t$ and $d^{P}_t$, respectively. We regard $d^{N}_t$ as negative vectors and $d^{P}_t$ as positive vectors. Thereafter, a two-dimensional convolutional layer (Kalchbrenner et al., 2014b) convolves the hidden vectors $d^{*}_t$ with multiple convolutional kernels of different widths, where $* \in \{N, P\}$. Each kernel corresponds to a linguistic feature detector which extracts a specific pattern of multi-grained n-grams (Kalchbrenner et al., 2014b). A convolutional filter $W_s$ maps the hidden states in the receptive field to a single feature.
As we slide the filter across the negative or positive sequence, we obtain a sequence of new features $F^{*}$ by applying the convolution operation $\otimes$ followed by the ReLU activation function, where $W_s \in \mathbb{R}^{d \times k}$ and $b_s \in \mathbb{R}^{k}$ are learnable parameters of the convolutional filter. For each convolutional filter, the max-pooling layer takes the maximal value among the convolutional features $F^{*}$ and results in a fixed-size vector $f^{*}$. Then we obtain the semantic classification result $D_{sem}(d^{*}_t) \in \mathbb{R}$ through an interaction among $f^{*}$, the semantic feedback representation $d^{F}$, and the dialogue context vector $\bar{x}_0$ (we use the emotional context vector $\bar{w}_0$ in the emotional discriminator), parameterized by $W_d \in \mathbb{R}^{1 \times d}$ and $b_d \in \mathbb{R}$. Here $d^{F}$ is the last hidden state derived by the above-mentioned LSTM encoder. Inspired by previous work (Gao et al., 2019), we minimize the Wasserstein-1 distance between the distributions of positive samples and negative samples. Accordingly, the discriminator $D \in \mathcal{D}$ is required to be a 1-Lipschitz function, where $\mathcal{D}$ is the set of 1-Lipschitz functions (Gulrajani et al., 2017). To meet the 1-Lipschitz constraint on the discriminator $D$, we incorporate a gradient penalty on the output of $D$ with respect to its input into the discriminator objective function. The gradient penalty is evaluated at points sampled uniformly along straight lines between points sampled from the negative representations $d^{N}_t$ and the positive representations $d^{P}_t$. The loss function of the semantic discriminator, $\mathcal{L}^{sem}_d$, combines the Wasserstein-1 objective with this gradient penalty, where $\alpha \sim U[0, 1]$ is a random interpolation value and $\beta$ denotes the coefficient of the gradient penalty term.
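For orientation, the description of the semantic discriminator loss matches the general shape of the gradient-penalty critic objective of Gulrajani et al. (2017). The following LaTeX sketch writes that standard form in the notation used here; it is an assumption about the shape of the objective (the time index $t$ and the expectation over samples are omitted, and the interpolation direction is chosen arbitrarily), not necessarily the exact equation of this paper.

```latex
% Interpolated point between a negative and a positive representation
\hat{d} = \alpha\, d^{N} + (1 - \alpha)\, d^{P}, \qquad \alpha \sim U[0, 1]

% Critic loss: Wasserstein-1 term plus gradient penalty with coefficient \beta
\mathcal{L}^{sem}_{d} = D_{sem}(d^{N}) - D_{sem}(d^{P})
  + \beta \left( \left\lVert \nabla_{\hat{d}}\, D_{sem}(\hat{d}) \right\rVert_{2} - 1 \right)^{2}
```

Under this form, the discriminator is rewarded for scoring gold responses higher than generated ones, while the penalty keeps the norm of its input gradient close to 1, which is how the 1-Lipschitz constraint is softly enforced.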
Emotional Discriminator. The architecture of the emotional discriminator is the same as that of the semantic discriminator. The main difference is that the emotional discriminator operates on the emotional words of the generated response, the gold response, and the user feedback (i.e., the user emotional feedback). We use $\mathcal{L}^{emo}_d$ to denote the loss function of the emotional discriminator. The total loss $\mathcal{L}_d$ of the interactive discriminators is the sum of the semantic discriminator loss $\mathcal{L}^{sem}_d$ and the emotional discriminator loss $\mathcal{L}^{emo}_d$. Meanwhile, we add the sum of $-D_{sem}(d^{N}_t)$ and $-D_{emo}(d^{N}_t)$ to $\mathcal{L}_g$ to give the empathetic generator more empathy ability.

Training
At the beginning of training, we use maximum likelihood estimation (MLE) to pre-train the empathetic generator (Eq. 15). Since pre-trained discriminators are effective in helping to adjust the empathetic generator (Yu et al., 2017), we pre-train the interactive discriminators as well. After pre-training, the empathetic generator and the interactive discriminators are trained alternately.
The EMPATHETICDIALOGUES dataset provides 32 evenly distributed emotion labels, which act as the coarse-grained dialogue-level emotions. We use the NRC Emotion Lexicon (NRC) (Mohammad and Turney, 2013) to extract the emotional words in the dialogue context, which serve as the fine-grained token-level emotions. To bridge the vocabulary gap between the training data and NRC, all adjectives not included in NRC are extracted together with the NRC emotional words. We treat the dialogue context and the fine-grained emotional context as system inputs. The target outputs are a coarse-grained emotion label and the listener's response. For our model, we reserve the utterance following the target response as user feedback during training. Finally, we obtain 20,724 dialogues in the training set, 2,972 in the validation set, and 2,713 in the testing set.

Evaluation Methods
Automatic Evaluation. Liu et al. (2016) have verified that BLEU might be improper for evaluating conversation generation, as it correlates weakly with human judgements of response quality; METEOR (Banerjee and Lavie, 2005) and ROUGE (Lin, 2004) have the same problem. Therefore, following previous emotion-related studies (Rashkin et al., 2019; Song et al., 2019; Wei et al., 2019), we employ three evaluation metrics to automatically evaluate the performance of EmpDG: Perplexity (Serban et al., 2015) measures the high-level general quality of the generation model; Distinct-1 and Distinct-2 (Li et al., 2016a) measure the proportion of distinct unigrams and bigrams in all the generated results to indicate diversity. To evaluate the model at the emotional level, we adopt Emotion Accuracy, defined as the agreement between the ground-truth emotion labels and the emotion labels predicted by the empathetic generator.
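The automatic metrics above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: responses are already tokenised, Distinct-n is computed over the pooled n-grams of all responses, perplexity uses natural-log token probabilities, and the function names and toy inputs are hypothetical rather than taken from the evaluation scripts of the paper.

```python
import math
from typing import Sequence


def distinct_n(responses: Sequence[Sequence[str]], n: int) -> float:
    """Unique n-grams divided by total n-grams, pooled over all responses."""
    ngrams = [tuple(r[i:i + n]) for r in responses for i in range(len(r) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the average negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def emotion_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of dialogues whose predicted emotion label matches the gold label."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


toy_responses = [
    ["i", "am", "sorry", "to", "hear", "that"],
    ["i", "am", "sorry"],
]
d1 = distinct_n(toy_responses, 1)   # 6 unique unigrams out of 9
d2 = distinct_n(toy_responses, 2)   # 5 unique bigrams out of 7
ppl = perplexity([math.log(0.5)] * 4)
acc = emotion_accuracy(["proud", "sad", "hopeful"], ["proud", "afraid", "hopeful"])
print(d1, d2, ppl, acc)
```

Pooling n-grams over all responses means that a model repeating the same generic reply across dialogues receives a low Distinct score even if each individual reply contains no repetition.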
Human Evaluation. To qualitatively examine model performance from both the content and the empathy perspectives, we also conduct widely adopted human evaluations. We randomly sample 100 dialogues and their corresponding generations from our model as well as the baselines. We recruit three professional annotators from a third-party company to evaluate the responses generated by the different models. All models are evaluated in terms of the following three metrics: Empathy, Relevance, and Fluency (Rashkin et al., 2019; Lin et al., 2019). Empathy measures whether the listener's responses show an understanding of the speaker's feelings; Relevance evaluates whether the generated responses are on-topic with the dialogue context; Fluency measures the grammatical correctness and readability of the generated responses. Each metric is rated on a five-point scale, where 1, 3, and 5 indicate unacceptable, moderate, and excellent performance, respectively.

Baselines
We conducted extensive experiments to compare EmpDG against the following representative baselines:
• Transformer (Vaswani et al., 2017): A Transformer-based sequence-to-sequence model that is trained with the MLE loss.
Additionally, to better analyze the influence of different components in our model, we also conducted ablation tests as follows:
• w/o G: The EmpDG model without the multi-resolution emotion factors, where we only consider the coarse-grained dialogue-level emotions and the semantic discriminator.
• w/o D: The EmpDG model without the interactive discriminators.

Implementation Details
We implement all models using PyTorch (Paszke et al., 2017) and optimize the models using Adam (Kingma and Ba, 2015) with a mini-batch size of 16. We use pre-trained GloVe vectors (Pennington et al., 2014) to initialize the word embeddings. During the training of the empathetic generator, the learning rate is initialized to 0.0001 and we vary the learning rate following Vaswani et al. (2017). Early stopping is applied during training. During inference, we set the maximum number of decoding steps to 30. All common hyper-parameters are the same as in the work of Lin et al. (2019). During the interactive adversarial training, D-steps (for the two interactive discriminators) is set to 1 and G-steps (for the empathetic generator) is set to 5. The hyper-parameter β in the interactive discriminators is set to 0.1. Meanwhile, we employ the teacher-forcing technique from Li et al. (2017a) to increase adversarial training efficiency.
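The alternating schedule with D-steps = 1 and G-steps = 5 can be pictured with the following Python sketch. The function `run_adversarial_training`, its callback arguments, and the choice to update the discriminators before the generator within each round are illustrative assumptions; the callbacks stand in for the real update steps, which are not shown.

```python
from typing import Callable, List


def run_adversarial_training(
    train_discriminators: Callable[[], None],
    train_generator: Callable[[], None],
    rounds: int,
    d_steps: int = 1,
    g_steps: int = 5,
) -> None:
    """Alternate discriminator and generator updates for a number of rounds."""
    for _ in range(rounds):
        for _ in range(d_steps):
            train_discriminators()  # one update of both interactive discriminators
        for _ in range(g_steps):
            train_generator()       # one update of the empathetic generator


# Record the order of updates with stub callbacks instead of real training steps.
log: List[str] = []
run_adversarial_training(lambda: log.append("D"), lambda: log.append("G"), rounds=2)
print("".join(log))
```

With these settings the generator receives five updates for every discriminator update, so the discriminators change slowly relative to the generator they are guiding.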

Performance Comparisons
Automatic Evaluation Results. As shown in Table 1, EmpDG achieves the highest scores on most metrics compared with the baselines. The noticeable improvement indicates the effectiveness of our multi-resolution adversarial empathetic dialogue generation model in empathetic expression and response diversity. Although the perplexity score of EmpDG is slightly worse than that of EmoPrepend-1 due to the introduction of the interactive discriminators, the other scores of EmpDG are clearly better than those of EmoPrepend-1. EmoPrepend-1 and MoEL have similar performance, as both of them only learn the coarse-grained dialogue-level emotional label to infer the emotional situation and generate responses. Without emotion modelling, Transformer only generates fluent responses based on semantic mapping but fails to produce diverse responses.
We also perform an ablation study to better understand the contributions of the main parts of our model. As shown in Table 2, after we remove the multi-resolution empathetic generator (i.e., w/o G), both the emotion accuracy and the distinct metrics become clearly worse, demonstrating the effectiveness of the multi-resolution emotion factors in emotional understanding and generation quality. We also investigate the effect of removing the interactive discriminators (i.e., w/o D). We notice that the scores of the distinct metrics decrease dramatically. This makes sense because user feedback interacts with the generated responses to facilitate the optimization of the empathetic generator. Therefore, applying the interactive discriminators improves the appropriateness of emotional expressions.
Human Evaluation Results. Table 1 illustrates that EmpDG obtains the best performance on both Empathy and Relevance scores. This suggests that the multi-resolution adversarial model helps capture implicit emotions, improve topic consistency, and evoke more emotion perceptivity. As the responses generated by Transformer are already fluent and grammatical, we observe no obvious difference among the models in terms of Fluency. Additionally, we carried out a pairwise response comparison to directly compare the dialogue quality gains, as reported in Table 3. The results confirm that the responses from EmpDG are preferred by human judges.

Analysis of Emotion Interactions
To gain insight into how well the emotion is expressed in the generated responses, we present two examples in Table 4 that illustrate the cross-attention weights (between the encoder and the decoder of the empathetic generator) over the dialogue context. In the first visualization case, by giving additional attention to the multi-resolution emotional information, EmpDG generates a response that is much more informative and empathetic, as shown by words such as "good" and "incident". In the second visualization case, although MoEL puts its major attention on the emotional words, i.e., "scary", the content of its generated response is inappropriate and is not enough to evoke much empathy. EmpDG generates a more contentful response with better emotional expressions, e.g., "sorry" and "what happened", which are also relevant to the user feedback. We can conclude that considering user feedback in adversarial training can teach the model to generate more empathetic responses. Overall, these cases support, to some extent, the effectiveness of the multi-resolution mechanism and the interactive discriminators.

Case Study
Table 5 shows two examples generated by EmpDG and the baseline models. In the first case, EmpDG generates a coherent and informative response with a proper "hopeful" emotion by replying with the words "sorry", "hope", "gets", and "better", whereas the baselines fail to understand the negative emotions or to express appropriate content. In the second case, EmpDG generates the most context-consistent response.
Table 5: Generated responses from Transformer, EmoPrepend-1, MoEL, and our EmpDG in two different speaker's emotional states. S Cont indicates the dialogue semantic context and E Cont indicates the emotional context. X i represents the i-th utterance in the semantic context. Tokens in underline represent emotion-related words.
Case 1
Emo.: Hopeful
S Cont.: X1: My husband lost a job but I am hoping he can find a full time job soon.
E Cont.: lost, job, hoping, find, full, time.
Gold: He will, I have faith.
Transformer: That is great! I am sure he will be fine.
EmoPrepend-1: I am sorry to hear that. I hope he gets it.
MoEL: That is a good news. I hope you get it for him.
EmpDG: I am sorry to hear that. I hope he gets better.
Case 2
S Cont.: X1: I joined a firm 6 months ago and then I got a promotion for junior manager. X2: Congratulations. That sounds like the fast track. X3: Yes, they thought that I have some extraordinary talent and that is the reason I got that position.
Gold: Cool. Maybe next year you will be a partner!
Transformer: I am sorry to hear that.
EmoPrepend-1: That is great! i am glad you got it!
MoEL: That is great, I bet it is hard to do that!
EmpDG: Wow, that is amazing. I am sure they do great.

Conclusion
In this paper, we propose a multi-resolution interactive empathetic dialogue model (EmpDG) to evoke more emotion perceptivity in dialogue generation. Two components are proposed to improve the performance of empathetic response generation. A multi-resolution empathetic generator combines coarse-grained dialogue-level and fine-grained token-level emotions to capture the nuanced emotional expressions in dialogue and evoke more emotion perceptivity. Two interactive discriminators utilize user feedback as additional context to interact with the generated response and the dialogue context, optimizing the long-term goal of empathetic conversation generation. Automatic and manual evaluations have shown that EmpDG can generate responses that are appropriate not only in content but also in empathy. There are several future directions for this setting. First, one potential extension of EmpDG would be to incorporate external knowledge (e.g., user profiles or commonsense knowledge) to help emotion perceptivity. Second, in our setting, the emotional feedback and the semantic feedback are separated. It is promising to model the interaction between semantic and emotional feedback. We leave these directions as future work.