EmpGAN: Multi-resolution Interactive Empathetic Dialogue Generation

Conventional emotional dialogue systems focus on generating emotion-rich replies. Studies on emotional intelligence suggest that constructing a more empathetic dialogue system, which is sensitive to the user's expressed emotion, is a crucial step towards more humanized human-machine conversation. However, two obstacles to establishing such an empathetic conversational system remain far from solved: 1) Considering only sentence-level emotions while neglecting the more precise token-level emotions may lead to insufficient emotion perceptivity. 2) Relying only on the dialogue history while overlooking the potential of user feedback for the generated responses further aggravates this deficiency. To address these challenges, we propose EmpGAN, a multi-resolution adversarial empathetic dialogue generation model that generates more appropriate and empathetic responses. To capture the nuances of user feelings sufficiently, EmpGAN generates responses by jointly taking both coarse-grained sentence-level and fine-grained token-level emotions into account. Moreover, an interactive adversarial learning framework is introduced to identify whether the generated responses evoke emotion perceptivity in dialogues with respect to both the dialogue history and user feedback. Experiments show that our model outperforms the state-of-the-art baseline by a significant margin in terms of both content quality and emotion perceptivity. In particular, the distinctiveness on the DailyDialog dataset increases by up to 129%.


Introduction
As a vital part of human intelligence, emotional perceptivity plays an elemental role in various social communication scenarios, e.g., education (Kort, Reilly, and Picard 2001) and healthcare systems (Taylor et al. 2017). Recently, emotional conversation generation has received an increasing amount of attention as a way to address emotion factors in an end-to-end framework (Zhou and Wang 2018; Colombo et al. 2019). However, Li and Sun (2018) revealed that conventional emotional conversation systems aim to produce more emotion-rich responses according to a specific user-input emotion, which inevitably leads to an emotional inconsistency problem.
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Studies on social psychology suggest that empathy is a crucial step towards more humanized human-machine conversation, as it improves the emotional perceptivity in emotion-bonding social activities (Zech and Rimé 2005). To design an intelligent automatic dialogue system, it is important to make the chatbot empathetic within dialogues (Prendinger and Ishizuka 2005). Therefore, in this paper, we focus on the task of empathetic dialogue generation (Lubis et al. 2018), which automatically tracks and understands the user's emotion at each turn in multi-turn dialogue scenarios.
Despite the successes achieved (Lubis et al. 2018; Rashkin et al. 2019), obstacles to establishing an empathetic conversational system remain far from solved:
• Considering only sentence-level emotions while neglecting the more precise token-level emotions may lead to insufficient emotion perceptivity. It is difficult to capture the nuances of human emotion accurately without modelling multi-granularity emotion factors in dialogue generation.
• Relying only on the dialogue history while overlooking the potential of user feedback for the generated responses further aggravates this deficiency, which causes undesirable responses.
In this paper, we propose a multi-resolution adversarial empathetic dialogue generation model, named EmpGAN, to address the above challenges by generating more appropriate and empathetic responses. To capture the nuances of user feelings sufficiently, EmpGAN generates responses by taking both coarse-grained sentence-level and fine-grained token-level emotions into account. The response generator in EmpGAN dynamically tracks the emotions along a conversation to perceive the user's emotional states in multi-turn conversations. Furthermore, an interactive adversarial learning framework is added to take the user feedback into account, where two interactive discriminators identify whether the generated responses evoke emotion perceptivity with respect to both the dialogue history and the user emotions.
In particular, EmpGAN contains an empathetic generator and two interactive inverse discriminators. The empathetic generator is composed of three components: (1) A semantic understanding module based on Seq2Seq neural networks, which maintains the multi-turn semantic context. (2) A multi-resolution emotion perception model, which captures the fine- and coarse-grained emotion factors of each dialogue turn to build the emotional context. (3) An empathetic response decoder, which combines the semantic and emotional context to produce responses that are appropriate in terms of both the context and the emotion. The two interactive inverse discriminators additionally incorporate the user feedback and the corresponding emotional feedback as inverse supervised signals to induce the generator to produce more empathetic responses.
Our main contributions are summarized as follows:
• We propose a multi-resolution neural network. It considers multi-granularity (sentence-level and token-level) emotion factors in perceiving the contextual emotion flow.
• We propose an interactive adversarial network, including two interactive discriminators that give semantic and emotional training signals, respectively. The user feedback serves as an effective posterior signal to guide the generator's behaviour. To the best of our knowledge, this is the first time an adversarial network has been introduced into an empathetic conversational system.
• Experiments show that EmpGAN improves the performance of empathetic dialogue in terms of both content quality and empathy quality by a large margin, compared to the state-of-the-art baseline models. In particular, the distinctiveness on the DailyDialog dataset increases by up to 129%.

Related work
In the past few years, the Sequence-to-Sequence (Seq2Seq) network (Sutskever, Vinyals, and Le 2014) has attracted much attention, especially in the field of neural conversational systems (Serban et al. 2016; Yao et al. 2017). Many techniques have been proposed to improve the content quality of dialogue (Li et al. 2016; Zhao, Zhao, and Eskenazi 2017). Adversarial learning has enjoyed considerable success in generating higher-quality responses (Goodfellow et al. 2014; Li et al. 2017a) but often leads to vanishing gradients as the discriminator saturates (Gulrajani et al. 2017).
To tackle the problem of model collapse in adversarial methods, Wasserstein GAN (Arjovsky, Chintala, and Bottou 2017) was proposed and applied to response generation (Gao et al. 2019). However, few works have investigated how to improve the emotion quality of neural dialogue models.
Prior studies on emotion-related conversational systems mainly focused on rule-based systems, which heavily rely on hand-crafted features (Prendinger and Ishizuka 2005). Ghosh et al. (2017), Zhou and Wang (2018), and Colombo et al. (2019) addressed the emotion factors in neural emotional dialogue generation. However, these models mainly aim to control the emotional content of a response, either through a manually specified target or through a general term that encourages higher levels of emotion. Recently, there has been increasing interest in empathetic responding in human-computer interaction, which avoids the additional step of determining which emotion to respond with during conversations (Skowron et al. 2013). Rashkin et al. (2019) released an empathetic dialogue dataset where each dialogue is grounded in a specific situation and a given emotion. However, the lack of explicit emotion information for each utterance makes it difficult to model a more dynamic emotional context that mimics human conversation.
The temporal dynamics of emotions is an essential property of human interactions (Koval and Kuppens 2012). Lubis et al. (2018) first captured users' emotional states for affect-sensitive response generation with a positive emotion elicitation strategy corpus. Zhong, Wang, and Miao (2019) incorporated VAD values (Warriner, Kuperman, and Brysbaert 2013) to embed each word and encourage the generation of affect-rich words in responses. Both considered a single emotion granularity in input sentences. In comparison, our model jointly considers highly correlated coarse-grained emotional labels and fine-grained emotional words. Moreover, the candidate utterances generated by a robot usually contain implicit feedback, which can be used to optimize conversation generation. In this work, we explicitly consider the effect of user emotional feedback via a novel interactive adversarial mechanism to make the empathetic response generator more engaging and evoke more emotion perceptivity in dialogues.

Problem formulation
Given a sequence of dialogue turns $D = \{U_1, \dots, U_M\}$ as the semantic context, where $U_m = \{x_{m,1}, \dots, x_{m,L_m}\}$ is the m-th dialogue turn; the corresponding sequences of emotional words $E = \{E_1, \dots, E_M\}$, where $E_m = \{w_{m,1}, \dots, w_{m,N_m}\}$ is the m-th emotional word sequence and $w_{m,n}$ is an emotional word; and the emotional label sequence $Lab = \{e_1, \dots, e_M\}$, where $e_m$ corresponds to the m-th dialogue turn, the goal of the empathetic generator is to generate a response $U_{M+1} = Y = \{y_1, y_2, \dots, y_T\}$ that is sensitive to the user's expressed emotion. The emotional words are extracted using an emotion vocabulary $V_E$. The emotion categories are {Anger, Disgust, Fear, Happiness, Sadness, Surprise, Neutral}. The empathetic generator optimizes its parameters to maximize the probability $P(Y \mid U, E, Lab) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, U, E, Lab)$, where $Y$ is the ground-truth response.
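To make the three inputs concrete, the following minimal sketch builds $D$, $E$ and $Lab$ from a short annotated dialogue. It assumes whitespace tokenization and a tiny hand-written emotion vocabulary; `EMOTION_VOCAB`, `extract_emotional_words` and `build_inputs` are hypothetical names introduced only for illustration and are not part of the described system.

```python
from typing import List, Tuple

# Hypothetical toy emotion vocabulary V_E (an assumption for illustration).
EMOTION_VOCAB = {"happy", "glad", "sad", "afraid", "angry", "terrible", "great"}

# The seven emotion categories listed in the problem formulation.
EMOTION_LABELS = ["Anger", "Disgust", "Fear", "Happiness", "Sadness", "Surprise", "Neutral"]


def extract_emotional_words(turn: List[str], vocab=EMOTION_VOCAB) -> List[str]:
    """Return E_m: the tokens of turn U_m that appear in V_E, in order."""
    return [tok for tok in turn if tok.lower() in vocab]


def build_inputs(dialogue: List[Tuple[str, str]]):
    """Split (utterance, label) pairs into D, E and Lab."""
    D = [utt.split() for utt, _ in dialogue]             # semantic context
    E = [extract_emotional_words(turn) for turn in D]    # fine-grained emotions
    Lab = [label for _, label in dialogue]               # coarse-grained emotions
    for label in Lab:
        if label not in EMOTION_LABELS:
            raise ValueError(f"unknown emotion label: {label}")
    return D, E, Lab


dialogue = [
    ("I failed my exam and I feel terrible", "Sadness"),
    ("Do not be sad , you can try again", "Neutral"),
]
D, E, Lab = build_inputs(dialogue)
print(E)    # [['terrible'], ['sad']]
print(Lab)  # ['Sadness', 'Neutral']
```

Note that a turn may contain no emotional word at all, in which case the corresponding $E_m$ is empty.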
For the generated response $Y = \{y_1, y_2, \dots, y_T\}$, the gold response $U_G = U_{M+2}$, and the emotional words in $Y$ and $U_G$, the goal of the semantic discriminator and the emotional discriminator is to distinguish whether the generated response converses with empathy, given the semantic context, the emotional context and the user's feedback.

EmpGAN model
In this section, we present our proposed model for producing empathetic responses, a task that falls outside the capability of existing adversarial learning for dialogue generation (Li et al. 2017a). The overall architecture is illustrated in Figure 1.
A hierarchical encoder (Serban et al. 2016) encodes the dialogue context into a semantic context vector. Meanwhile, another hierarchical encoder models the fine-grained emotional words into an emotional words vector, and an RNN-based label encoder summarizes a coarse-grained emotional label vector. Finally, the decoder fuses the semantic context and the multi-granularity emotional context to generate empathetic responses. However, existing approaches easily generate an emotional response but rarely consider the potential feedback it may receive. To enhance the empathy of the generator, we use two CNN-based discriminators (a semantic discriminator and an emotional discriminator), which additionally interact with the user feedback (users' responses in this work). We optimize the discriminators by minimizing the Wasserstein-1 distance and use the sum of their classification results as a training signal to encourage the response generator to evoke more emotion perceptivity.
Figure 1: Overview of EmpGAN. We divide it into two parts: (1) In the empathetic generator, we obtain the semantic context using the semantic understanding module and the emotional context using multi-resolution emotion perception. The empathetic response decoder then generates empathetic responses according to the semantic context and the emotional context. (2) Two interactive discriminators gauge whether the generated responses converse with empathy based on the user's feedback, and we use the results of the discriminators as additional training signals.

Empathetic generator
Semantic understanding. We denote $e(x)$ as the embedding representation of word $x$. The encoder, implemented with gated recurrent units (GRU) (Chung et al. 2014), converts each dialogue turn $U_m$ into the vector representation $h^{utt}$. Then, $h^{utt}$ is fed into a semantic context encoder to get the dialogue context vectors $h^{dlg}_{utt}$.
Multi-resolution emotion perception. To better model the multi-granularity emotions, we perform the semantic modelling and the emotional modelling in different embedding spaces by utilizing a generic vocabulary and an emotion vocabulary, respectively. A separately parameterized hierarchical encoder is utilized to model the emotional word sequences. We first encode the emotional word sequence $E_m$ of each utterance into an emotion vector $h^{emo}$ and then feed it into a higher-level encoder to model the fine-grained emotional context $h^{dlg}_{emo}$, which serves as the initial state of the emotion label encoder that models the coarse-grained emotional context.
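The two-level encoding described above can be sketched in plain Python as follows. This is an illustrative assumption rather than the actual implementation: the GRU cell omits biases, uses the update convention $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot n_t$, and takes the last hidden state of each turn as its utterance vector; `GRUCell` and `hierarchical_encode` are hypothetical names.

```python
import math
import random


def make_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class GRUCell:
    """Minimal GRU cell without biases (illustrative only)."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        # One input matrix W and one recurrent matrix U per gate (z, r, n).
        self.W = {g: make_matrix(hidden_size, input_size, rng) for g in "zrn"}
        self.U = {g: make_matrix(hidden_size, hidden_size, rng) for g in "zrn"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.W["n"], x), matvec(self.U["n"], rh))]
        return [(1 - zi) * hi + zi * ni for zi, hi, ni in zip(z, h, n)]

    def encode(self, inputs):
        """Run over a sequence and return the hidden state at every step."""
        h = [0.0] * self.hidden_size
        states = []
        for x in inputs:
            h = self.step(x, h)
            states.append(h)
        return states


def hierarchical_encode(dialogue, embed, utt_enc, ctx_enc):
    # Lower level: the last hidden state of each (non-empty) turn is its h^utt.
    h_utt = [utt_enc.encode([embed[tok] for tok in turn])[-1] for turn in dialogue]
    # Upper level: one dialogue context vector per turn.
    return ctx_enc.encode(h_utt)


rng = random.Random(0)
vocab = ["i", "feel", "sad", "cheer", "up"]
embed = {tok: [rng.uniform(-1, 1) for _ in range(4)] for tok in vocab}
utt_enc = GRUCell(input_size=4, hidden_size=3, rng=rng)
ctx_enc = GRUCell(input_size=3, hidden_size=3, rng=rng)
h_dlg_utt = hierarchical_encode([["i", "feel", "sad"], ["cheer", "up"]], embed, utt_enc, ctx_enc)
print(len(h_dlg_utt), len(h_dlg_utt[0]))  # 2 3
```

The emotional word sequences would be encoded in the same way with a second, separately parameterized pair of encoders and an embedding table over the emotion vocabulary.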
For each emotion label $e_m$, we randomly initialize the emotion label embedding $v_m$, which is used to compute the coarse-grained emotion vector.
Empathetic response decoder. The decoder takes as input the previous decoder state $d_{t-1}$ and the embedding of the previously decoded word $e(y_{t-1})$ to update its state $d_t$.
Empathetic attention. At each time step, the decoder hidden state $d_t$ attends to $g_i = [h^{dlg}_{utt,i}; h^{dlg}_{emo,i}; h^{dlg}_{lab,i}]$, the concatenation of the semantic context vector $h^{dlg}_{utt,i}$, the emotional words vector $h^{dlg}_{emo,i}$ and the emotional label vector $h^{dlg}_{lab,i}$, where $[;]$ denotes vector concatenation. After the t-th decoding step, $g_t$ summarizes the semantic and emotional context to guide the empathetic response generation.
Response generation. The context vector $g_t$, together with the current decoder hidden state $d_t$, is then fed into a linear transformation layer to generate the token distribution $P_v$ over the generic vocabulary.
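The attention and output steps can be illustrated with the sketch below. Two points are assumptions and not taken from the description above: the attention score is taken to be bilinear, $d_t^\top W_{att}^\top g_i$, and a softmax is applied after the linear output layer. The names `empathetic_attention`, `token_distribution`, `W_att` and `W_out` are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def empathetic_attention(d_t, sem_ctx, emo_ctx, lab_ctx, W_att):
    """Attend over g_i = [h_utt_i; h_emo_i; h_lab_i] and return (g_t, weights)."""
    # List concatenation plays the role of [;] for each turn i.
    g = [sem + emo + lab for sem, emo, lab in zip(sem_ctx, emo_ctx, lab_ctx)]
    # Assumed bilinear score: project d_t into the space of g_i, then dot product.
    proj = [dot(row, d_t) for row in W_att]
    weights = softmax([dot(proj, g_i) for g_i in g])
    dim = len(g[0])
    g_t = [sum(w * g_i[k] for w, g_i in zip(weights, g)) for k in range(dim)]
    return g_t, weights


def token_distribution(g_t, d_t, W_out):
    """Linear layer over [g_t; d_t] followed by an (assumed) softmax, giving P_v."""
    features = g_t + d_t
    return softmax([dot(row, features) for row in W_out])


# Two dialogue turns, each context vector of size 1, decoder state of size 2.
sem_ctx = [[0.2], [0.9]]
emo_ctx = [[0.1], [0.8]]
lab_ctx = [[0.0], [1.0]]
d_t = [0.5, -0.3]
W_att = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]            # shape: len(g_i) x len(d_t)
W_out = [[0.1] * 5, [0.2] * 5, [-0.1] * 5, [0.0] * 5]   # toy vocabulary of 4 tokens

g_t, weights = empathetic_attention(d_t, sem_ctx, emo_ctx, lab_ctx, W_att)
P_v = token_distribution(g_t, d_t, W_out)
print(len(g_t), len(P_v))  # 3 4
```

Because $g_i$ concatenates the three context vectors, a single set of attention weights jointly selects the semantic and emotional information of a turn.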
Next utterance emotional words prediction. To ensure that the generated response is capable of perceiving the empathy of the future utterance, the emotional words of the next utterance $E_n$ are also generated through another decoder. $h^{dlg}_{emo}$ is fed into an additional RNN to calculate the cross-entropy loss of the emotion words: $\Psi_{emo} = -\sum_{t=1}^{T} p^e_t \log(o^e_t)$, where $p^e_t$ and $o^e_t$ are the gold distribution and the predicted distribution over the emotion vocabulary, respectively, and $T$ is the last step.
Additionally, to enable the emotional label encoder to encode the emotions in the given context accurately, the emotional label vector is employed to predict the current utterance $U_m$. The prediction-by-label loss is defined as a cross-entropy loss: $\Psi_{lab} = -p_{lab} \log(o_{lab})$.
Finally, for each dialogue sample, the training loss of the empathetic generator is defined as the sum of the response generation loss (the cross-entropy error between the predicted token distribution $o_t$ and the gold distribution $p_t$), the emotion words loss, and the prediction-by-label loss: $loss_g = -\sum_{t=1}^{T} p_t \log(o_t) + \Psi_{emo} + \Psi_{lab}$.
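As a numerical illustration of how the three cross-entropy terms add up, the sketch below assumes an unweighted sum, as stated in the text, and uses toy distributions; `cross_entropy`, `sequence_cross_entropy` and `generator_loss` are hypothetical helper names.

```python
import math


def cross_entropy(gold, pred, eps=1e-12):
    """-sum_k gold[k] * log(pred[k]) for one pair of distributions."""
    return -sum(p * math.log(max(o, eps)) for p, o in zip(gold, pred))


def sequence_cross_entropy(gold_seq, pred_seq):
    """Sum of per-step cross-entropy terms over a sequence."""
    return sum(cross_entropy(p, o) for p, o in zip(gold_seq, pred_seq))


def generator_loss(resp_gold, resp_pred, emo_gold, emo_pred, lab_gold, lab_pred):
    gen = sequence_cross_entropy(resp_gold, resp_pred)  # response generation loss
    emo = sequence_cross_entropy(emo_gold, emo_pred)    # emotion words loss
    lab = cross_entropy(lab_gold, lab_pred)             # prediction-by-label loss
    return gen + emo + lab


# Toy case: one decoding step per sequence, one-hot gold, uniform prediction.
one_hot, uniform = [1.0, 0.0], [0.5, 0.5]
loss = generator_loss([one_hot], [uniform], [one_hot], [uniform], one_hot, uniform)
print(round(loss, 4))  # 2.0794, i.e. 3 * ln(2)
```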

Two interactive discriminators
Although the empathetic constraint defined by the cross-entropy loss $loss_g$ induces the generated response and emotional words to be similar to the ground truth, we still do not know whether the empathetic generator perceives the emotions. Inducing the generated response to be more empathetic is equally elusive. We therefore introduce two discriminators to evaluate whether the response is generated in an empathetic way and elicits more positive emotions. A semantic discriminator measures the semantic distance between the generated response and the ground-truth response, similar to Li et al. (2017a), and an emotional discriminator specifies whether the generated responses are empathetic enough and emotionally positive. The semantic and emotional discriminators are built upon similar structures, so we detail the semantic discriminator first.
Figure 2: Semantic discriminator. The interactive discriminator gives a classification result through interaction with the whole context, negative samples and gold samples.
The semantic discriminator is depicted in Figure 2. We use an LSTM (Hochreiter and Schmidhuber 1997) to convert the generated response, the gold response and the user's feedback into hidden representations, denoted $h^F$, $h^T$ and $h^N$, where $h^{FN}$ represents the negative sample and $h^{TN}$ represents the gold sample. Then a two-dimensional convolutional layer convolves the hidden state $h^*_t$ with multiple convolutional kernels of different widths. Each kernel corresponds to a linguistic feature detector that extracts a specific pattern of multi-grained n-grams (Kalchbrenner, Grefenstette, and Blunsom 2014). A convolutional filter $W_c$ maps $h^*_t$ in the receptive field to a single feature. As we slide the filter across the whole sentence, a sequence of new features $f^* = [f_1, f_2, \dots, f_n]$ is obtained: $f^*_t = \mathrm{relu}(h^*_t \otimes W_c + b_c)$, where $\otimes$ denotes the convolution operation.
For each convolutional filter, the max-pooling layer takes the maximal value among the convolutional features $f^*$, resulting in a fixed-size vector $F^*$. Then we obtain the semantic classification result $D_{sem}(h^*_t)$ through an interaction between $F^*$ and $h^{dlg}$.
The emotional discriminator gauges whether dialogue agents can converse with empathy and whether the expression of emotion can elicit more positive emotion. We use an architecture similar to that of the semantic discriminator. First, we encode the emotion words in the generated response, the gold response and the next reply into vector representations $d^F_t$, $d^T_t$ and $d^N_t$. Then, we concatenate $d^F_t$ with $d^N_t$ and $d^T_t$ with $d^N_t$ to get $d^{FN}_t$ and $d^{TN}_t$, respectively. Finally, $h^{dlg}$ and $d^*$ ($d^{FN}$ or $d^{TN}$) are fed into the emotional discriminator to obtain the empathy-related classification result $D_{emo}(d^*_t)$.
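The convolution and max-pooling steps shared by both discriminators can be sketched as follows. The sketch assumes that each kernel spans the full hidden dimension and slides only along the time axis, and it omits the subsequent interaction with $h^{dlg}$; `conv_features`, `max_pool` and `sentence_vector` are hypothetical names.

```python
def relu(x):
    return max(0.0, x)


def conv_features(hidden_states, kernel, bias):
    """Slide a kernel of shape (width, hidden_size) along the time axis."""
    width = len(kernel)
    feats = []
    for start in range(len(hidden_states) - width + 1):
        window = hidden_states[start:start + width]
        total = sum(k * h
                    for k_row, h_row in zip(kernel, window)
                    for k, h in zip(k_row, h_row))
        feats.append(relu(total + bias))
    return feats


def max_pool(features):
    # Assumes the sequence is at least as long as the kernel width.
    return max(features)


def sentence_vector(hidden_states, kernels, biases):
    """One pooled feature per kernel, giving the fixed-size vector F."""
    return [max_pool(conv_features(hidden_states, k, b)) for k, b in zip(kernels, biases)]


# Three time steps with hidden size 2; one kernel of width 1 and one of width 2.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kernels = [[[1.0, -1.0]], [[1.0, 0.0], [0.0, 1.0]]]
biases = [0.0, -0.5]
print(sentence_vector(hidden_states, kernels, biases))  # [1.0, 1.5]
```

Kernels of different widths yield feature sequences of different lengths, and max-pooling reduces each of them to a single value, so the length of $F^*$ depends only on the number of kernels.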
A potential problem of SeqGAN (Yu et al. 2017) is that the reward returned from the discriminator can be very sparse and unstable, which may lead the generator to produce unintended and nonsensical replies (Gulrajani et al. 2017). Inspired by previous work (Arjovsky, Chintala, and Bottou 2017; Gulrajani et al. 2017), we minimize the Wasserstein-1 distance $W(P_F, P_T)$ to alleviate model collapse. In detail, we minimize the cost of transporting mass from the gold response distribution $P_{TN}$ to the generated response distribution $P_{FN}$.
The discriminator $D$ is required to be a 1-Lipschitz function. To meet the 1-Lipschitz constraint of the interactive discriminators, we add a gradient penalty to the objective function of the two discriminators. We compute the gradient penalty at points sampled uniformly along a straight line between the gold distribution $P_{TN}$ and the negative distribution $P_{FN}$. The objective function combines the Wasserstein critic loss with this gradient penalty ($x$ represents $h$ in the semantic discriminator or $d$ in the emotional discriminator), where $\alpha \sim U[0, 1]$ is a random number and $\sigma$ denotes the coefficient of the gradient penalty term. Meanwhile, we add $-D(x^{FN}_t)$ to the loss $loss_g$ of the empathetic generator to strengthen the interactive adversarial training process.
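For reference, an objective consistent with this description can be written as below. It is a sketch that assumes the standard gradient-penalty formulation of Gulrajani et al. (2017); the symbols $\mathcal{L}_D$, $\mathcal{L}_G$ and $\hat{x}_t$ are names introduced here for illustration.

```latex
\begin{aligned}
\hat{x}_t &= \alpha\, x^{TN}_t + (1-\alpha)\, x^{FN}_t, \qquad \alpha \sim U[0,1],\\
\mathcal{L}_D &= \mathbb{E}\big[D(x^{FN}_t)\big] - \mathbb{E}\big[D(x^{TN}_t)\big]
  + \sigma\, \mathbb{E}\Big[\big(\lVert \nabla_{\hat{x}_t} D(\hat{x}_t) \rVert_2 - 1\big)^2\Big],\\
\mathcal{L}_G &= loss_g - D(x^{FN}_t).
\end{aligned}
```

Under this form, the first two terms of $\mathcal{L}_D$ estimate the Wasserstein-1 distance between the negative and gold distributions, the third term pushes the gradient norm of $D$ towards 1 at the interpolated points, and $\mathcal{L}_G$ restates the generator update described in the text.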

Experiment
We seek to answer the following research questions in our experiments: RQ1: What is the overall performance of EmpGAN? Does it outperform state-of-the-art baselines? RQ2: What is the effect of each module in EmpGAN? Is the multi-resolution emotional mechanism helpful for perceiving the emotional contextual flow? Do the interactive discriminators give a useful training signal to the empathetic generator? RQ3: Can EmpGAN respond with empathy and logic at inference time?

Dataset
We evaluate EmpGAN on the multi-turn dialogue dataset DailyDialog (Li et al. 2017b). To better encourage the empathy ability of the proposed model, we use the subset with dynamic emotional label changes in the dialogue process, where each dialogue turn was manually annotated with one of the seven emotion categories: {Anger, Disgust, Fear, Happiness, Sadness, Surprise, Neutral}. Additionally, we use the NRC Emotion Lexicons (Mohammad and Turney 2013) to extract the emotional words in each utterance and construct our emotion vocabulary. To bridge the vocabulary gap between the training data and NRC, all adjectives not included in NRC are extracted together with the NRC emotion words. Therefore, each dialogue turn has a corresponding fine-grained emotional word sequence and a coarse-grained emotional label. We then partitioned the dataset into training, validation, and test sets with a ratio of 10:1:1. In total, our dataset contains 11102 dialogues covering 6385 emotional words.
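A 10:1:1 partition can be produced as in the sketch below. The shuffling, the seed and the rounding rule are assumptions, since the text does not specify them; `split_dataset` is a hypothetical helper.

```python
import random


def split_dataset(dialogues, ratios=(10, 1, 1), seed=0):
    """Shuffle and split into train/validation/test according to integer ratios."""
    items = list(dialogues)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_valid = len(items) * ratios[1] // total
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]  # the remainder goes to the test set
    return train, valid, test


train, valid, test = split_dataset(range(24))
print(len(train), len(valid), len(test))  # 20 2 2
```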
To qualitatively examine model performance from both the content and the empathy perspectives, we conduct a widely adopted human evaluation. We randomly sample 100 dialogue samples from the test set. For each sample, we show human annotators the semantic context and the emotional context (emotional words and emotional labels), and then randomize the order of the responses generated by each compared model. Annotators were asked to score a response in terms of Content (rating scale 0, 1, 2) and Empathy (rating scale 0, 1, 2). Content is defined as whether the response is fluent and responsive to the dialogue context. Empathy is defined as whether the emotional expression of a response agrees with the emotional context and has the power to evoke more emotion perceptivity.
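Ratings on the 0/1/2 scale can be aggregated per criterion as sketched below. The ratings are made-up toy values, and the use of the population standard deviation is an assumption; `summarize` is a hypothetical helper.

```python
from statistics import mean, pstdev


def summarize(scores):
    """Mean and population standard deviation of ratings on the 0/1/2 scale."""
    for s in scores:
        if s not in (0, 1, 2):
            raise ValueError(f"rating out of scale: {s}")
    return mean(scores), pstdev(scores)


# Toy ratings for one model (not taken from the paper).
ratings = {"Content": [2, 1, 2, 0], "Empathy": [1, 1, 2, 2]}
for criterion, scores in ratings.items():
    avg, std = summarize(scores)
    print(f"{criterion}: mean={avg:.2f}, std={std:.2f}")
```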

Baselines
To demonstrate the effectiveness of the proposed framework, we compare our model with the following baselines:
• HRED (Serban et al. 2016): A hierarchical recurrent neural network for response generation.
• SeqGAN: An adversarial training approach proposed by Li et al. (2017a) for dialogue generation. The outputs from the discriminator are used as rewards for the generative model.
To demonstrate the effectiveness of each module in EmpGAN, we also construct several ablation models for multi-resolution emotion perception and the interactive discriminators.

Implementation details
We use PyTorch to implement our experiments. We randomly initialize the network parameters at the beginning of our experiments. The RNN hidden size, word embedding size, emotional word embedding size and emotional label embedding size are set to 400, 300, 200 and 100, respectively. The sizes of the general vocabulary and the emotion vocabulary are 11647 and 6389, respectively. We optimize the models using Adam (Kingma and Ba 2015) with a mini-batch size of 32. Dropout is set to 0.4, and the learning rate is initialized to 0.0001. During the training of SeqGAN and our proposed model, we employ the teacher-forcing technique from Li et al. (2017a) to increase training efficiency.
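For convenience, the reported settings can be collected in a single configuration object, as in the sketch below. The field names and the helper `embedding_parameters` are hypothetical; only the values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmpGANConfig:
    # Values as reported above; field names are illustrative assumptions.
    rnn_hidden_size: int = 400
    word_embedding_size: int = 300
    emotion_word_embedding_size: int = 200
    emotion_label_embedding_size: int = 100
    general_vocab_size: int = 11647
    emotion_vocab_size: int = 6389
    num_emotion_labels: int = 7
    batch_size: int = 32
    dropout: float = 0.4
    learning_rate: float = 1e-4
    optimizer: str = "adam"


def embedding_parameters(cfg):
    """Number of weights in each embedding table (rows x columns)."""
    return {
        "word": cfg.general_vocab_size * cfg.word_embedding_size,
        "emotion_word": cfg.emotion_vocab_size * cfg.emotion_word_embedding_size,
        "emotion_label": cfg.num_emotion_labels * cfg.emotion_label_embedding_size,
    }


config = EmpGANConfig()
print(embedding_parameters(config))
```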

Performance comparison
For research question RQ1, we examine the overall performance in terms of BLEU, Distinct, ROUGE and human evaluation. As shown in Table 1, EmpGAN achieves the highest scores on all three automatic metrics compared with the baselines. That is, the responses it generates are more natural and diverse. Notably, the models that consider emotional factors generally score higher on the three metrics than the general models, which suggests that incorporating emotional factors is a more natural way to learn the target distribution. We observe that EmpGAN achieves 7%, 129% and 29% increments over the state-of-the-art baseline Emo-HRED in terms of BLEU, Distinct and ROUGE, respectively. Specifically, EmpG is on par with Emo-HRED on BLEU, while EmpG achieves a noticeable improvement on Distinct. This observation demonstrates that, compared with Emo-HRED, which considers only emotional labels, the empathetic generator takes advantage of multi-granularity emotional factors and sufficiently perceives the emotional context, resulting in more diversified responses.
For human evaluation, we report averages and standard deviations, which show that annotators agree with each other's judgments in most cases. Table 2 illustrates that EmpGAN outperforms the baseline models in both content and empathy quality. In detail, the Content score verifies that the responses generated by the proposed model are perceived as more fluent and natural. The Empathy score indicates that our responses are more in line with the emotional contextual flow and evoke more emotion perceptivity.
To address research question RQ2, we conduct ablation tests on the usage of the multi-resolution mechanism and the discriminators, which are shown in Table 1. The comparison between EmpG and Emo-HRED on the three metrics demonstrates that the multi-resolution mechanism benefits model performance.
Compared with EmpG, EmpGAN achieves gains of 6.4, 3.69 and 1.82 on BLEU, Distinct and ROUGE, respectively, which verifies the effectiveness of interactive discriminators that interact with the responses, the overall context and the user feedback. In the model EmpD, we adopt the vanilla GAN, where the outputs from a discriminator are used as rewards for EmpG. We obtain a small increment from EmpG to EmpD in terms of BLEU and Distinct, which demonstrates the effectiveness of the discriminator. A further improvement from EmpD to EmpWD-next indicates that EmpWD-next benefits from Wasserstein adversarial learning with gradient penalty. When the user feedback is injected into the discriminator, EmpWD achieves improvements of 0.18, 0.85 and 0.25 over EmpWD-next on BLEU, Distinct and ROUGE. We conclude that user feedback, as an additional signal, helps the discriminator generate more effective outputs by interacting with the responses and the context, which facilitates the optimization of the empathetic generator. Comparing EmpGAN and EmpWD, EmpWD shows comparable performance on Distinct, whereas its performance on BLEU and ROUGE decreases, which partially verifies that the emotional discriminator induces the model to generate more grammatical responses.
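To clarify how the diversity numbers are read, the sketch below computes Distinct-n as the ratio of unique n-grams to all n-grams in the generated responses (Li et al. 2016), together with a relative gain in percent. Which n underlies the Distinct scores reported here is not stated, so n is left as a parameter; the example responses and the helper names `distinct_n` and `relative_gain` are illustrative assumptions.

```python
def distinct_n(responses, n):
    """Unique n-grams divided by total n-grams over all tokenized responses."""
    total = 0
    unique = set()
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


# Toy responses (not taken from the paper).
responses = [["i", "am", "fine"], ["i", "am", "happy"]]
print(round(distinct_n(responses, 1), 3))  # 0.667
print(distinct_n(responses, 2))            # 0.75
print(relative_gain(0.75, 0.5))            # 50.0
```

A higher Distinct-n means that fewer n-grams are repeated across responses, which is why the metric is used as a proxy for diversity.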

Analysis of emotion interaction
Now we turn to RQ3. Both the automatic and the manual evaluation show consistent incremental improvements when the multi-resolution emotional mechanism and the interactive discriminators are applied to the empathetic dialogue system. Table 4 lists the top 10 words from the generated responses of different models in order of frequency. We can see that the responses of each model contain several emotional words, owing to the characteristics of our dataset. It is worth noting that although Emo-HRED and EmoG contain the same number of emotion words, the emotion words of EmoG are ranked higher overall than those of Emo-HRED, meaning more frequent usage. As a whole, EmpGAN contains relatively more emotional words. This follows the human strategy for promoting healthy emotional experiences in conversations, namely using responses that contain emotional words. Compared with EmoG and the other baselines, the effectiveness of the multi-resolution mechanism and the interactive discriminators is demonstrated to some extent.
Table 3 shows two examples and the corresponding responses generated by the different models. For the first example, HRED and SeqGAN generate responses that are fluent but contradict the emotional context, because they do not consider emotional factors. Because Emo-S2S and Emo-HRED use only coarse-grained emotional labels, their generated responses conform to the current emotional states, but the models have difficulty generating diverse and long responses. Although the emotion expressed in the response of EmoG is faint, expressing a neutral emotion is suitable given the dialogue context. By using interactive discriminators, EmpGAN produces a response that is not only fluent but also emotionally consistent with the context. For the second example, except for SeqGAN, which expresses the wrong emotion, the responses from the other baselines all contain emotional words. However, we note that the response from EmpGAN contains more factual content based on rational emotion expression.

Case study
There are also some cases where EmpGAN does not perform well. For example, there exist some short and generic responses, such as "I 'm not sure" and "Ok". Some responses also occasionally contain repeated segments, like "I 'm going to a party tonight . I 'm going to a party with my birthday . ". This phenomenon indicates that the network architecture could be further improved to jointly balance the grammaticality of the content and the expression of emotions.

Conclusion
In this paper, we proposed an adversarial empathetic dialogue system (EmpGAN) to evoke more emotion perceptivity in dialogue generation. Two mechanisms were proposed to improve the performance of empathetic response generation. A multi-resolution empathetic generator combines coarse- and fine-grained emotional factors to capture the dynamic emotional contextual flow and evoke more emotion perceptivity. Two interactive discriminators utilize user feedback as additional context to interact with the dialogue context and the generated response, optimizing the long-term goal of empathetic conversation generation. Automatic and manual evaluations have shown that EmpGAN can generate responses that are appropriate not only in content but also in empathy.
As to future work, we will explore the implicit emotional feedback implied in the responses to further enhance the empathetic dialogue system. Also, we would like to extend our research problem to knowledge-based empathetic dialogue generation.

Code
To facilitate the reproducibility of the results in this paper, we are sharing the code at http://url.suppressed.for.anonymity.