Guiding Variational Response Generator to Exploit Persona

Leveraging the persona information of users in Neural Response Generators (NRG) to perform personalized conversations has been considered an attractive and important topic in the research of conversational agents over the past few years. Despite the promising progress achieved by recent studies in this field, persona information tends to be incorporated into neural networks in the form of user embeddings, with the expectation that the persona can be involved via end-to-end learning. This paper proposes to adopt the personality-related characteristics of human conversations into variational response generators, by designing a specific conditional variational autoencoder based deep model with two new regularization terms added to the loss function, so as to guide the optimization towards generating responses that are both persona-aware and relevant. Besides, to reasonably evaluate the performance of various persona modeling approaches, this paper further presents three direct persona-oriented metrics from different perspectives. The experimental results show that the proposed methodology can notably improve the performance of persona-aware response generation, and that the metrics are reasonable for evaluating the results.


Introduction
As an essential research topic in generative conversational agents (a.k.a. chat-bots), persona modeling is of great importance for such deep neural network based intelligent interactive systems (Li et al., 2016b; Kottur et al., 2017). Apparently, user-personality-dependent responses provided by a chat-bot are able to significantly improve the consistency of its conversations; meanwhile, they make it possible for users to flexibly customize the persona of a chat-bot based on existent dialogues. Among the studies on this topic, incorporating persona factors into End-to-End generative models is without doubt an attractive direction with great challenges. Intuitively, it is reasonable to take a real-valued user representation as the medium for introducing persona factors into deep learning based conversational models, given that embeddings play a significant role in current Neural Response Generators (NRG) as one basic component. Such user representations can be either co-trained (Li et al., 2016b; Kottur et al., 2017) or learned from contents generated by specific users (Liu et al., 2018; Mazare et al., 2018; Chu et al., 2018). Actually, the motivation of the user-representation oriented approaches is to generate persona-aware responses based on the style derived from past dialogues, rather than adopting the explicit meta-data of user profiles in the generation of responses (Qian et al., 2018; Chu et al., 2018; Song et al., 2019); the latter route is not included in the discussions of this work.
With the recent development of deep latent variable models (Shen et al., 2017; Zhou and Wang, 2018), it is natural to introduce user representations into latent variable models for persona-aware response generation. However, in current models, without explicit learning objectives or constraints, the user representation is adopted in a passive way to reduce the model loss and KL divergence via end-to-end learning. In this case, it is highly possible that the employed embeddings will not work as effectively as expected. Meanwhile, there are no convincing metrics to evaluate persona.
Consequently, it is highly necessary to employ explicit guidance to help variational response generators sense persona. In persona-contained dialogs, there exist intuitive characteristics for directing the optimization of persona-aware variational response generation. Obviously, for a given user, appropriately modeled and leveraged persona information can help to generate latent variables that are semantically relevant to the corresponding responses. Besides, since users may have their own linguistic styles, the adoption of personal information in NRG should have a direct influence on the degree of linguistic (e.g., lexical and syntactic) convergence for a specific user.
This paper aims at exploring explicit guidance to help the variational response generator exploit the persona information hidden in the unstructured contents produced by users, by utilizing the intuitive characteristics of personalized conversations during model training. The contributions of this paper can be summarized as follows:

• A persona-aware variational response generator is proposed to exploit persona while modeling the conversations.
• Based on the model, two regularization terms are presented to guide the model in encoding user information into the latent variables and converging to user-specific responses.
• Three discriminative metrics are further introduced to evaluate the capabilities of persona-aware response generators.

Approach
Based on current progress on latent variable models, we propose a persona-aware variational response generator (PAGenerator) to automatically exploit persona from conversations and utilize such personal information to model future conversations. Besides, given that personal information can serve as optimization guidance to better model persona, we further introduce two regularization terms to guide the model learning. In the following sections, we first describe the general structure of PAGenerator, and then explain the two additional regularization terms.

Persona-Aware Variational Response Generator
Utilizing latent variables in response generation has become a widely accepted methodology in NRG due to its Bayesian nature, which makes it straightforward to incorporate external knowledge such as persona. Our proposed model is therefore built on a latent-variable generation model. The overall architecture of the single-turn persona-aware variational response generator proposed in this paper is illustrated in Figure 1. Let q, r and u stand for the query, the reply and the corresponding user of r, respectively, and let e_u stand for the embedding of user u. A bidirectional LSTM is first employed to encode the query and reply into fixed-size vectors h_q and h_r. After that, the prior network (parameterized by θ) takes e_u and h_q as inputs to generate the distribution p_θ(z|q, u) of the latent variable z. Meanwhile, h_q and h_r are fed into a posterior network (parameterized by φ) to compute q_φ(z|q, r). As we adopt the assumption that z follows an isotropic Gaussian distribution, p_θ(z|q, u) and q_φ(z|q, r) are also normally distributed, such that:

p_θ(z|q, u) ∼ N(μ_p, σ_p^2 I),  q_φ(z|q, r) ∼ N(μ_q, σ_q^2 I),

where the means and variances are computed as follows:

[μ_p, log(σ_p^2)] = W_p [h_q; e_u] + b_p,
[μ_q, log(σ_q^2)] = W_q [h_q; h_r] + b_q,

where W_p, W_q, b_p and b_q are trainable parameters. A sample of z drawn with the reparametrization trick (Kingma and Welling, 2013) is then fed into the decoder as a part of its input at each time step. In addition, the bag-of-words (BOW) loss is employed to tackle the latent variable vanishing problem, and PAGenerator is trained to maximize the variational lower bound (Chung et al., 2015; Serban et al., 2017):

L(θ, φ; q, r, u) = E_{q_φ(z|q,r)}[log p(r|z, q, u)] − KL(q_φ(z|q, r) || p_θ(z|q, u)) + L_BOW.
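The prior/posterior parameterization and the reparametrized sampling described above can be sketched in plain Python; the linear-layer helper and the tensor shapes are illustrative only, not the paper's implementation (which would use a deep learning framework):

```python
import math
import random

def linear(W, b, x):
    # y = W x + b for a dense layer stored as a list of rows.
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def gaussian_params(W, b, inputs):
    # Concatenate the inputs ([h_q; e_u] for the prior, [h_q; h_r] for
    # the posterior) and split the projection into mean and log-variance.
    x = [v for vec in inputs for v in vec]
    out = linear(W, b, x)
    d = len(out) // 2
    mu, log_var = out[:d], out[d:]
    return mu, log_var

def sample_z(mu, log_var, rng=random):
    # Reparametrization trick: z = mu + sigma * eps, with eps ~ N(0, I),
    # so gradients can flow through mu and log_var.
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]
```

The sampled z is then concatenated to the decoder input at each time step.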

User Information Enhancing Regularization
Ideally, we expect the introduced user embedding to be fully utilized during model training. However, due to the KL vanishing problem, the training of PAGenerator suffers from the hazard that the rapid improvement of the training objective might be attributed to the strong fitting capability of the decoder on the training data, rather than to the involvement of the user embedding. Thus, we introduce a regularization term to promote the usage of the user's hidden information in the latent variables. To begin with, as illustrated in Figure 1, a general user unk_u is introduced to represent the case where the user is unspecified.
Subsequently, taking the default user embedding e_unk_u as input, we obtain the KL divergence KL(q_φ(z|q, r) || p_θ(z|q, unk_u)) from the network. Once the real user u is introduced, a regularization term R_1(θ, φ; q, r, u) can be constructed as follows:

R_1(θ, φ; q, r, u) = max(0, γ_1 + KL(q_φ(z|q, r) || p_θ(z|q, u)) − KL(q_φ(z|q, r) || p_θ(z|q, unk_u))),

where γ_1 ∈ R, γ_1 > 0, and p_θ(z|q, unk_u) ∼ N(μ_p', σ_p'^2 I). It should be noted that, according to the equation above, the two prior distributions are generated from the same network with partially different inputs (u vs. unk_u), and the regularization constrains the prior distribution with the specified user to be closer to the posterior distribution. Thus, the optimization encourages the utilization of user information and correspondingly inhibits the generated results from ignoring it. Meanwhile, R_1 in our proposed model also alleviates the KL vanishing problem.
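The comparison behind R_1 can be sketched as below. The closed-form KL between diagonal Gaussians is standard; the hinge with margin γ_1 reflects our reading of the constraint described above (penalize the model unless the user-specified prior is closer to the posterior than the unspecified one):

```python
import math

def kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p):
    # KL( N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2)) ),
    # summed over dimensions, in closed form.
    kl = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        kl += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return kl

def r1(mu_q, lv_q, mu_user, lv_user, mu_unk, lv_unk, gamma1=0.1):
    # Hinge margin: zero penalty only when KL(q || p_user) is smaller
    # than KL(q || p_unk) by at least gamma1.
    kl_user = kl_diag_gauss(mu_q, lv_q, mu_user, lv_user)
    kl_unk = kl_diag_gauss(mu_q, lv_q, mu_unk, lv_unk)
    return max(0.0, gamma1 + kl_user - kl_unk)
```

Because R_1 only compares the two priors against the same posterior, it never rewards collapsing the posterior itself, which is why it also counteracts KL vanishing.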

Variance Controlling Regularization
The BOW loss forces the latent variables to predict the bag of words in the response. Therefore, the semantic distribution of z is required to be capable of representing the topics and wording of the target response. Besides, for a given query, the possible replies from a specific user should be more convergent to each other than those from an unknown user, due to each user's unique preferences on topics and wording. Correspondingly, under the assumption that the distribution of z represents the user's language preference, the specification of user information is expected to reduce the entropy of the isotropic Gaussian distribution of z, reflected by a lower standard deviation σ_p. On this basis, we introduce another regularization term R_2(θ, φ; q, r, u) to control the variance:

R_2(θ, φ; q, r, u) = max(0, γ_2 + ||σ_p|| − ||σ_p'||),

where γ_2 ∈ R, γ_2 > 0, and σ_p and σ_p' are the standard deviations of the priors with the specified user u and with unk_u, respectively. R_2 prefers those z whose standard deviation σ_p decreases by at least γ_2 after specifying the user, and such a decrease indicates that the latent variables are more semantically convergent. On this basis, we update the training objective of PAGenerator as follows:

L'(θ, φ; q, r, u) = L(θ, φ; q, r, u) − R_1(θ, φ; q, r, u) − R_2(θ, φ; q, r, u).

By employing the two regularization terms to constrain the model training, L'(θ, φ; q, r, u) now also pays attention to the utilization of user information and language preference.
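The variance-controlling term can be sketched as follows; aggregating the per-dimension standard deviations by their mean is an illustrative choice, as the text only requires that specifying the user shrink σ_p by at least γ_2:

```python
import math

def r2(logvar_user, logvar_unk, gamma2=0.1):
    # Mean per-dimension standard deviation of each prior.
    sd_user = sum(math.exp(0.5 * lv) for lv in logvar_user) / len(logvar_user)
    sd_unk = sum(math.exp(0.5 * lv) for lv in logvar_unk) / len(logvar_unk)
    # Zero penalty only when the user-specified prior is narrower than
    # the unspecified one by at least gamma2.
    return max(0.0, gamma2 + sd_user - sd_unk)
```

During training the penalty would be subtracted from the variational lower bound alongside R_1.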

Specified Evaluation Metrics of Persona NRG
In the previous section, two regularization terms were proposed to guide the model in persona exploration. However, we still lack effective persona-focused metrics to quantify how well a model learns persona. The currently applied metrics for persona-aware NRG evaluation, such as perplexity and BLEU, were designed to evaluate plain NRG models (Li et al., 2016b; Kottur et al., 2017). Apparently, such metrics are inadequate for evaluating the capacity of a response generator to capture persona. Innately, an effective persona-aware response generator should be able to identify and generate responses for users according to their language styles. Besides, the generated responses for different users should be diversified from each other in wording. Considering these properties, we propose the following metrics to measure the level of persona awareness in response generators.

Language Style Detection
It is important for a persona-aware response generator to identify a user's response among other user-irrelevant ones, by detecting the user's language style in responses. In this subsection, we propose User-Relative-Rank (uRank) to measure this capability. Given a query-response-user triple {q, r, u}, a pre-trained seq2seq model S2S and a model M to be evaluated, we first generate n user-irrelevant responses {r_i | i ∈ [1, n]} from S2S using beam search. A desired persona-aware model M is expected to assign the ground truth response r a higher probability than the user-irrelevant ones {r_i | i ∈ [1, n]}. Thus, taking S2S as reference, we set uRank to 1 if M ranks r higher among the r_i than S2S does, specifically:

uRank = 1 if |{r_i : P_M(r) > P_M(r_i)}| > |{r_i : P_S2S(r) > P_S2S(r_i)}|, and 0 otherwise,

where P_M(r) and P_S2S(r) are the probabilities assigned to r by M and S2S respectively for the triple {q, r, u}, and |X| denotes the cardinality of a set X. Overall, for model M, its average uRank over different queries denotes the rate of rank-promoted ground-truth replies.
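The per-triple uRank decision reduces to counting how many of the irrelevant responses each model scores below the ground truth; a minimal sketch, with the response probabilities taken as given:

```python
def urank(p_m_r, p_m_irr, p_s2s_r, p_s2s_irr):
    # How many user-irrelevant responses each model scores strictly
    # below the ground truth response r.
    beats_m = sum(1 for p in p_m_irr if p_m_r > p)
    beats_s2s = sum(1 for p in p_s2s_irr if p_s2s_r > p)
    # uRank = 1 iff the evaluated model M ranks r strictly higher
    # than the reference S2S model does.
    return 1 if beats_m > beats_s2s else 0
```

Averaging `urank` over a test set then yields the rate of rank-promoted ground-truth replies.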

Language Style Imitation
Apart from perceiving users' language styles, an effective persona-aware model should also be able to imitate language styles by generating responses that satisfy users' language behaviors. User-Language-Perplexity (uPPL) is proposed to measure this property. Given a user u_i, a statistical language model LM_i is first trained on the user's utterances. After that, for a generated response r', its uPPL is defined as the perplexity of r' given by LM_i. uPPL quantifies the power of a persona-aware model in generating responses similar to a user's history utterances.
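A sketch of the per-user language model LM_i behind uPPL, assuming a bigram model with add-one smoothing (the smoothing scheme is an illustrative choice; the experiments below only state that a bigram model is used):

```python
import math
from collections import Counter

class BigramLM:
    # Add-one-smoothed bigram language model over whitespace tokens,
    # standing in for the per-user model LM_i trained on u_i's utterances.
    def __init__(self, sentences):
        self.uni = Counter()
        self.bi = Counter()
        self.vocab = set()
        for s in sentences:
            toks = ["<s>"] + s.split() + ["</s>"]
            self.vocab.update(toks)
            for a, b in zip(toks, toks[1:]):
                self.uni[a] += 1
                self.bi[(a, b)] += 1

    def perplexity(self, sentence):
        # uPPL of a generated response: exp of the average negative
        # log-probability of its bigrams under this user's model.
        toks = ["<s>"] + sentence.split() + ["</s>"]
        log_p, n = 0.0, 0
        v = len(self.vocab) + 1  # +1 for unseen tokens
        for a, b in zip(toks, toks[1:]):
            p = (self.bi[(a, b)] + 1) / (self.uni[a] + v)
            log_p += math.log(p)
            n += 1
        return math.exp(-log_p / n)
```

A response matching the user's habitual phrasing receives a lower perplexity than one that does not.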

Diversity between Users
Finally yet importantly, due to the introduction of user information, given a query we expect that the responses generated for different users by a persona-aware model should also be diversified. Therefore, Users-Distinct (uDistinct) is proposed in this paper to capture this property. Given a query q_i and m different users, the model generates m responses, one per user, and uDistinct is defined as the Distinct-1/Distinct-2 scores computed over these m responses.

Datasets

The Chinese SNS corpus is crawled from the Chinese social network service Douban, containing a total of 1,022,592 single-turn dialogues from 12,857 users, while the Cornell Movie Dialogues corpus consists of conversations from movie scripts. After cleaning the Cornell corpus with an open-source script, we obtain 109,952 single-turn dialogues from 9,035 movie characters. The training/test ratios for the two corpora are around 200:1 and 50:1, respectively.
There are two main differences between the two datasets: 1) The scenes of the conversations are different. The dialogues in Douban are crawled from open-domain social media. By contrast, since the characters in the Cornell movie corpus are assigned fixed personas, the language styles and habits of its users are more templatized. Besides, the language style in Cornell is more oral-like, with many personal pronouns. 2) The average number of utterances per user in the Douban corpus is around 10 times that of Cornell.

Model Variations
To compare with our proposed method, the following models are used as baselines: S2SA Vanilla sequence-to-sequence model with attention (Sordoni et al., 2015).
fact bias S2SA with fact bias for persona modeling (Michel and Neubig, 2018). fact bias was originally proposed for NMT; it models user information as an additional bias vector learned through a factored model in the softmax layer.
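The effect of such a bias term can be sketched as below; note that Michel and Neubig (2018) learn the per-user bias through a low-rank factored model, whereas here the vector is simply taken as given:

```python
import math

def biased_softmax(logits, user_bias):
    # Add a per-user bias vector to the output-layer logits before
    # normalizing, boosting words this user favors.
    scores = [l + b for l, b in zip(logits, user_bias)]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

Because the bias acts on individual output units, it can only capture unigram-level preferences, which is relevant to the bigram-based uPPL results discussed later.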
Speaker Model Framework proposed by Li et al. (2016b). This model is similar to S2SA + fact bias, except that the user information is added as a part of decoder input rather than bias in the softmax layer.
VAE Standard Variational AutoEncoder for response generation (Serban et al., 2017). In our experiment, we condition the model on the query only and apply the auxiliary BOW loss in training.
CVAE Conditional Variational AutoEncoder with user information as prior knowledge for modeling persona. As with VAE, the bag-of-words loss is applied in CVAE. Moreover, the component for predicting prior knowledge is removed in our implementation of CVAE.
For a fair comparison, we use the same configurations for all models. The sizes of the word embedding and user embedding are set to 300 and 128, respectively. We employ a bi-directional LSTM with hidden size 256 for encoding, and an LSTM with hidden size 512 for decoding. For latent models, the dimension of z is set to 128.
All models are optimized using Adam (Kingma and Ba, 2014) with learning rate = 2e−4 and batch size = 128. For latent models, we also use KL annealing (Bowman et al., 2016) (400,000 batches for Douban corpus and 100,000 batches for Cornell Movie corpus) to achieve better performance.
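KL annealing (Bowman et al., 2016) can take several shapes; a minimal linear variant over the batch budgets stated above might look like:

```python
def kl_weight(step, anneal_steps):
    # Linear KL annealing: the weight on the KL term grows from 0 to 1
    # over the first `anneal_steps` training batches, then stays at 1,
    # letting the decoder learn before the KL pressure kicks in.
    return min(1.0, step / float(anneal_steps))
```

With `anneal_steps = 400000` for Douban (or 100,000 for Cornell), the annealed objective would weight the KL term by `kl_weight(step, anneal_steps)` at each training batch.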

Automatic Evaluation Metrics
To thoroughly evaluate our systems, both standard and persona-focused metrics are employed in our experiments. For standard metrics, we adopt unigram BLEU (BLEU-1) (Papineni et al., 2002) and the Word Embedding metrics (Liu et al., 2016), including Embedding Average (Average), Vector Extrema (Extrema) and Greedy Matching (Greedy), to evaluate the semantics of generated responses with regard to the ground truths. We use pre-trained word embeddings from Song et al. (2018) for the Douban corpus and embeddings from Pennington et al. (2014) for the Cornell movie corpus.
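Embedding Average, the simplest of the three embedding metrics, scores a generated response by the cosine similarity between the mean word vectors of the response and the ground truth; a sketch (tokenization and out-of-vocabulary handling are simplified):

```python
import math

def embedding_average(tokens, emb):
    # Mean of the word vectors of a sentence; tokens missing from the
    # embedding table are simply skipped.
    dim = len(next(iter(emb.values())))
    avg = [0.0] * dim
    n = 0
    for t in tokens:
        if t in emb:
            for i, v in enumerate(emb[t]):
                avg[i] += v
            n += 1
    return [a / n for a in avg] if n else avg

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

The Average score of a response is then `cosine(embedding_average(response, emb), embedding_average(reference, emb))`; Extrema and Greedy differ only in how the word vectors are aggregated or matched.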
Other than the standard metrics, all three proposed metrics (uRank, uPPL and uDistinct) are adopted to measure each model's performance in capturing persona. For uPPL, we use a bigram language model for perplexity computation. Since the effectiveness of uPPL relies on the quality of the constructed user language models, we pre-train the statistical language model on the whole training data and then fine-tune it on each user's utterances. Besides, we drop users with fewer than 100 utterances in Douban and fewer than 30 in Cornell. The value of uRank, which depends on the rankings of the predicted probabilities of responses, is not stable for latent models due to the randomness in sampling z. Therefore, uRank for each latent model is computed over 10 rounds, and the 10 resulting values are averaged as the final uRank. Later experimental results show that the uRank of any latent model varies only slightly, by around ±0.005 per round.

The Human Evaluation Criterion
For a more meaningful evaluation, we also use the crowdsourcing labeling resources of our organization to manually evaluate both the relevance and the persona of the generated responses. Since the degree of persona reflected in a response is difficult for humans to judge, we simplify the annotation into a yes/no task: annotators are only asked to decide whether a response can reflect persona for the given user. Beforehand, the annotators have to read the user's utterances to learn the persona for judging; in practice, we limit the number of each user's sample utterances to 100. Since this judgment is inevitably rather subjective, we recruit 11 annotators for each sample and make the final determination by voting. The evaluation of relevance is comparatively easy: each query-response pair is cross-evaluated by 3 annotators, following the labeling criteria used in Xing et al. (2017) and Mou et al. (2016). Details about data sampling and labeling guidelines are given in Appendix A.

Results on the Douban Corpus
We first report the performance on the Douban corpus. The results of the automatic evaluation metrics are shown in Table 1, where numbers in bold indicate that the improvement on that metric is statistically significant over the other methods (p-value ≤ 0.01). It is observed that the BLEU-1 scores of the various models are relatively low and close to each other. We attribute this to the fact that the semantics of possible responses to one query are highly diversified in terms of speaking styles and topics, so that only a small portion of words may be shared among the responses apart from high-frequency words (Mou et al., 2016; Liu et al., 2016). Nevertheless, the user-enhanced models achieve higher BLEU-1 scores due to their capability of considering the preferences of a user. Furthermore, by comparing the models' performances on the embedding metrics, we find that all models obtain decent scores, but none outperforms the others significantly. This phenomenon has also been observed in previous studies (Serban et al., 2017), as all the models generate responses semantically similar to the ground truth. Despite this, PAGenerator achieves the highest score on average, which suggests that the responses generated by PAGenerator are more semantically relevant to the ground truths.
While all models perform more or less the same on the standard metrics, their results on the persona metrics are quite different. All persona-aware NRG models outperform S2SA and VAE, which contain no user information, on uRank, while the two variational models with user information significantly exceed the remaining models. This shows that persona-aware response generators, especially those exploiting user embeddings to generate latent variables, are more sensitive in identifying users' language styles. Among all models with user modeling, our proposed PAGenerator achieves the highest uRank.
The advantage of introducing persona information into NRG is also reflected by uPPL. The replies given by the three models employing user embeddings are more consistent with the users' language styles, which indicates that user embeddings are useful for learning language style automatically in an End-to-End NRG model. By contrast, since S2SA with fact bias focuses on learning a user's bias over unigrams only, it struggles to achieve a good uPPL, which is computed from a bigram perspective. Moreover, comparing the performance of CVAE to the Speaker Model, it appears that utilizing latent variables in the standard way cannot further improve uPPL. By contrast, the two new regularizations proposed for persona modeling help PAGenerator generate replies with more specific persona, reducing uPPL by 21.2 points compared to CVAE.
As mentioned in previous sections, uDistinct measures the diversity of the generated responses across different users. In general, latent models achieve higher uDistinct than non-latent ones due to the randomness introduced by the latent variables. Within the latent models, the adoption of user information in CVAE only slightly improves its uDistinct compared to VAE without user specification. This indicates that user embeddings are ineffectively utilized in CVAE, which is our motivation for proposing new methods for the variational response generator. The notable improvement of PAGenerator in uDistinct verifies their effectiveness in exploiting persona. To further investigate these improvements, we present case studies in Appendix B. Besides, the comparison among the baseline models is consistent with the experiments in previous studies (Li et al., 2016b; Zhou and Wang, 2018), which indicates that the proposed metrics are apposite for evaluating the capability of NRG models to capture persona.

Human Evaluation
To evaluate the quality of the generated responses from each model more subjectively, we also conduct human labeling. As shown in Table 2, adjusting unigram distributions for users via fact bias reduces the quality of the generated responses. By contrast, all other models produce more high-quality replies than S2SA. Moreover, responses from PAGenerator achieve the best human evaluation result, which indicates that PAGenerator's improvement in persona capturing does not reduce relevance.
Meanwhile, in the last column, the trend of the evaluated results on persona is almost consistent with that given by the proposed automatic evaluation metrics. PAGenerator outperforms the other models, and some parts of the replies generated by the persona-aware models can reflect personality. Besides, due to randomness, some responses given by S2SA and VAE are also labeled as persona-aware. However, S2SA generates fewer high-quality responses than VAE, and thus its proportion is even lower.

Results on the Cornell corpus
As shown in Table 3, the overall trend of the experimental results on the Cornell corpus is consistent with that on the Douban corpus. The models aware of the specified user slightly outperform the others on BLEU and the embedding metrics. As regards the persona metrics, the experimental results on the Cornell corpus show two main differences: a) The Speaker Model does not perform as well on user language style detection and generation, mainly because the training data for each user is smaller than in the Douban corpus, and it is hard to automatically learn informative user embeddings via target-oriented learning without guidance. By contrast, utilizing the KL divergence as guidance in CVAE effectively improves the experimental results. b) Due to the individual characteristics of movie characters, the user-embedding-enhanced models generate more diverse responses for different users, especially PAGenerator.

Ablation Study
To get a better intuition about how our proposed method works, we conduct ablation tests on the Cornell corpus to analyze the contribution of each component of PAGenerator to persona exploitation. As illustrated in Table 4, adding the user embeddings (UE) as a part of the decoder inputs brings positive improvements on all the evaluated persona exploiting abilities, reflected by the improvements in the three metrics. Without UE, the parameter size of PAGenerator reduces considerably, which is harmful to a neural network's ability to fit the target data. Besides, without direct constraints from the decoder, user embeddings mainly act on reducing the KL divergence rather than providing more informative latent variables. In addition, it can be observed that even without UE, PAGenerator significantly outperforms VAE on all persona-focused metrics, which demonstrates that R_1 and R_2 are indeed useful for guiding the latent variables to model the semantics underlying the query and users.
Comparing the ablation results of w/o R_1 and w/o R_2, we can conclude that both regularizations promote uRank. However, PAGenerator w/o R_2 only achieves a mediocre result on uPPL, while utilizing only R_2 damages the model's ability to generate diverse responses for different users. We attribute this divergence to the trade-off between a) the movie-style language shared between characters (users) and b) the different language preferences among characters in the movie scripts. Since R_1 promotes the divergence of z between specified and unspecified users, removing R_1 makes it harder for the model to generate diverse responses for different users, reflected by the low uDistinct of w/o R_1. However, in the Cornell movie corpus, users share many language patterns in common to satisfy the requirements of movie-style language. Under this circumstance, promoting diversity will more or less sacrifice the model's learning of the shared common patterns, which is vital in evaluating language cohesion. Therefore, the performance of PAGenerator with only R_1 on uPPL is less than ideal.
In contrast, since R_2 emphasizes the patterns often used by a given user, it encourages the distribution of user information to be more concentrated. These differences explain why the results of w/o R_1 and w/o R_2 are opposite to each other on the last two metrics.
In conclusion, the user embedding is an important component of the persona-aware variational response generator, and R_1 and R_2 can be deployed in the model for different purposes. Furthermore, utilizing all components of PAGenerator described in Figure 1 guarantees a balanced and relatively best performance across all three evaluated persona exploiting abilities.
Related Work

Persona-based Neural Models
Persona-based neural conversation models can be categorized into two major research directions.
One is to directly train a model on conversational data considering the persona information (Li et al., 2016b; Kottur et al., 2017; Madotto et al., 2019), while the other makes use of user profiles or side information to generate aligned responses (Chu et al., 2018; Qian et al., 2018; Mazare et al., 2018). The work described in this paper belongs to the first direction. Li et al. (2016b) and Kottur et al. (2017) enrich the models by training persona vectors directly and incorporating them into the decoder. Another line of work proposes strategies to learn the language style instead of introducing new models.
Apart from the development of persona-based NRG models, recent research also attempts to incorporate persona into neural machine translation. Michel and Neubig (2018) propose to learn speaker-specific parameters for the bias term in the output layer to promote a user's preferred unigrams, and Wuebker et al. (2018) introduce offset tensors to perform fine-tuning for each user.

Variational Response Generator
Variational response generators have drawn much attention recently, due to the observation that they can flexibly include the effect of conditions thanks to their Bayesian architecture (Shen et al., 2017) and naturally promote diversity by involving sampling in the generation stage (Serban et al., 2017; Du et al., 2018; Shen et al., 2018). Shen et al. (2017) introduce frameworks taking various conditions to influence the model learning. Afterwards, Zhou and Wang (2018) include emoji in the variational NRG model to generate responses with particular emotions. Actually, these models (Shen et al., 2017; Zhou and Wang, 2018) can also be deployed in the persona-aware response generation scenario; the main difference is that the speaker of the response is unpredictable from the query. Thus, for a meaningful comparison, we adopt this conditional architecture and modify it to fit persona-aware generation. In particular, Song et al. (2019) have utilized persona information in the CVAE architecture, except that they focus on modeling and copying users' explicit profiles.
Conclusion

In this paper, we propose a variational neural network to model the conversation as well as the persona of users. On the basis of this network, two regularization terms are designed to guide the model in emphasizing the importance of the hidden user information. In addition, to better reflect the persona characteristics of the response generation model, three metrics are introduced to quantify the level of persona in the generated responses. Experimental results show that our approach significantly outperforms the baseline models and that the proposed metrics are effective in evaluating the capabilities of models to generate persona-aware responses.

A.1 Labeling Dataset Preparation
For each model, given the query set, three generated responses per query are randomly sampled from the results of beam search with a beam size of 10. In total, 3,000 query-response pairs are prepared for labeling.

A.2 Labeling Criterion of Relevance
The labeling criteria for judging the relevance between a response and the given query are as follows:

0: the quality of the response is poor; it is either irrelevant to the query or grammatically incorrect.

1: although the response itself is acceptable as a reply, its content is not informative and is dull.

2: the response is not only relevant and grammatically correct, but also informative or interesting.

A.3 Human Evaluation Results on the Cornell Corpus
As shown in Table 5, on the English dataset the comparison results are almost consistent with those discussed in Section 5.2. According to the annotators' judgments, our proposed model outperforms the others from both the relevance and the persona perspectives. However, the overall quality of the responses generated for the Cornell queries is not as good as for the Douban corpus, although the persona is reflected more obviously. We attribute this to the differences in corpus size and word distribution described in Section 4.1: the quality on Cornell suffers from insufficient training conversations, but with the help of the more templatized language styles and habits in Cornell, the persona-aware NRG models can generate more characterful responses.

B Case Studies
As shown in Figure 2, we select three users whose utterances reflect implicit personal features. For example, the gender of user U_3 in the first case is probably female; the user U_4 in the second case is very likely an animation fan; and according to the conversation history of user U_3 in the last case, it can be inferred that the user is struggling to lose weight. Correspondingly, from the responses generated by PAGenerator, we can observe that such implicit information is adopted by our proposed model to produce persona-aware results. Figure 3 gives additional cases generated by PAGenerator, CVAE and VAE for the same given query. Apparently, every independent user should have his or her own linguistic and personality characteristics; thus, the results generated for different users are expected to maintain sufficient diversity. According to the cases in Figure 3, the results of PAGenerator keep obvious diversity among different individuals, indicating its better capability of capturing the persona of users.