You Impress Me: Dialogue Generation via Mutual Persona Perception

Despite continuing efforts to improve the engagingness and consistency of chit-chat dialogue systems, the majority of current work simply focuses on mimicking human-like responses, leaving the modeling of understanding between interlocutors understudied. Research in cognitive science, in contrast, suggests that understanding is an essential signal for a high-quality chit-chat conversation. Motivated by this, we propose P² Bot, a transmitter-receiver based framework that aims to model understanding explicitly. Specifically, P² Bot incorporates mutual persona perception to enhance the quality of personalized dialogue generation. Experiments on a large public dataset, Persona-Chat, demonstrate the effectiveness of our approach, with a considerable boost over the state-of-the-art baselines across both automatic metrics and human evaluations.


Introduction
Thanks to the advance in neural models and the accessibility of massive datasets, open-domain dialogue (i.e. chit-chat) systems have made great progress towards mimicking human-like responses. Nevertheless, there still exist some serious challenges in building personalized chatbots that can deliver engaging conversations and gain user trust (Song et al., 2019). For example, current chit-chat systems tend to generate uninformative responses (Li et al., 2016b). Moreover, they usually lack coherent personality traits because the training dialogues come from a diverse set of speakers (Zhang et al., 2018b).

* Work done during an internship at Microsoft Research.

[Figure 1: a clipped dialogue from PERSONA-CHAT, with one persona per interlocutor. Utterances as extracted:]
Hi! Seen any good movies lately?
I have been to the movies.
I love The Godfather, one of my favorites! Was that filmed?
I don't believe so. I don't watch movies more of a writer.
What do you write? Any diet books? I am not very healthy.

Several attempts have been made to alleviate the above issues. Methods like special reward shaping to reduce generic responses (Li et al., 2016b) and representing the speakers with latent variables (Li et al., 2016a) were introduced to improve the engagingness of chit-chat systems. A more straightforward approach, which equips chit-chat systems with predefined personas, was proposed together with a novel dataset, PERSONA-CHAT (Zhang et al., 2018b). Figure 1 shows a clipped dialogue from PERSONA-CHAT. Two interlocutors meet for the first time and are having a conversation in order to get to know each other. What makes PERSONA-CHAT unique is that the personas of both interlocutors are explicitly described using several profile sentences, facilitating the training of chatbots with configurable and persistent personalities.
PERSONA-CHAT has fueled a growing interest in developing methods for personalized dialogue generation. Mazaré et al. (2018) incorporated additional data from Reddit to train the model. Wolf et al. (2019b) fine-tuned a pretrained language model (Radford et al., 2018) to improve dialogue generation. Although both works demonstrate promising results, they focus more on mimicking the style of human-like responses, leaving the explicit modeling of understanding between interlocutors understudied. Our work, instead, takes the perspective of understanding modeling.
According to research in cognitive science, effective communication creates similar activation maps in the brains of both interlocutors (Hasson et al., 2012), suggesting that understanding between interlocutors is an essential signal for a high-quality chit-chat conversation. For instance, in the conversation shown in Figure 1, the two interlocutors foster understanding either by raising persona-related topics, "Seen any good movies lately?", or by revealing their own personas through answering questions, "I don't watch movies more of a writer.". The efforts to build understanding keep the conversation flowing.
Taking the above into account, we propose Persona Perception Bot (P² BOT), which explicitly models the understanding between interlocutors with a transmitter-receiver framework. Distinguished from traditional methods, P² BOT highlights a novel concept, mutual persona perception, which is better suited to describe the information exchange process that empowers the interlocutors to get to know each other. In order to train P² BOT for personalized dialogue generation, we employ supervised training and self-play fine-tuning piloted by reward signals characterizing mutual persona perception. Experiments on the PERSONA-CHAT dataset demonstrate the superiority of our approach over the baselines in both automatic metrics and human evaluations.

Methodology Overview
The central idea of P² BOT is to explicitly model understanding between interlocutors and enhance dialogue generation via mutual persona perception. It comprises two components, Transmitter and Receiver, responsible for dialogue generation and mutual persona perception respectively. Figure 2 gives an overview of P² BOT: interlocutor A has a persona $w^A = \{w^A_1, w^A_2, \dots, w^A_L\}$, described with $L$ profile sentences. When she first meets the other interlocutor B, they get to know each other through an $N$-turn dialogue $(x^A_1, x^B_1, \dots, x^A_N, x^B_N)$, where $x^A_n$ denotes the utterance that A says in the $n$-th turn and $N$ denotes the total number of turns. Given the entire dialogue history up to the $n$-th turn, $h^A_n = (x^A_1, \dots, x^B_{n-1})$, Transmitter generates $x^A_n$ according to the distribution $p(x^A_n \mid w^A, h^A_n)$ and transmits it to B. The same process applies to B, keeping the conversation flowing.
As the conversation goes on, impressions are gradually built via utterances. For example, when A says "I don't watch movies more of a writer.", the impression that "A is a writer." is left on B's mind. As mentioned above, a successful conversation helps interlocutors know each other, which means B's impression of A should correspond to A's persona and vice versa. Receiver aims to measure the proximity between the built impressions and the actual personas. Specifically, as demonstrated by the dashed black lines in Figure 2, Receiver first projects impressions and personas into a latent space, and then measures the relevance between them based on the impression encoding (e.g. $H^A$, B's impression on A, projected from A's utterances $x^A$) and the persona encoding (e.g. $W^A$, projected from A's persona $w^A$). The relevance scores serve as mutual persona perception rewards, and are further incorporated into the training of Transmitter. Details of the two components are presented in Sections 3 and 4.

Transmitter
Following previous work (Li et al., 2016b; Zhang et al., 2018b), we treat dialogue generation as a sequence generation problem. Concretely, we employ the pretrained transformer language model introduced in Radford et al. (2018) (i.e. GPT) to initialize Transmitter. The entire training procedure consists of two steps: (1) Supervised Dialogue Generation. We optimize Transmitter via maximum likelihood estimation (MLE) on the supervised dialogue generation task. (2) Self-play Model Fine-tuning. We simulate dialogues between two randomly paired interlocutors, encouraging Transmitter to learn a policy that maximizes reward signals via reinforcement learning (RL) (Sutton et al., 1999). The design of the reward function considers both language modeling and our proposed mutual persona perception.

Supervised Dialogue Generation
As illustrated in Figure 3, Transmitter follows the overall architecture of 12 stacked transformer layers to encode the context and generate the response. Here, the context contains the persona $w^A$, the dialogue history $h^A_n$, and several special tokens (e.g. [PS], which indicates the start of the persona). Given a training instance $(w^A, h^A_n, x^A_n)$, the training objective of MLE is to maximize the conditional log-likelihood:

$$\mathcal{L}_{mle} = \sum_{t} \log p_\theta(x^A_{n,t} \mid w^A, h^A_n, x^A_{n,<t}) \tag{1}$$

where $\theta$ is the parameter of Transmitter, $x^A_{n,t}$ is the $t$-th token in $x^A_n$, and $x^A_{n,<t}$ is the token sequence before the $t$-th token. Equation 1, hereafter simplified as $\log p_\theta(x^A_n \mid w^A, h^A_n)$, applies to both A and B; we mention only A for the sake of brevity (the same holds below).
During inference, beam search is applied to store top-ranked response candidates $\{\hat{x}^A_n\}$, and Transmitter subsequently chooses as its prediction the one that maximizes the length-normalized score:

$$x^{A*}_n = \arg\max_{\hat{x}^A_n} \frac{\log p_\theta(\hat{x}^A_n \mid w^A, h^A_n)}{|\hat{x}^A_n|} \tag{2}$$

Besides the sequence generation task, inspired by Wolf et al. (2019b), we set up an auxiliary task, Next Utterance Prediction. Apart from training Transmitter to generate responses, we also train it to discriminate whether a response is the next utterance of the given context. Concretely, we append a special token [CLS] to the tail of the generated tokens. A classifier is built on top of that token's hidden state in the last transformer layer, as indicated by the red rounded rectangle in Figure 3. In training, for each response, we randomly sample a distractor and train the classifier to give a higher score to the response than to the distractor. In inference, the classifier is used to rank response candidates together with Equation 2. Denoting as $y_n = 1$ the signal indicating that the generated response $\hat{x}^A_n$ is predicted as the next utterance, Equation 2 is extended as:

$$x^{A*}_n = \arg\max_{\hat{x}^A_n} \left( \alpha \cdot \log p_\theta(y_n = 1 \mid w^A, h^A_n, \hat{x}^A_n) + (1 - \alpha) \cdot \frac{\log p_\theta(\hat{x}^A_n \mid w^A, h^A_n)}{|\hat{x}^A_n|} \right) \tag{3}$$

where $\alpha$ is a hyper-parameter.
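The candidate ranking described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names and the candidate tuple layout are hypothetical, and the sketch assumes the classifier log-probability is weighted by α and the length-normalized generation score by 1 − α.

```python
from typing import List, Tuple

# (text, per-token log-probabilities, log p(y=1 | context, text)); hypothetical layout
Candidate = Tuple[str, List[float], float]


def length_normalized_score(token_logprobs: List[float]) -> float:
    """Sequence log-probability divided by the number of tokens."""
    return sum(token_logprobs) / len(token_logprobs)


def rerank(candidates: List[Candidate], alpha: float = 0.1) -> str:
    """Pick the beam candidate with the highest combined score.

    Assumption: the next-utterance log-probability is weighted by alpha and
    the length-normalized generation score by (1 - alpha).
    """
    def combined(candidate: Candidate) -> float:
        _, token_logprobs, next_utt_logprob = candidate
        generation = length_normalized_score(token_logprobs)
        return alpha * next_utt_logprob + (1.0 - alpha) * generation

    return max(candidates, key=combined)[0]
```

With `alpha = 0` the choice reduces to the purely length-normalized ranking; larger values let the next-utterance classifier override a slightly more likely but less coherent candidate.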

Self-play Model Fine-tuning
Although supervised dialogue generation alone can be used to mimic human-like responses, it does not inherently target understanding. Therefore, we further fine-tune Transmitter using reinforcement learning with the goal of maximizing mutual persona perception. Analogous to Lewis et al. (2017), we apply self-play to simulate the communication between two Transmitters, both of which have been trained as described in Section 3.1.
Specifically, we have the two Transmitters communicate with each other for several turns. One Transmitter serves as a user with its parameters frozen, while the other is a learnable agent. The parameter of the learnable agent, $\theta$, is fine-tuned during the self-play. Without loss of generality, in our experiments, we let interlocutor A, who starts the conversation, be the user, and correspondingly B be the learnable agent.
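The alternation between the frozen user A and the learnable agent B can be sketched as a simple loop. This is an illustrative sketch only: `self_play`, `user` and `agent` are hypothetical names, each speaker is reduced to a callable that maps the dialogue history to a reply (personas are assumed to be captured inside the callables), and the opening utterance is passed in because, as noted in the caption of Figure 4, it is taken directly from the dataset.

```python
from typing import Callable, List, Tuple

Reply = Callable[[List[str]], str]  # maps the dialogue history to the next utterance


def self_play(first_utterance: str, user: Reply, agent: Reply,
              num_turns: int = 3) -> List[Tuple[str, str]]:
    """Simulate a dialogue in which A (frozen user) speaks first in every turn."""
    dialogue: List[Tuple[str, str]] = [("A", first_utterance)]
    for turn in range(1, num_turns + 1):
        history = [utterance for _, utterance in dialogue]
        dialogue.append(("B", agent(history)))      # learnable agent answers
        if turn < num_turns:
            history = [utterance for _, utterance in dialogue]
            dialogue.append(("A", user(history)))   # frozen user opens the next turn
    return dialogue
```

Only B's utterances would receive rewards in this setup; A's replies matter because they are conditioned on what B said before.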
Here we introduce the formulations needed to model our problem with reinforcement learning. A state contains the persona and the dialogue history. For example, the state for B at turn $n$ is defined as $s^B_n = \{w^B, h^B_n\}$. An action $a^B_n$ is the response to be generated. The action space is infinitely large, as the response can be arbitrarily long. Taking $s^B_n$ as input, the parameter $\theta$ defines a policy $p_\theta(a^B_n \mid s^B_n)$, through which the learnable agent generates its response.
As illustrated in Figure 4, when it is B's turn to speak, B receives $s^B_n$ and picks $a^B_n$ according to the policy $p_\theta$. As for A, it receives $s^A_n$ and generates the response $x^{A*}_n$ to simulate a user. A and B alternately produce responses until the number of turns exceeds the given limit. Once a complete dialogue is generated, the reward is collected to optimize $\theta$ using policy gradient (Sutton et al., 1999). Denoting as $R(a^B_n)$ the reward B gets at turn $n$ (more details are provided later), we optimize $\theta$ by maximizing the following objective:

$$\mathcal{L}_{rl} = \mathbb{E}_{a^B_n \sim p_\theta(a^B_n \mid s^B_n)} \left[ R(a^B_n) \right] \tag{4}$$

Applying the likelihood ratio trick, $\theta$ is updated by ascending the following gradient:

$$\nabla_\theta \mathcal{L}_{rl} = \mathbb{E}_{a^B_n \sim p_\theta(a^B_n \mid s^B_n)} \left[ \nabla_\theta \log p_\theta(a^B_n \mid s^B_n) \, R(a^B_n) \right] \tag{5}$$

As mentioned above, the space of actions $a^B_n$ is infinite. In practice, the REINFORCE algorithm (Williams, 1992) is leveraged to approximate Equation 5 by sampling $a^B_n$ from the policy $p_\theta(a^B_n \mid s^B_n)$. Furthermore, a baseline (Weaver and Tao, 2001), here the mean reward of a mini-batch, is subtracted from $R(a^B_n)$ to reduce variance. The agent samples tokens one by one through multinomial sampling over the output distribution of B, until the special token [EOS] is sampled or the maximum allowed number of decoding steps (e.g. 32) is exceeded. Compared to beam search, multinomial sampling provides more diversity.
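The two practical ingredients of this step, multinomial token sampling with an [EOS] or length stop and a mean-reward baseline, can be sketched as follows. The names `sample_response`, `advantages` and `surrogate_loss` are hypothetical, and the policy is abstracted as a callable returning a token-to-probability dictionary; a real implementation would operate on the Transmitter's output logits.

```python
import random
from typing import Callable, Dict, List, Optional

EOS = "[EOS]"


def sample_response(next_token_dist: Callable[[List[str]], Dict[str, float]],
                    max_steps: int = 32,
                    rng: Optional[random.Random] = None) -> List[str]:
    """Sample tokens one by one until [EOS] is drawn or max_steps is reached."""
    rng = rng or random.Random()
    tokens: List[str] = []
    for _ in range(max_steps):
        dist = next_token_dist(tokens)
        vocab, probs = zip(*dist.items())
        token = rng.choices(vocab, weights=probs, k=1)[0]  # multinomial draw
        if token == EOS:
            break
        tokens.append(token)
    return tokens


def advantages(rewards: List[float]) -> List[float]:
    """Subtract the mini-batch mean reward, used here as the baseline."""
    baseline = sum(rewards) / len(rewards)
    return [reward - baseline for reward in rewards]


def surrogate_loss(logprobs: List[float], rewards: List[float]) -> float:
    """Mini-batch loss whose negative gradient w.r.t. the log-probabilities
    is the REINFORCE estimate with a mean baseline."""
    adv = advantages(rewards)
    return -sum(a * lp for a, lp in zip(adv, logprobs)) / len(logprobs)
```

In an autograd framework, `logprobs` would be differentiable sequence log-probabilities of the sampled responses, and minimizing the surrogate loss ascends the gradient in Equation 5.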

Reward Shaping (RS)
As described in Section 1, we believe that a high-quality chit-chat conversation should highlight both human language modeling and mutual persona perception. Bearing this in mind, we design three rewards to address language style, discourse coherence and mutual persona perception respectively.

RS.1 Language Style
The generated responses should conform to human language styles, which we believe can be evaluated by a pretrained language model (i.e. GPT). After length normalization, the score for $a^B_n$ is given as:

$$R_1(a^B_n) = \frac{1}{|a^B_n|} \sum_{t} \log p_{lm}(a^B_{n,t} \mid a^B_{n,<t}) \tag{6}$$

where $p_{lm}$ denotes the pretrained language model, and $a^B_{n,t}$ and $a^B_{n,<t}$ are defined analogously to $x^A_{n,t}$ and $x^A_{n,<t}$ above.

RS.2 Discourse Coherence
The language score is evaluated individually, without considering discourse coherence. However, a reasonable response should establish links in meaning with its context, which is also an important aspect of human-like responses. To take discourse coherence into account, we employ the well-trained Next Utterance Predictor (mentioned in Section 3.1). The reward is given by the log probability of $a^B_n$ being the next utterance of $s^B_n$:

$$R_2(a^B_n) = \log p_\theta(y_n = 1 \mid a^B_n, s^B_n) \tag{7}$$

RS.3 Mutual Persona Perception

RS.1 and RS.2 only steer the agent training process towards human-like responding. They do not explicitly encourage understanding between interlocutors. Therefore, we design a third reward to characterize mutual persona perception. In contrast to RS.1 and RS.2, mutual persona perception is a long-term goal throughout the whole dialogue, meaning that the effect of the current action might only play out some time later. For instance, on receiving "what are your hobbies?" from B, A is highly likely to give a response relevant to A's hobbies. This suggests that not only A's response but also B's initial question contributes to mutual persona perception. Denoting as $\gamma$ the discount factor indicating how far ahead B looks, the reward of mutual persona perception for $a^B_n$ is defined as:

$$R_3(a^B_n) = r(a^B_n) + \sum_{k=n+1}^{N} \left( \gamma^{2(k-n)-1} \, r(x^{A*}_k) + \gamma^{2(k-n)} \, r(a^B_k) \right) \tag{8}$$

where $r(a^B_n)$ is the persona perception score that B obtains in the $n$-th turn, and $r(x^{A*}_k)$ is defined likewise. $r(a^B_n)$ can be computed using a score function:

$$r(a^B_n) = \mathrm{score}(a^B_n, w^B) \tag{9}$$

In P² BOT, the score function comes from Receiver, which will be elaborated in Section 4. The final reward $R(a^B_n)$ for $a^B_n$ is a weighted sum of the rewards listed above:

$$R(a^B_n) = \lambda_1 R_1(a^B_n) + \lambda_2 R_2(a^B_n) + \lambda_3 R_3(a^B_n) \tag{10}$$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are hyper-parameters.
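The long-term reward and the final weighted sum can be sketched numerically. This is a hedged illustration: the function names are hypothetical, and the discounting assumes one factor of γ per later utterance, with A speaking before B within each turn, so that A's utterance in turn k is 2(k − n) − 1 steps after B's action in turn n and B's own utterance in turn k is 2(k − n) steps after it.

```python
from typing import Sequence, Tuple


def mutual_persona_reward(agent_scores: Sequence[float],
                          user_scores: Sequence[float],
                          n: int, gamma: float = 0.5) -> float:
    """Discounted persona perception reward for B's action in turn n (1-indexed).

    agent_scores[k - 1] holds r(a^B_k) and user_scores[k - 1] holds r(x^{A*}_k).
    Assumption: one factor of gamma per utterance that follows B's action.
    """
    total_turns = len(agent_scores)
    reward = agent_scores[n - 1]
    for k in range(n + 1, total_turns + 1):
        reward += gamma ** (2 * (k - n) - 1) * user_scores[k - 1]
        reward += gamma ** (2 * (k - n)) * agent_scores[k - 1]
    return reward


def final_reward(r1: float, r2: float, r3: float,
                 lambdas: Tuple[float, float, float] = (0.4, 0.1, 0.5)) -> float:
    """Weighted sum of language style, coherence and persona perception rewards."""
    l1, l2, l3 = lambdas
    return l1 * r1 + l2 * r2 + l3 * r3
```

For a three-turn dialogue with all scores equal to 1 and γ = 0.5, the first action collects 1 + 0.5 + 0.25 + 0.125 + 0.0625, while the last action collects only its own score.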

Receiver
Receiver is devised to measure the proximity between the built impressions and the actual personas, and is trained with negative sampling. Specifically, in training, we randomly sample a persona distractor $w^Z$. Receiver is trained to identify the real persona $w^A$ from $\{w^A, w^Z\}$. In inference, for each utterance, Receiver is responsible for providing a reasonable relevance score, to model our proposed mutual persona perception. The score subsequently joins the self-play fine-tuning of Transmitter as part of the rewards, as in Equation 8.
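The negative sampling step can be illustrated with a small helper. This is a sketch under assumptions: `sample_training_pair` is a hypothetical name, personas are represented as lists of profile sentences, and the distractor is drawn uniformly from all other personas, which the text does not specify.

```python
import random
from typing import List, Sequence, Tuple

Persona = List[str]  # a persona is a list of profile sentences


def sample_training_pair(personas: Sequence[Persona], real_index: int,
                         rng: random.Random) -> Tuple[Persona, Persona]:
    """Return (real persona, distractor persona) for one training dialogue.

    Assumption: the distractor is sampled uniformly from the other personas.
    """
    other_indices = [i for i in range(len(personas)) if i != real_index]
    distractor_index = rng.choice(other_indices)
    return personas[real_index], personas[distractor_index]
```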

Training
As illustrated in Figure 5, Receiver contains two different encoders, for impression and persona respectively. Initialized by BERT (Devlin et al., 2019), both encoders provide deep contextualized representations for each token. We then average all the token representations, yielding a fixed $d$-dimensional vector for each sentence. In this way, feeding A's utterances $x^A_1, \dots, x^A_N$ into the impression encoder consecutively, we obtain the impression encoding $H^A \in \mathbb{R}^{N \times d}$. The persona encoding $W^\Delta \in \mathbb{R}^{L \times d}$ is produced likewise, where $\Delta \in \{A, Z\}$. The relevance score matrix $U^\Delta$ is computed via the scaled dot product (Vaswani et al., 2017):

$$U^\Delta = \frac{H^A (W^\Delta)^\top}{\sqrt{d}} \tag{11}$$

In essence, Receiver is expected to capture fine-grained correlations between the persona and the dialogue. However, we do not have access to gold fine-grained correlations. The only thing we know is that, compared with $W^Z$, $H^A$ is more correlated with $W^A$. Since the comparison is at a coarse granularity, we gather $U^\Delta$ into the cumulative score $c^\Delta$ through an aggregate function Agg, as shown in Figure 5. To raise $c^A$ while lowering $c^Z$, we design a marginal loss $\mathcal{L}_{rec}$, which makes $c^A$ larger than $c^Z$ by a margin $m$. Moreover, considering that an utterance generally relates to zero or one profile sentence, $L_1$ regularization is enforced to make $U^\Delta$ sparse. Combining all of these, the training loss for Receiver is:

$$\mathcal{L}_{rec} = \max(0, \, m + c^Z - c^A) + \beta \cdot |U^\Delta|_1 \tag{12}$$

where $\beta$ is a hyper-parameter for the penalty.
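The relevance matrix and the margin loss can be sketched with plain Python lists standing in for the encoder outputs. The function names are hypothetical, the cumulative scores are taken as inputs, and applying the L1 penalty to both the real and the distractor matrix is an assumption of this sketch.

```python
import math
from typing import List

Matrix = List[List[float]]


def relevance_matrix(H: Matrix, W: Matrix) -> Matrix:
    """Scaled dot product between N utterance vectors and L profile vectors."""
    d = len(H[0])
    scale = math.sqrt(d)
    return [[sum(h * w for h, w in zip(h_row, w_row)) / scale for w_row in W]
            for h_row in H]


def receiver_loss(c_real: float, c_distractor: float,
                  U_real: Matrix, U_distractor: Matrix,
                  margin: float = 0.4, beta: float = 1e-4) -> float:
    """Hinge loss pushing c_real above c_distractor by a margin, plus L1 sparsity.

    Assumption: the L1 penalty covers both relevance matrices.
    """
    hinge = max(0.0, margin + c_distractor - c_real)
    l1 = sum(abs(u) for U in (U_real, U_distractor) for row in U for u in row)
    return hinge + beta * l1
```

The hinge term vanishes once the real persona already beats the distractor by at least the margin, so well-separated pairs contribute only through the sparsity penalty.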
As for Agg, one straightforward way is to average over all positions of $U^\Delta$. However, this maximizes every entry in $U^A$, including all those that should not be activated (e.g. relevance scores between unrelated profile sentences and utterances), introducing unnecessary noise into the training of Transmitter. To alleviate the problem, we choose to implement Agg as a controllable weighted function, which summarizes $U^\Delta_{n,:}$ as:

$$\mathrm{Agg}(U^\Delta_{n,:}) = \sum_{k=1}^{L} \frac{\exp(U^\Delta_{n,k} / \tau)}{\sum_{j=1}^{L} \exp(U^\Delta_{n,j} / \tau)} \cdot U^\Delta_{n,k} \tag{13}$$

where the temperature $\tau > 0$ is a tunable parameter (Hinton et al., 2015) controlling the evolution of Agg. In the beginning, Agg behaves close to average pooling. As $\tau$ anneals, Agg gradually focuses more on the highest relevance score. In this way, noise reduces as training goes on. Finally, $c^\Delta$ is given by:

$$c^\Delta = \frac{1}{N} \sum_{n=1}^{N} \mathrm{Agg}(U^\Delta_{n,:}) \tag{14}$$
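The behavior of the temperature-controlled aggregation is easy to check numerically. The sketch below uses hypothetical function names, implements the weighting as a softmax over each row with temperature τ, and assumes the cumulative score averages the row summaries over the N utterances.

```python
import math
from typing import List


def agg(row: List[float], tau: float) -> float:
    """Softmax-weighted sum of one row of relevance scores (temperature tau > 0)."""
    scaled = [u / tau for u in row]
    shift = max(scaled)                       # subtract the max for numerical stability
    weights = [math.exp(s - shift) for s in scaled]
    normalizer = sum(weights)
    return sum(w / normalizer * u for w, u in zip(weights, row))


def cumulative_score(U: List[List[float]], tau: float) -> float:
    """Average of the row summaries (assumed normalization over utterances)."""
    return sum(agg(row, tau) for row in U) / len(U)
```

With a very large τ the weights are nearly uniform and the summary approaches the row mean; with a very small τ almost all weight falls on the largest entry, which matches the annealing behavior described above.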

Inference
Given $x^A_n$ and $w^A$, Receiver employs the following function to obtain the persona perception score of $x^A_n$, further modeling mutual persona perception as in Equation 9:

$$\mathrm{score}(x^A_n, w^A) = \max_{k} \left( \frac{H^A_{n,:} (W^A)^\top}{\sqrt{d}} \right)_k \tag{15}$$

where $H^A_{n,:}$ and $W^A$ are the impression encoding and persona encoding for $x^A_n$ and $w^A$ respectively.

Experiment
We conducted experiments on the PERSONA-CHAT dataset, assessing P² BOT using both automatic metrics and human evaluations. To verify the effectiveness of our proposed mutual persona perception, we perform a thorough model analysis in Section 5.3. Finally, we probe Receiver's capability of perceiving persona in Section 5.4.

Implementation Details
The PERSONA-CHAT dataset contains 8,939 / 1,000 multi-turn dialogues conditioned on 1,155 / 100 personas for train / dev. Each persona is described with at least 5 profile sentences. To make it more challenging, PERSONA-CHAT also provides revised personas by rephrasing, generalizing or specializing the original ones. For example, "I am overweight." is revised from "I weight 300 pounds.".
Our implementation was based on PyTorch (Paszke et al., 2019), ParlAI (Miller et al., 2017), and HuggingFace's transformers library (Wolf et al., 2019a). We used the Adam (Kingma and Ba, 2015) optimizer with a learning rate of 6.25e-5 for both Receiver and Transmitter in supervised learning. In the training of Receiver, $\tau$ was reduced linearly from 10 to 0.5. In the self-play phase of Transmitter, the learning rate was set to 1e-6. The hyper-parameters $m$, $\alpha$, $\beta$, $\gamma$, $\lambda_1$, $\lambda_2$ and $\lambda_3$ were set to 0.4, 0.1, 1e-4, 0.5, 0.4, 0.1 and 0.5 respectively. The supervised training of Transmitter lasted for 2 epochs, and the self-play fine-tuning comprised 2000 dialogues, where the number of turns was 3. The beam search size was set to 2.
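The linear temperature schedule mentioned above can be written as a one-line interpolation. This is a sketch only: `linear_anneal` is a hypothetical helper, and whether the temperature is updated per step or per epoch is not stated in the text, so the step granularity here is an assumption.

```python
def linear_anneal(step: int, total_steps: int,
                  start: float = 10.0, end: float = 0.5) -> float:
    """Linearly interpolate the temperature from start (step 0) to end (last step)."""
    if total_steps <= 1:
        return end
    fraction = min(max(step / (total_steps - 1), 0.0), 1.0)  # clamp to [0, 1]
    return start + (end - start) * fraction
```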

Methods Comparison
Our baselines fall into three categories: retrieval-based, generative-based and pretrain-finetune-based models. Among the retrieval-based baselines, KV Profile Memory (Zhang et al., 2018b) was the official baseline, which employed a memory network along with profile information, and Dually Interactive Matching Network (Gu et al., 2019) proposed a dual matching architecture to match responses with their corresponding contexts. Language Model, Generative Profile Memory (Zhang et al., 2018b) and SEQ2SEQ with attention mechanism (Bahdanau et al., 2015) were implemented as generative baselines for dialogue generation. The remaining methods were all pretrain-finetune-based. TransferTransfo (Wolf et al., 2019b) achieved the state-of-the-art performance on automatic metrics, while Lost In Conversation topped the human evaluations (Dinan et al., 2019). Analogous to our approach, they employed the pretrained language model GPT to initialize their models, and then fine-tuned it on the dataset.
Table 1 shows the experimental results on automatic metrics. Following Zhang et al. (2018b), we report the official automatic metrics to evaluate the methods: Hits@1, Perplexity (ppl) and F1. Given 20 response candidates, Hits@1 is the probability that the real response ranks the highest according to the model. Perplexity is based on the negative log-likelihood that the model assigns to the correct sequence, with lower values indicating better performance. F1 is the harmonic mean of word-level precision and recall. As observed, our approach outperforms almost all baselines and achieves new state-of-the-art performance on ppl and F1, with highly competitive performance on Hits@1. In the revised mode, our approach still achieves the best performance, obtaining a relative improvement of 13.4% on F1 against the strongest baseline. It is worth noting that we also tried to employ F1 as the reward, but the result was far from satisfactory.
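For concreteness, the three automatic metrics can be sketched as follows. These are simplified illustrations with hypothetical function names, not the official evaluation code: the F1 sketch tokenizes on whitespace without any text normalization, and perplexity is computed as the exponential of the mean per-token negative log-likelihood.

```python
import math
from collections import Counter
from typing import List


def hits_at_1(candidate_scores: List[float], gold_index: int) -> float:
    """1.0 if the gold response receives the highest model score, else 0.0."""
    best = max(range(len(candidate_scores)), key=candidate_scores.__getitem__)
    return 1.0 if best == gold_index else 0.0


def perplexity(token_logprobs: List[float]) -> float:
    """Exponential of the mean negative log-likelihood of the gold tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def word_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of word-level precision and recall (whitespace tokens)."""
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Corpus-level numbers would be averages of these per-example values over the dev set.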
As mentioned in Dinan et al. (2019), no automatic metric is perfect for evaluating such an open-domain task. Hence, we also performed crowd-sourced human evaluations on the state-of-the-art baselines (i.e. TransferTransfo and Lost In Conversation) and our proposed P² BOT. Concretely, on the original dev set, we randomly sampled 200 responses generated by these methods and asked each worker to rate them. The rating ranges from 1 to 4: 1 means the response is good only in terms of grammar and sentence structure; 2 means that, in addition to valid grammar, the response is also coherent with the context; 3 means the coherent response is also interesting and informative, instead of just a simple response like "Yes"; and 4 means the response is consistent with the persona of the interlocutor, which is extremely important for the task because it reflects whether the model can effectively utilize the persona information. As shown in Table 2, the results are consistent with the automatic evaluation results, demonstrating the superiority of P² BOT over the baselines. We also conducted Wilcoxon signed-rank tests between our method and the baselines, and the results show that the improvements are significant with p < 0.05.

Model Analysis
Variant Analysis We conducted variant analysis on P² BOT to investigate the influence of RS.1, RS.2 and RS.3. Another metric, BLEU (Papineni et al., 2002), which evaluates the quality of responses, was introduced to make the analysis more comprehensive. We show the variant analysis results in Table 3, where P² BOT-S is the variant of P² BOT trained only in the supervised setting. As expected, the results on Hits@1 validate the important role of the auxiliary task. Across all the variants, the gains in BLEU and F1 are very small, revealing the difficulty of improving them. Nevertheless, solely by adding RS.3, we obtained a 25% relative improvement on BLEU, indicating the effectiveness of our proposed mutual persona perception. Similar conclusions can be drawn from the trend of F1.

[Table 4 personas, as extracted:]
i. I love new kids on the block.
ii. I was born in the early 80's.
iii. I also like old school hip hop.
iv. My favorite toy as a child as my lite brite.

i. I am a blonde girl with really short hair.
ii. I love wearing skinny jeans and leggings.
iii. I 'm rather skinny as I like to stay in shape.
iv. My favorite hobbies are listening to music and playing video games.

Case Study For a more comprehensive comparison, we show in Table 4 some randomly sampled responses of different methods. The results suggest that the responses generated by our approach are more human-like. As observed, benefiting from our proposed mutual persona perception, the responses of P² BOT are more consistent, engaging and informative. For instance, in the last example in Table 4, the response "I'm busy with my robot project" explains why the speaker does not exercise, while also revealing that he is working on a robot, as depicted in his persona.
Error Analysis Though our approach works well in most cases, we observed that the self-play simulation might fall into repeated cycles after rounds of training, a challenge also mentioned by Li et al. (2016b). Another issue is that the bots sometimes ask redundant questions, which might be due to inappropriate hyper-parameters in reward shaping.

Persona Perception Probing
Receiver plays an important role in our approach, and we are interested in its capability of perceiving persona. Therefore, we conducted experiments on a synthesized dataset. We constructed the dataset by sampling 31 persona distractors for each dialogue in PERSONA-CHAT. Two widely used ranking metrics were used to evaluate the performance: Hits@1 and Mean Reciprocal Rank (MRR). Hits@1 is the same metric as the one mentioned in Section 5.2, except that the candidate size is 32. Given a dialogue and the complete set of profile sentences, MRR is the average reciprocal rank of the dialogue-relevant profile sentences. Two simple baselines, Random and IR (Sordoni et al., 2015), were chosen for comparison. Table 5 shows the experimental results of different methods on the synthesized dataset. As observed, our approach achieved excellent results on both original and revised modes. For example, compared with the IR baseline, our approach achieved an absolute improvement of 26.3% on Hits@1 in the original mode. In addition, the surprising results in the revised mode further demonstrate Receiver's capability to perceive rephrased persona.

Figure 6: Visualization of the relevance scores between a sampled dialogue and its corresponding revised persona. Deeper color means higher score. We omit some context due to space limitation. [Axis labels: Revised Persona, Dialogue.]
To further understand the trained Receiver, we visualize the relevance scores between a sampled dialogue and its corresponding revised persona in Figure 6. As illustrated, the relevance scores between related profile sentences and dialogue utterances are significantly higher. For example, the utterance "I volunteer at the local pool" from the interlocutor implies the profile "I love being in the water", and our Receiver successfully captures the relevance between them.
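The MRR metric used in this probing experiment can be made concrete with a short sketch. The function name and input layout are hypothetical: each profile sentence gets one relevance score for the dialogue, and the set of dialogue-relevant profile sentences is assumed to be given as indices.

```python
from typing import List, Set


def mean_reciprocal_rank(scores: List[float], relevant: Set[int]) -> float:
    """Average of 1 / rank over the dialogue-relevant profile sentences.

    scores[i] is the relevance of profile sentence i to the dialogue;
    rank 1 is assigned to the highest-scoring sentence.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    rank_of = {index: position + 1 for position, index in enumerate(order)}
    return sum(1.0 / rank_of[i] for i in relevant) / len(relevant)
```

For example, if two relevant profile sentences are ranked first and second among three, the value is (1 + 1/2) / 2 = 0.75.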

Related Work
Methods to build open-domain dialogue systems generally fall into two major categories: retrieval-based and generative-based. Retrieval-based methods retrieve response candidates and rank them based on their matching scores with the dialogue (Sordoni et al., 2015; Wu et al., 2017; Gu et al., 2019). Generative-based methods typically use a SEQ2SEQ model as the backbone (Sutskever et al., 2014; Bahdanau et al., 2015; Serban et al., 2017; Wolf et al., 2019b), where the encoder extracts the information in an utterance and the decoder generates the response. Our work adopts a similar architecture. Besides supervised learning, researchers have also explored reinforcement learning based methods. Lewis et al. (2017) applied reinforcement learning to negotiation dialogues and showed that it outperforms supervised learning when negotiating with humans. Yang et al. (2018) proposed to generate dialogue responses by dual learning based domain adaptation. Zhang et al. (2018a) built a coherence model to provide the reward signal for penalizing dull responses. Liu et al. (2019) employed reinforcement learning to learn an intermediate structure span. Our approach differs from this line of work in that we focus on improving personalized dialogues via mutual persona perception, which has not been explored before.
More recently, under the topic of dialogue personalization, Zemlyanskiy and Sha (2018) proposed a post-processing method to re-rank candidates generated by beam search, while Olabiyi et al. (2019) employed adversarial approaches to solve the consistency problem on interlocutors' names. Madotto et al. (2019) applied meta-learning to quickly adapt to new speakers, and Tigunova et al. (2019) extracted user attributes from daily dialogues. Compared with them, our work enhances persona-based dialogue generation from a novel perspective.
Furthermore, researchers have explored generating diverse responses conditioned on persona (Song et al., 2019, 2020). Personalization in goal-oriented dialogue systems has also received some attention (Joshi et al., 2017; Luo et al., 2019). These studies focus more on making goal-oriented bots adjust their responses according to different user profiles, while we aim to endow bots with persistent personalities.

Conclusion & Future Work
We propose P² BOT, a transmitter-receiver framework which explicitly models understanding between interlocutors. Under this framework, mutual persona perception is incorporated as a reward signal to achieve personalized dialogue generation. Experiments on a large public dataset, PERSONA-CHAT, demonstrate the effectiveness of our approach. For future work, we would like to extend Receiver to conversational recommender systems. After turns of chatting, the agent should be able to infer the user's persona, based on which personalized content can be recommended.
[Figure 1 text, as extracted:] Hello how are you, I am new to the Springfield area. I bought my first home. I love to barbecue. I live in Springfield. I'm a writer. I weight 300 pounds. I am not healthy. I am a man. I like The Godfather.

Figure 2: The overview of P² BOT (see text).

Figure 3: The overall architecture of Transmitter. "Block" is short for "Transformer Block". Arrows bridge the current block to subsequent blocks of its following layer. Position encoding incorporates position information into the block by assigning an embedding to each absolute position in the sequence. Here we omit the architecture inside the block, and refer the readers to Vaswani et al. (2017) for more details. [MASK] tokens are ignored in the training objective.

Figure 4: The illustration of the self-play procedure. Arrows ⇒ represent the process of dialogue generation driven by Transmitter. Note that $x^{A*}_1$ is directly taken from the dataset, as it is difficult to generate high-quality utterances without any dialogue history.

Figure 5: The overall architecture of Receiver (see text).
[Table 4 content, as extracted; the three examples in each row are separated by " / ", and rows marked * are sampled responses.]
PERSONA (third example): i. I've an allergy to water. ii. I'm a famous twitch streamer. iii. I helped design the game StarCraft. iv. I am working on building a robot to be my body double.
CONTEXT: Do you like clowns? They horrify me! / What video game? I relax by knitting. / Love it! Cool. I did hardcore exercise four hours everyday.
* HUMAN: No! My favorite toy as a kid was lite brite because I was so scared! / I was playing Fallout. What do you look like? I'm blonde thin. / That's a lot of exercising. I hate exercising. I just work on my robot most days.
* P² BOT: I do not like them at all. I am a little old for them. / I play RPGs. I like to wear skinny jeans and tees. / I do not exercise at all. I'm busy with my robot project.
* TRANSFERTRANSFO: I do not have time for clowns. Do you like to listen to music? / My fav color is blue, I have a lot of friends in my group. / I work at a computer company. I could make you an Android!
* LOST IN CONVERSATION: I love clowns. they are my favorite. / I love HALO 3, what do you knit? / That sounds like a lot of fun!

[Figure 6 axis labels, as extracted. Dialogue: "I enjoy death metal", "I volunteer at the local pool", "… in India, that is where I'm from", "I'm learning about computers", "It is very basic but helpful", "I find myself to be …". Revised persona: "I'm a student", "I listen to punk", "I love being in water", "I'm not from the U.S."]

Table 1: Automatic evaluation results of different methods on the PERSONA-CHAT dataset. The standard deviation [σ] (across 5 runs) of P² BOT is also reported. All the results were evaluated on the dev set since the test set was not publicly available.

Table 2: Human evaluation results.

Table 3: Variant analysis results on PERSONA-CHAT revised mode, along with relative improvements (shown inside brackets) compared with P² BOT-S. BLEU refers to the cumulative 4-gram BLEU score. "-Persona" means dialogue generation without personas; "-Next" ablates the auxiliary task mentioned in Section 3.1; "+ RS.1" means only using the Language Style score as the reward in the self-play fine-tuning phase; "→ + RS.2" means adding Discourse Coherence to the reward on the basis of RS.1; "→ + RS.3" is equivalent to our proposed P² BOT.

Table 4: Sampled responses (*) by Human, P² BOT and the state-of-the-art baselines.

Table 5: Experimental results on Persona Perception.