A Persona-Based Neural Conversation Model

We present persona-based models for handling the issue of speaker consistency in neural response generation. A speaker model encodes personas in distributed embeddings that capture individual characteristics such as background information and speaking style. A dyadic speaker-addressee model captures properties of interactions between two interlocutors. Our models yield qualitative performance improvements in both perplexity and BLEU scores over baseline sequence-to-sequence models, with similar gains in speaker consistency as measured by human judges.


Introduction
As conversational agents gain traction as user interfaces, there has been growing research interest in training naturalistic conversation systems from large volumes of human-to-human interactions (Ritter et al., 2011; Sordoni et al., 2015; Vinyals and Le, 2015; Li et al., 2016). One major issue for these data-driven systems is their propensity to select the response with greatest likelihood, in effect a consensus response of the humans represented in the training data. Outputs are frequently vague or non-committal (Li et al., 2016), and when not, they can be wildly inconsistent, as illustrated in Table 1.
In this paper, we address the challenge of consistency and how to endow data-driven systems with the coherent "persona" needed to model humanlike behavior, whether as personal assistants, personalized avatar-like agents, or game characters. For present purposes, we will define PERSONA as the character that an artificial agent, as actor, plays or performs during conversational interactions. A persona can be viewed as a composite of elements of identity (background facts or user profile), language behavior, and interaction style. A persona is also adaptive, since an agent may need to present different facets to different human interlocutors depending on the interaction.
Fortunately, neural models of conversation generation (Sordoni et al., 2015; Shang et al., 2015; Vinyals and Le, 2015; Li et al., 2016) provide a straightforward mechanism for incorporating personas as embeddings. We therefore explore two persona models, a single-speaker SPEAKER MODEL and a dyadic SPEAKER-ADDRESSEE MODEL, within a sequence-to-sequence (SEQ2SEQ) framework (Sutskever et al., 2014). The Speaker Model integrates a speaker-level vector representation into the target part of the SEQ2SEQ model. Analogously, the Speaker-Addressee model encodes the interaction patterns of two interlocutors by constructing an interaction representation from their individual embeddings and incorporating it into the SEQ2SEQ model. These persona vectors are trained on human-human conversation data and used at test time to generate personalized responses. Our experiments on an open-domain corpus of Twitter conversations and dialog datasets comprising TV series scripts show that leveraging persona vectors can improve relative performance up to 20% in BLEU score and 12% in perplexity, with a commensurate gain in consistency as judged by human annotators.

Related Work
This work follows the line of investigation initiated by Ritter et al. (2011), who treat generation of conversational dialog as a statistical machine translation (SMT) problem. Ritter et al. (2011) represents a break with previous and contemporaneous dialog work that relies extensively on hand-coded rules, typically either building statistical models on top of heuristic rules or templates (Levin et al., 2000; Young et al., 2010; Walker et al., 2003; Pieraccini et al., 2009; Wang et al., 2011) or learning generation rules from a minimal set of authored rules or labels (Oh and Rudnicky, 2000; Ratnaparkhi, 2002; Banchs and Li, 2012; Ameixa et al., 2014; Nio et al., 2014; Chen et al., 2013). More recently, Wen et al. (2015) have used a Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) to learn from unaligned data in order to reduce the heuristic space of sentence planning and surface realization.
The SMT model proposed by Ritter et al., on the other hand, is end-to-end, purely data-driven, and contains no explicit model of dialog structure; the model learns to converse from human-to-human conversational corpora. Progress in SMT stemming from the use of neural language models (Sutskever et al., 2014; Gao et al., 2014; Bahdanau et al., 2015; Luong et al., 2015) has inspired efforts to extend these neural techniques to SMT-based conversational response generation. Sordoni et al. (2015) augment Ritter et al. (2011) by rescoring outputs using a SEQ2SEQ model conditioned on conversation history. Other researchers have recently used SEQ2SEQ to directly generate responses in an end-to-end fashion without relying on SMT phrase tables (Serban et al., 2015; Shang et al., 2015; Vinyals and Le, 2015). Serban et al. (2015) propose a hierarchical neural model aimed at capturing dependencies over an extended conversation history. Recent work by Li et al. (2016) measures mutual information between message and response in order to reduce the proportion of generic responses typical of SEQ2SEQ systems. Yao et al. (2015) employ an intention network to maintain the relevance of responses.
Modeling of users and speakers has been extensively studied within the standard dialog modeling framework (e.g., Wahlster and Kobsa, 1989; Kobsa, 1990; Schatzmann et al., 2005; Lin and Walker, 2011). Since generating meaningful responses in an open-domain scenario is intrinsically difficult in conventional dialog systems, existing models often focus on generalizing character style on the basis of qualitative statistical analysis (Walker et al., 2012; Walker et al., 2011). The present work, by contrast, is in the vein of the SEQ2SEQ models of Vinyals and Le (2015) and Li et al. (2016), enriching these models by training persona vectors directly from conversational data and relevant side-information, and incorporating these directly into the decoder.
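
Sequence-to-Sequence Models
Given a sequence of inputs $X = \{x_1, x_2, \ldots, x_{n_X}\}$, an LSTM associates each time step with an input gate, a memory gate and an output gate, respectively denoted as $i_t$, $f_t$ and $o_t$. We distinguish $e$ and $h$, where $e_t$ denotes the vector for an individual text unit (for example, a word or sentence) at time step $t$, while $h_t$ denotes the vector computed by the LSTM model at time $t$ by combining $e_t$ and $h_{t-1}$. $c_t$ is the cell state vector at time $t$, and $\sigma$ denotes the sigmoid function. Then, the vector representation $h_t$ for each time step $t$ is given by:

$$
\begin{bmatrix} i_t \\ f_t \\ o_t \\ l_t \end{bmatrix}
=
\begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix}
W \cdot \begin{bmatrix} h_{t-1} \\ e_t \end{bmatrix}
$$

$$
c_t = f_t \cdot c_{t-1} + i_t \cdot l_t
$$

$$
h_t = o_t \cdot \tanh(c_t)
$$

where $W \in \mathbb{R}^{4K \times 2K}$ stacks the weight matrices of the three gates and of the candidate vector $l_t$, and $K$ is the dimensionality of $e_t$ and $h_t$.
In SEQ2SEQ generation tasks, each input $X$ is paired with a sequence of outputs to predict: $Y = \{y_1, y_2, \ldots, y_{n_Y}\}$. The LSTM defines a distribution over outputs and sequentially predicts tokens using a softmax function:

$$
p(Y \mid X) = \prod_{t=1}^{n_Y} p(y_t \mid x_1, \ldots, x_{n_X}, y_1, \ldots, y_{t-1})
= \prod_{t=1}^{n_Y} \frac{\exp\big(f(h_{t-1}, e_{y_t})\big)}{\sum_{y'} \exp\big(f(h_{t-1}, e_{y'})\big)}
$$

where $f(h_{t-1}, e_{y_t})$ denotes the activation function between $h_{t-1}$ and $e_{y_t}$. Each sentence terminates with a special end-of-sentence symbol EOS. In keeping with common practices, inputs and outputs use different LSTMs with separate parameters to capture different compositional patterns.
During decoding, the algorithm terminates when an EOS token is predicted. At each time step, either a greedy approach or beam search can be adopted for word prediction.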
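The greedy variant of the decoding loop described above can be sketched as follows. This is only an illustration: `toy_scores` is a hypothetical stand-in for the trained decoder's scoring function f(h_{t−1}, e_y), and the five-word vocabulary is invented for the example.

```python
import math

EOS = "EOS"
VOCAB = ["i", "live", "in", "london", EOS]  # toy vocabulary (assumption)

def softmax(scores):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def toy_scores(prefix):
    """Hypothetical decoder: strongly favors the vocabulary item at index len(prefix)."""
    target = min(len(prefix), len(VOCAB) - 1)
    return [5.0 if k == target else 0.0 for k in range(len(VOCAB))]

def greedy_decode(score_fn, max_len=20):
    """Pick the most probable token at each step; stop once EOS is predicted."""
    prefix = []
    for _ in range(max_len):
        probs = softmax(score_fn(prefix))
        best = max(range(len(VOCAB)), key=lambda k: probs[k])
        prefix.append(VOCAB[best])
        if VOCAB[best] == EOS:
            break
    return prefix

decoded = greedy_decode(toy_scores)
```

Beam search replaces the single argmax with the B best continuations per hypothesis; the termination condition on EOS is the same.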

Personalized Response Generation
Our work introduces two persona-based models: the Speaker Model, which models the personality of the respondent, and the Speaker-Addressee Model, which models the way the respondent adapts their speech to a given addressee, a linguistic phenomenon known as lexical entrainment (Deutsch and Pechmann, 1982).

Notation
For the response generation task, let $M$ denote the input word sequence (message) $M = \{m_1, m_2, \ldots, m_I\}$. $R$ denotes the word sequence in response to $M$, where $R = \{r_1, r_2, \ldots, r_J, \text{EOS}\}$ and $J$ is the length of the response (terminated by an EOS token). $r_t$ denotes a word token that is associated with a $K$-dimensional distinct word embedding $e_t$. $V$ is the vocabulary size.

Speaker Model
Our first model is the Speaker Model, which models the respondent alone. This model represents each individual speaker as a vector or embedding, which encodes speaker-specific information (e.g., dialect, register, age, gender, personal information) that influences the content and style of her responses. Note that these attributes are not explicitly annotated, which would be tremendously expensive for our datasets. Instead, our model manages to cluster users along some of these traits (e.g., age, country of residence) based on the responses alone.
Figure 1 gives a brief illustration of the Speaker Model. Each speaker $i \in [1, N]$ is associated with a user-level representation $v_i \in \mathbb{R}^{K \times 1}$. As in standard SEQ2SEQ models, we first encode message $S$ into a vector representation $h_S$ using the source LSTM. Then for each step in the target side, hidden units are obtained by combining the representation produced by the target LSTM at the previous time step, the word representations at the current time step, and the speaker embedding $v_i$:

$$
\begin{bmatrix} i_t \\ f_t \\ o_t \\ l_t \end{bmatrix}
=
\begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix}
W \cdot \begin{bmatrix} h_{t-1} \\ e_t \\ v_i \end{bmatrix}
$$

$$
c_t = f_t \cdot c_{t-1} + i_t \cdot l_t
$$

$$
h_t = o_t \cdot \tanh(c_t)
$$

where $W \in \mathbb{R}^{4K \times 3K}$. In this way, speaker information is encoded and injected into the hidden layer at each time step and thus helps predict personalized responses throughout the generation process. The speaker embedding $v_i$ is shared across all conversations that involve speaker $i$. The embeddings $\{v_i\}$ are learned by backpropagating word prediction errors to each neural component during training.
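To make the role of the speaker embedding concrete, the following sketch runs one target-side LSTM step in which the input is the concatenation of the previous hidden state, the current word embedding, and the speaker vector, so that the weight matrix has shape 4K × 3K. It is a toy, pure-Python illustration with randomly initialized weights and K = 4; the function name `speaker_lstm_step` and all values are assumptions made for the example, not the trained model.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def speaker_lstm_step(W, h_prev, c_prev, e_t, v_i):
    """One LSTM step with the speaker embedding v_i appended to the input.

    W has 4K rows and 3K columns: it maps [h_prev; e_t; v_i] to the stacked
    pre-activations of the input gate, memory (forget) gate, output gate,
    and candidate vector l_t.
    """
    K = len(h_prev)
    x = h_prev + e_t + v_i                                   # concatenation, length 3K
    z = [sum(w * xj for w, xj in zip(row, x)) for row in W]  # length 4K
    i_t = [sigmoid(a) for a in z[0:K]]
    f_t = [sigmoid(a) for a in z[K:2 * K]]
    o_t = [sigmoid(a) for a in z[2 * K:3 * K]]
    l_t = [math.tanh(a) for a in z[3 * K:4 * K]]
    c_t = [f * c + i * l for f, c, i, l in zip(f_t, c_prev, i_t, l_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

K = 4
rng = random.Random(0)
W = [[rng.uniform(-0.1, 0.1) for _ in range(3 * K)] for _ in range(4 * K)]
h0, c0 = [0.0] * K, [0.0] * K
e_t = [rng.uniform(-0.1, 0.1) for _ in range(K)]
v_a = [rng.uniform(-0.1, 0.1) for _ in range(K)]  # embedding of a hypothetical speaker a
v_b = [rng.uniform(-0.1, 0.1) for _ in range(K)]  # embedding of a hypothetical speaker b

# Same word, same previous state, different speakers: the hidden states differ.
h_a, c_a = speaker_lstm_step(W, h0, c0, e_t, v_a)
h_b, c_b = speaker_lstm_step(W, h0, c0, e_t, v_b)
```

Because the speaker vector enters every step, two speakers given the same message and the same partial response can still end up with different next-word distributions.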
Another useful property of this model is that it helps infer answers to questions even if the evidence is not readily present in the training set. This is important as the training data does not contain explicit information about every attribute of each user (e.g., gender, age, country of residence). The model learns speaker representations based on conversational content produced by different speakers, and speakers producing similar responses tend to have similar embeddings, occupying nearby positions in the vector space. This way, the training data of speakers nearby in vector space help increase the generalization capability of the speaker model. For example, consider two speakers i and j who sound distinctly British, and who are therefore close in speaker embedding space. Now, suppose that, in the training data, speaker i was asked "Where do you live?" and responded "in the UK". Even if speaker j was never asked the same question, this answer can help influence a good response from speaker j, and this without explicitly labeled geo-location information.
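The notion of speakers being "close in embedding space" can be illustrated with a nearest-neighbor lookup. The sketch below uses cosine similarity, which is one common choice and an assumption here (the text does not name a distance measure); the two-dimensional embeddings and speaker names are invented.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_speakers(embeddings, query_id, k=1):
    """Return the ids of the k speakers most similar to query_id."""
    q = embeddings[query_id]
    scored = [(cosine(q, v), sid) for sid, v in embeddings.items() if sid != query_id]
    scored.sort(reverse=True)
    return [sid for _, sid in scored[:k]]

# Toy embeddings (hypothetical): speakers i and j point in similar directions.
embeddings = {
    "speaker_i": [0.9, 0.1],
    "speaker_j": [0.8, 0.2],
    "speaker_k": [-0.7, 0.6],
}
neighbors_of_j = nearest_speakers(embeddings, "speaker_j")
```

In this toy setting speaker i is the nearest neighbor of speaker j, which is the situation in which an answer observed for i can inform a response generated for j.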

Speaker-Addressee Model
A natural extension of the Speaker Model is a model that is sensitive to speaker-addressee interaction patterns within the conversation. Indeed, speaking style, register, and content do not vary only with the identity of the speaker, but also with that of the addressee. For example, in scripts for the TV series Friends used in some of our experiments, the character Ross often talks differently to his sister Monica than to Rachel, with whom he is engaged in an on-again off-again relationship throughout the series.
The proposed Speaker-Addressee Model operates as follows: We wish to predict how speaker i would respond to a message produced by speaker j.
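Similarly to the Speaker Model, we associate each speaker with a $K$-dimensional speaker-level representation, namely $v_i$ for user $i$ and $v_j$ for user $j$. We obtain an interactive representation $V_{i,j} \in \mathbb{R}^{K \times 1}$ by linearly combining user vectors $v_i$ and $v_j$ in an attempt to model the interactive style of user $i$ towards user $j$,

$$
V_{i,j} = \tanh(W_1 \cdot v_i + W_2 \cdot v_j)
$$

where $W_1, W_2 \in \mathbb{R}^{K \times K}$. $V_{i,j}$ is then linearly incorporated into LSTM models at each step in the target:

$$
\begin{bmatrix} i_t \\ f_t \\ o_t \\ l_t \end{bmatrix}
=
\begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix}
W \cdot \begin{bmatrix} h_{t-1} \\ e_t \\ V_{i,j} \end{bmatrix}
$$

$$
c_t = f_t \cdot c_{t-1} + i_t \cdot l_t
$$

$$
h_t = o_t \cdot \tanh(c_t)
$$

$V_{i,j}$ depends on both speaker and addressee, and the same speaker will thus respond differently to a message from different interlocutors. One potential issue with Speaker-Addressee modelling is the difficulty involved in collecting a large-scale training dataset in which each speaker is involved in conversation with a wide variety of people.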
Like the Speaker Model, however, the Speaker-Addressee Model derives generalization capabilities from speaker embeddings. Even if the two speakers at test time (i and j) were never involved in the same conversation in the training data, two speakers i′ and j′ who are respectively close to i and j in embedding space may have been, and this can help model how i should respond to j.
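A small sketch of the interaction representation may help. It combines the speaker and addressee vectors through two K × K matrices and applies tanh to the sum; the matrices, the two-dimensional character vectors, and the helper names are toy assumptions for illustration only.

```python
import math

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def interaction_vector(W1, W2, v_speaker, v_addressee):
    """V_ij = tanh(W1 v_i + W2 v_j): style of speaker i towards addressee j."""
    a = matvec(W1, v_speaker)
    b = matvec(W2, v_addressee)
    return [math.tanh(x + y) for x, y in zip(a, b)]

# Toy parameters and embeddings (hypothetical values).
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[0.5, 0.0], [0.0, -0.5]]
ross, monica, rachel = [0.2, -0.1], [0.4, 0.4], [-0.4, 0.6]

V_ross_monica = interaction_vector(W1, W2, ross, monica)
V_ross_rachel = interaction_vector(W1, W2, ross, rachel)
V_monica_ross = interaction_vector(W1, W2, monica, ross)
```

Two properties are visible even in this toy case: the same speaker gets a different vector for each addressee, and the representation is not symmetric, since W1 and W2 treat the speaker and addressee roles differently.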

Decoding and Reranking
For decoding, the N-best lists are generated using the decoder with beam size B = 200. We set a maximum length of 20 for the generated candidates. Decoding operates as follows: At each time step, we first examine all B × B possible next-word candidates, and add all hypotheses ending with an EOS token to the N-best list. We then preserve the top-B unfinished hypotheses and move to the next word position.
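The decoding procedure above can be sketched in a few lines. This is a simplified illustration rather than the system's decoder: `next_log_probs` is a hypothetical hook that returns log-probabilities for the next token given a prefix, and the toy model and its numbers are invented.

```python
import math

EOS = "EOS"

def beam_search(next_log_probs, beam_size, max_len):
    """Return finished hypotheses as (tokens, log_prob), best first.

    next_log_probs(prefix) is a hypothetical model hook returning a dict
    {token: log p(token | prefix)}.
    """
    beam = [((), 0.0)]   # unfinished hypotheses kept between steps
    n_best = []          # hypotheses that ended with EOS
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            # Top-B continuations of each of the (at most) B hypotheses: B x B candidates.
            expansions = sorted(next_log_probs(prefix).items(),
                                key=lambda kv: kv[1], reverse=True)[:beam_size]
            for token, lp in expansions:
                candidates.append((prefix + (token,), score + lp))
        # Finished hypotheses go to the N-best list.
        n_best.extend(c for c in candidates if c[0][-1] == EOS)
        # Keep the top-B unfinished hypotheses for the next position.
        unfinished = sorted((c for c in candidates if c[0][-1] != EOS),
                            key=lambda c: c[1], reverse=True)
        beam = unfinished[:beam_size]
        if not beam:
            break
    return sorted(n_best, key=lambda c: c[1], reverse=True)

def toy_model(prefix):
    """Invented next-token distribution used only to exercise the search."""
    if len(prefix) == 0:
        return {"i": math.log(0.6), "no": math.log(0.3), EOS: math.log(0.1)}
    return {"know": math.log(0.2), EOS: math.log(0.8)}

n_best = beam_search(toy_model, beam_size=2, max_len=3)
```

With these toy probabilities the search yields four finished hypotheses, the best being ("i", "EOS") with probability 0.6 × 0.8 = 0.48.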
To deal with the issue that SEQ2SEQ models tend to generate generic and commonplace responses such as "I don't know", we follow Li et al. (2016) by reranking the generated N-best list using a scoring function that linearly combines a length penalty and the log likelihood of the source given the target:

$$
\log p(R \mid M, v) + \lambda \log p(M \mid R) + \gamma |R|
$$

where $p(R \mid M, v)$ denotes the probability of the generated response given the message $M$ and the respondent's speaker ID. $|R|$ denotes the length of the target and $\gamma$ denotes the associated penalty weight. We optimize $\gamma$ and $\lambda$ on N-best lists of response candidates generated from the development set using MERT (Och, 2003) by optimizing BLEU. To compute $p(M \mid R)$, we train an inverse SEQ2SEQ model by swapping messages and responses. We trained standard SEQ2SEQ models for $p(M \mid R)$ with no speaker information considered.
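As an illustration of how such a reranker behaves, the sketch below scores each candidate with the forward log-likelihood, a weighted backward log-likelihood of the message given the response, and a length term. The field names, the candidate scores, and the weights λ = 0.5 and γ = 0.1 are all made up for the example; in the paper the weights are tuned with MERT.

```python
def rerank(n_best, lam, gamma):
    """Sort candidates by log p(R|M,v) + lam * log p(M|R) + gamma * |R|, best first."""
    def score(cand):
        return (cand["log_p_r_given_m"]
                + lam * cand["log_p_m_given_r"]
                + gamma * len(cand["response"]))
    return sorted(n_best, key=score, reverse=True)

# Hypothetical candidates: the generic reply is likelier under the forward model,
# but the specific reply explains the message better under the inverse model.
candidates = [
    {"response": ["i", "don't", "know"],
     "log_p_r_given_m": -2.0, "log_p_m_given_r": -9.0},
    {"response": ["i", "live", "in", "london"],
     "log_p_r_given_m": -3.0, "log_p_m_given_r": -4.0},
]

by_likelihood = rerank(candidates, lam=0.0, gamma=0.0)
by_full_score = rerank(candidates, lam=0.5, gamma=0.1)
```

With λ = γ = 0 the generic candidate ranks first (−2.0 versus −3.0); with the toy weights the scores become −6.2 and −4.6, so the more specific response wins.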

Twitter Persona Dataset
Data Collection Training data for the Speaker Model was extracted from the Twitter FireHose for the six-month period beginning January 1, 2012. We limited the sequences to those where the responders had engaged in at least 60 (and at most 300) 3-turn conversational interactions during the period, in other words, users who reasonably frequently engaged in conversation. This yielded a set of 74,003 users who took part in a minimum of 60 and a maximum of 164 conversational turns (average: 92.24, median: 90). The dataset extracted using responses by these "conversationalists" contained 24,725,711 3-turn sliding-window (context-message-response) conversational sequences.
In addition, we sampled 12,000 3-turn conversations from the same user set from the Twitter FireHose for the three-month period beginning July 1, 2012, and set these aside as development, validation, and test sets (4,000 conversational sequences each). Note that development, validation, and test sets for this data are single-reference, which is by design. Multiple reference responses would typically require acquiring responses from different people, which would confound different personas.
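The two data-preparation steps described above, extracting 3-turn sliding windows and keeping only sufficiently active responders, might look roughly like the following. The tuple layout, function names, and the tiny example conversation are assumptions for illustration; the actual extraction pipeline is not described at this level of detail.

```python
from collections import Counter

def sliding_triples(turns):
    """turns: list of (speaker_id, text) in conversation order.

    Returns (context, message, response, responder_id) for every 3-turn window.
    """
    return [(turns[k][1], turns[k + 1][1], turns[k + 2][1], turns[k + 2][0])
            for k in range(len(turns) - 2)]

def filter_by_activity(triples, min_count=60, max_count=300):
    """Keep triples whose responder appears between min_count and max_count times."""
    counts = Counter(t[3] for t in triples)
    keep = {user for user, c in counts.items() if min_count <= c <= max_count}
    return [t for t in triples if t[3] in keep]

# Toy conversation with invented users; thresholds lowered so the filter is visible.
turns = [("a", "hi"), ("b", "hello"), ("a", "how are you"), ("b", "fine"), ("a", "good")]
triples = sliding_triples(turns)
kept = filter_by_activity(triples, min_count=2, max_count=300)
```

In the toy run, user "a" is the responder in two of the three windows and user "b" in one, so only the two windows answered by "a" survive a minimum count of 2.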
Training Protocols We trained four-layer SEQ2SEQ models on the Twitter corpus following the approach of Sutskever et al. (2014). Details are as follows:
• 4-layer LSTM models with 1,000 hidden cells for each layer.
• Batch size is set to 128.
• Learning rate is set to 1.0.
• Parameters are initialized by sampling from the uniform distribution [−0.1, 0.1].
• Gradients are clipped to avoid gradient explosion with a threshold of 5.
• Vocabulary size is limited to 50,000.
• Dropout rate is set to 0.2.
Source and target LSTMs use different sets of parameters. We ran 14 epochs, and training took roughly a month to finish on a Tesla K40 GPU machine.
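The hyperparameters listed above can be collected in a small configuration, together with sketches of the two steps that involve a computation: uniform initialization and gradient clipping. The clipping shown rescales the gradient when its L2 norm exceeds the threshold; this is one common reading of "clipped with a threshold of 5" and is an assumption here, as are the dictionary keys and function names.

```python
import math
import random

CONFIG = {
    "layers": 4,
    "hidden_cells": 1000,
    "batch_size": 128,
    "learning_rate": 1.0,
    "init_range": 0.1,       # parameters drawn from U[-0.1, 0.1]
    "clip_threshold": 5.0,
    "vocab_size": 50000,
    "dropout": 0.2,
}

def init_uniform(n, rng, init_range):
    """Draw n parameters from the uniform distribution [-init_range, init_range]."""
    return [rng.uniform(-init_range, init_range) for _ in range(n)]

def clip_by_norm(grad, threshold):
    """Rescale grad so its L2 norm does not exceed threshold (assumed clipping rule)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    scale = threshold / norm
    return [g * scale for g in grad]

params = init_uniform(1000, random.Random(0), CONFIG["init_range"])
clipped = clip_by_norm([30.0, 40.0], CONFIG["clip_threshold"])  # norm 50 is rescaled to 5
```

Rescaling by the norm preserves the direction of the update, which is why it is often preferred over clipping each component independently.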
As only speaker IDs of responses were specified when compiling the Twitter dataset, experiments on this dataset were limited to the Speaker Model.

Twitter Sordoni Dataset
The Twitter Persona Dataset was collected for this paper for experiments with speaker ID information. To obtain a point of comparison with prior state-of-the-art work (Sordoni et al., 2015; Li et al., 2016), we measure our baseline (non-persona) LSTM model against prior work on the dataset of Sordoni et al. (2015), which we call the Twitter Sordoni Dataset. We only use its test-set portion, which contains responses for 2,114 contexts and messages. It is important to note that the Sordoni dataset offers up to 10 references per message, while the Twitter Persona dataset has only 1 reference per message. Thus BLEU scores cannot be compared across the two Twitter datasets (BLEU scores on 10 references are generally much higher than with 1 reference). Details of this dataset are in Sordoni et al. (2015).

Television Series Transcripts
Data Collection For the dyadic Speaker-Addressee Model we used scripts from the American television comedies Friends and The Big Bang Theory, available from Internet Movie Script Database (IMSDb). We collected 13 main characters from the two series in a corpus of 69,565 turns. We split the corpus into training/development/testing sets, with development and testing sets each of about 2,000 turns.
Training Since the relatively small size of the dataset does not allow for training an open-domain dialog model, we adopted a domain adaptation strategy where we first trained standard SEQ2SEQ models using a much larger OpenSubtitles (OSDb) dataset (Tiedemann, 2009), and then adapted the pre-trained model to the TV series dataset.

System | BLEU
MT baseline (Ritter et al., 2011) | 3.60%
Standard LSTM MMI (Li et al., 2016) | 5.26%
Standard LSTM MMI (our system) | 5.82%
Human | 6.08%

Table 2: BLEU on the Twitter Sordoni dataset (10 references). We contrast our baseline against an SMT baseline (Ritter et al., 2011), and the best result (Li et al., 2016) on the established dataset of Sordoni et al. (2015). The last result is for a human oracle, but it is not directly comparable as the oracle BLEU is computed in a leave-one-out fashion, having one less reference available. We nevertheless provide this result to give a sense that these BLEU scores of 5-6% are not unreasonable.

The OSDb dataset is a large, noisy, open-domain dataset containing roughly 60M-70M scripted lines spoken by movie characters. This dataset does not specify which character speaks each subtitle line, which prevents us from inferring speaker turns. Following Vinyals and Le (2015), we make the simplifying assumption that each line of subtitle constitutes a full speaker turn. We trained standard SEQ2SEQ models on the OSDb dataset, following the protocols already described in Section 5.1. We ran 10 iterations over the training set.
We initialize word embeddings and LSTM parameters in the Speaker Model and the Speaker-Addressee model using parameters learned from the OpenSubtitles dataset. User embeddings are randomly initialized from [−0.1, 0.1]. We then ran 5 additional epochs until the perplexity on the development set stabilized.

Evaluation
Following Sordoni et al. (2015) and Li et al. (2016), we used BLEU (Papineni et al., 2002) for parameter tuning and evaluation. BLEU has been shown to correlate well with human judgment on the response generation task, as demonstrated in Galley et al. (2015). Besides BLEU scores, we also report perplexity as an indicator of model capability.
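For readers less familiar with the second metric, perplexity is conventionally the exponential of the average negative log-likelihood per token; the sketch below uses that standard definition, which is assumed rather than spelled out in the text. The function name and the toy probabilities are illustrative.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-probability the model assigns to each reference token."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# A model that assigns probability 1/4 to every reference token has perplexity 4.
uniform_quarter = perplexity([math.log(0.25)] * 10)

# Assigning higher probability to the reference tokens lowers perplexity.
sharper = perplexity([math.log(0.5)] * 10)
```

Lower is better: a perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among four tokens.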

Baseline
Since our main experiments are with a new dataset (the Twitter Persona Dataset), we first show that our LSTM baseline is competitive with the state of the art on an established dataset, the Twitter Sordoni Dataset.
Our baseline is simply our implementation of the LSTM-MMI of Li et al. (2016), so results should be relatively close to their reported results. Table 2 summarizes our results against prior work. We see that our system actually does better than Li et al. (2016), and we attribute the improvement to a larger training corpus, the use of dropout during training, and possibly to the "conversationalist" nature of our corpus.

Results
We first report performance on the Twitter Persona dataset. Perplexity is reported in Table 3. We observe about a 10% decrease in perplexity for the Speaker model over the standard SEQ2SEQ model. In terms of BLEU scores (Table 4), a significant performance boost is observed for the Speaker model over the standard SEQ2SEQ model, yielding an increase of 21% in the maximum likelihood (MLE) setting and 11.7% in the maximum mutual information (MMI) setting. In line with findings in Li et al. (2016), we observe a consistent performance boost introduced by the MMI objective function over a standard SEQ2SEQ model based on the MLE objective function. It is worth noting that our persona models are more beneficial to the MLE models than to the MMI models. This result is intuitive as the persona models help make Standard LSTM MLE outputs more informative and less bland, and thus make the use of MMI less critical.
For the TV Series dataset, perplexity and BLEU scores are respectively reported in Table 5 and Table 6. As can be seen, the Speaker and Speaker-Addressee models respectively achieve perplexity values of 25.4 and 25.0 on the TV-series dataset, 7.0% and 8.4% lower than the corresponding standard SEQ2SEQ model. In terms of BLEU score, we observe a similar performance boost as on the Twitter dataset, in which the Speaker model and the Speaker-Addressee model outperform the standard SEQ2SEQ model by 13.7% and 10.6%. By comparing the Speaker-Addressee model against the Speaker model on the TV Series dataset, we do not observe a significant difference. We suspect that this is primarily due to the relatively small size of the dataset where the interactive patterns might not be fully captured. Smaller values of perplexity are observed for the Television Series dataset than the Twitter dataset, the perplexity of which is over 40, presumably due to the noisier nature of Twitter dialogues.

Model | Perplexity
Standard LSTM | 27.3
Speaker Model | 25.4 (−7.0%)
Speaker-Addressee Model | 25.0 (−8.4%)
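The relative reductions quoted for the TV series perplexities follow directly from the three reported values (27.3, 25.4, and 25.0). The short check below recomputes them; the helper name `relative_change` is introduced here only for the illustration.

```python
def relative_change(baseline, value):
    """Percentage change of value with respect to baseline (negative means lower)."""
    return 100.0 * (value - baseline) / baseline

standard, speaker, speaker_addressee = 27.3, 25.4, 25.0

speaker_delta = round(relative_change(standard, speaker), 1)              # -7.0
addressee_delta = round(relative_change(standard, speaker_addressee), 1)  # -8.4
```

For example, (25.4 − 27.3) / 27.3 ≈ −0.0696, which rounds to the reported −7.0%.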

Qualitative Analysis
Diverse Responses by Different Speakers Table 7 presents responses generated by persona models in response to three different input questions. We randomly selected 10 speakers (without cherry-picking) from the original Twitter dataset. We collected their user-level representations from a speaker look-up table and integrated them into the decoding models. The model tends to generate specific responses for different people in response to the factual questions.
Table 8 shows responses generated from the Speaker-Addressee Model using the TV-series dataset. Interestingly, we regularly observe that this model is sensitive to the identity of the addressee, generating words specifically targeted at that addressee (e.g., her name). For example, the model produces "Of course, I love you, Emily" in response to an input from Emily. Also, the model generates "Of course I love you. (kisses him)", where the pronoun "him" accurately identifies the gender of the addressee.
Human Evaluation We conducted a human evaluation of outputs from the Speaker Model, using a crowdsourcing service. Since we cannot expect crowdsourced human judges to know or attempt to learn the ground truth of Twitter users who are not well-known public figures, we designed our experiment to evaluate the consistency of outputs associated with the speaker IDs. To this end, we collected 24 pairs of questions for which we would expect responses to be consistent if the persona model is coherent. For example, responses to the questions "What country do you live in?" and "What city do you live in?" would be considered consistent if the answers were "England" and "London" respectively, but not if they were "UK" and "Chicago". Similarly, the responses to "Are you vegan or vegetarian?" and "Do you eat beef?" are consistent if the answers generated are "vegan" and "absolutely not", but not if they are "vegan" and "I love beef". We collected 20 pairs of outputs for randomly selected personas provided by the Speaker Model for each question pair (480 response pairs total). We also obtained the corresponding outputs from the baseline MMI-enhanced SEQ2SEQ system.
Since our purpose is to measure the gain in consistency over the baseline system, we presented the pairs of answers system-pairwise, i.e., 4 responses, 2 from each system, displayed on the screen, and asked judges to decide which of the two systems was more consistent. The position in which the system pairs were presented on the screen was randomized. The two systems were judged on a 5-point zero-sum scale, assigning a score of 2 (-2) if one system was judged more (less) consistent than the other, and 1 (-1) if one was rated "somewhat" more (less) consistent. Ties were assigned a score of zero. Five judges rated each pair and their scores were averaged and remapped into 5 equal-width bins. After discarding ties, we found the persona model was judged either "more consistent" or "somewhat more consistent" in 56.7% of cases. If we ignore the "somewhat more consistent" judgments, the persona model wins in 6.1% of cases, compared with only 1.6% for the baseline model. It should be emphasized that the baseline model is a strong baseline, since it represents the consensus of all 70K Twitter users in the dataset.
Table 9 illustrates how consistency is an emergent property of two arbitrarily selected users. The model is capable of discovering the relations between different categories of location such as London and the UK, Jakarta and Indonesia. However, the model also makes inconsistent response decisions, generating different answers in the second example in response to questions asking about age or major. Our proposed persona models integrate user embeddings into the LSTM, and thus can be viewed as encapsulating a trade-off between a persona-specific generation model and a general conversational model.

Conclusions
We have presented two persona-based response generation models for open-domain conversation generation. There are many other dimensions of speaker behavior, such as mood and emotion, that are beyond the scope of the current paper and must be left to future work.
Although the gains presented by our new models are not spectacular, the systems outperform our baseline SEQ2SEQ systems in terms of BLEU, perplexity, and human judgments of speaker consistency. We have demonstrated that by encoding personas in distributed representations, we are able to capture personal characteristics such as speaking style and background information. In the Speaker-Addressee model, moreover, the evidence suggests that there is benefit in capturing dyadic interactions.
Our ultimate goal is to be able to take the profile of an arbitrary individual whose identity is not known in advance, and generate conversations that accurately emulate that individual's persona in terms of linguistic response behavior and other salient characteristics. Such a capability will dramatically change the ways in which we interact with dialog agents of all kinds, opening up rich new possibilities for user interfaces. Given a sufficiently large training corpus in which a sufficiently rich variety of speakers is represented, this objective does not seem too far-fetched.

Figure 1: Illustrative example of the Speaker Model introduced in this work. Speaker IDs close in embedding space tend to respond in the same manner. These speaker embeddings are learned jointly with word embeddings and all other parameters of the neural model via backpropagation. In this example, say Rob is a speaker clustered with people who often mention England in the training data; then the generation of the token 'england' at time t = 2 would be much more likely than that of 'u.s.'. A non-persona model would prefer generating "in the u.s." if 'u.s.' is more represented in the training data across all speakers.

Table 1: Inconsistent responses generated by a 4-layer SEQ2SEQ model trained on 25 million Twitter conversation snippets.

message | response
Where do you live now? | I live in Los Angeles.
In which city do you live now? | I live in Madrid.
In which country do you live now? | England, you?
Where were you born? | I was born in Canada.
Where are you from? | England, you?
Where did you grow up? | I grew up in Texas.

Table 3: Perplexity for standard SEQ2SEQ and the Speaker model on the Twitter Persona development set.

Table 4: BLEU on the Twitter Persona dataset (1 reference), for the standard SEQ2SEQ model and the Speaker model using as objective either maximum likelihood (MLE) or maximum mutual information (MMI).

Table 5: Perplexity for standard SEQ2SEQ and persona models on the TV series dataset.

Table 6: BLEU on the TV series dataset (1 reference), for the standard SEQ2SEQ and persona models.

Table 7: Responses generated by the baseline (LSTM-MMI) and the Speaker Model for ten randomly selected users, without cherry picking. # indicates poor-quality responses produced by the system.

Table 8: Responses to "Do you love me?" from the Speaker-Addressee model on the TV-series dataset using different addressees and speakers.

message: Do you love me?
baseline: Of course, I love you.

addressee | speaker | response
Emily | Sheldon | Of course, I love you, Emily.
Emily | Penny | Of course, I love you, Emily.
Leonard | Penny | Of course, I love you, Leonard.

Table 9: Examples of speaker consistency and inconsistency generated by the Speaker Model.

User1
message | response
Where is your hometown? | I'm from England.
Where are you from? | I'm from England.
In which city do you live now? | I live in London.
In which country do you live now? | I live in England.
What is your major? | Business. You?
What did you study in college? | I did business studies.
How old are you? | I'm 18.
What is your age? | I'm 18.

User2
message | response
Where is your hometown? | I live in Indonesia.
Where are you from? | England, you?
In which city do you live now? | I live in Jakarta.
In which country do you live now? | I live in Indonesia.
What is your major? | Business, you?
What did you study in college? | Psychology, you?
How old are you? | I'm 18.
What is your age? | I'm 16.