Sketch-Fill-A-R: A Persona-Grounded Chit-Chat Generation Framework

Human-like chit-chat conversation requires agents to generate responses that are fluent, engaging and consistent. We propose Sketch-Fill-A-R, a framework that uses a persona-memory to generate chit-chat responses in three phases. First, it generates dynamic sketch responses with open slots. Second, it generates candidate responses by filling slots with parts of its stored persona traits. Lastly, it ranks and selects the final response via a language model score. Sketch-Fill-A-R outperforms a state-of-the-art baseline both quantitatively (10-point lower perplexity) and qualitatively (preferred by 55% in head-to-head single-turn studies and 20% higher in consistency in multi-turn user studies) on the Persona-Chat dataset. Finally, we extensively analyze Sketch-Fill-A-R's responses and human feedback, and show it is more consistent and engaging by using more relevant responses and questions.


Introduction
Chit-chat is a rich domain that challenges machine learning models to express fluent natural language and to successfully interact with other agents. Chit-chat stands in contrast to goal-oriented dialogue, such as when a customer has the explicit goal of booking a flight ticket. When agents communicate, they each have internal state (e.g., their knowledge, intent) and typically have limited knowledge of the state of other agents (Chen et al., 2017). As a result, human-like chit-chat requires agents to be fluent, engaging and consistent with what has been said and their persona (Zhang et al., 2018).
These requirements make learning generative chit-chat models a complex task. First, given an existing conversation history, there may be a large number of valid responses (Vinyals and Le, 2015). Hence, supervised learning of chit-chat models that cover a large number of topics and styles requires a significant amount of data (Zhou et al., 2018). Second, as conversations progress and more opportunities for contradiction arise, maintaining consistency becomes more difficult (Serban et al., 2016, 2017). Third, engaging chit-chat responses follow conversational structures that are not captured well by perplexity (Dinan et al., 2019; Liu et al., 2016). Indeed, our human user studies show that both consistency and engagingness are only weakly correlated with perplexity, and fluency is not correlated with it at all.
We propose Sketch-Fill-A-R, a dialogue agent framework that can learn to generate fluent, consistent and engaging chit-chat responses. Our key motivation is the hypothesis that human-like chit-chat responses often 1) follow common conversational patterns with insertions of agent-specific traits, and 2) condition explicitly on those persona traits.
Sketch-Fill-A-R decomposes response generation into three phases: sketching, filling and ranking (see Figure 1). First, Sketch-Fill-A-R dynamically generates a sketch response with slots, which enables it to learn response patterns that are compatible with many specific persona traits. Second, it generates candidate responses by filling in slots with words stored in memory. This enables Sketch-Fill-A-R's responses to adhere to its persona. Third, the candidate responses are ranked by perplexity under a pre-trained language model (LM), which encourages the final response (with lowest LM perplexity) to be fluent.
In sum, our contributions are as follows:
• We describe Sketch-Fill-A-R and how its multi-phase generation process encourages fluency, consistency and engagingness.
• We show that Sketch-Fill-A-R significantly improves hold-out perplexity by ∼10 points on the Persona-Chat dataset over state-of-the-art baselines.
• We show Sketch-Fill-A-R is rated higher on conversational metrics and preferred over baselines in single and multi-turn user studies.
• We extensively analyze Sketch-Fill-A-R's response statistics and human feedback, and show that it is more consistent by using a narrower set of responses, and more engaging, by asking more questions than baselines.

Related Work
Chit-chat Dialogue. Dialogue agents such as Amazon Alexa, Apple Siri, and Google Home are commonplace today, and are mainly task-oriented: they help users achieve specific tasks. On the other hand, Microsoft XiaoIce (Zhou et al., 2018) is an example of an undirected chit-chat dialogue agent. Historically, task-oriented dialogue systems were composed of components such as dialogue state tracking and natural language generation (Jurafsky and Martin, 2009). Even now, the natural language generation component often uses hand-crafted templates and rules defined by domain experts that are filled via heuristics (Gao et al., 2019). More recently, task-oriented dialogue systems have been trained end-to-end (Bordes et al., 2016), but these systems have specific user intents they aim to fulfill, and so represent a more constrained task. Early conversational dialogue systems such as ELIZA (Weizenbaum et al., 1966) and Alice (Wallace, 2009) were also based on hand-crafted rules and thus brittle. To alleviate this rigidity, more recent neural seq2seq models (Sutskever et al., 2014) are trained end-to-end (Vinyals and Le, 2015; Sordoni et al., 2015; Serban et al., 2017; Li et al., 2016). To help guide conversation, Ghazvininejad et al. (2018) and Gopalakrishnan et al. (2019) incorporated knowledge-grounded datasets, while Zhang et al. (2018) created the Persona-Chat dataset used in this work. Sketch-Fill-A-R dynamically generates slot sketches and bears resemblance to Wu et al. (2019), which assumed data are structured domain-specific triplets and contexts follow templates. However, Sketch-Fill-A-R does not assume the personas and responses have rigid syntactic structure, and introduces a ranking procedure. In contrast to our sketch-and-fill procedure, Qian et al. (2017) train a model to select a persona trait and decode around the trait. Finally, Welleck et al. (2018) also re-rank by scoring utterances with Natural Language Inference to improve consistency.
Neural Sequence Models. Sketch-Fill-A-R extends a neural encoder-decoder structure (Sutskever et al., 2014) but is agnostic to the chosen form of encoder-decoder. In this work we use recurrent models and attention, which auto-regressively embed and generate sequences, but our framework is general, allowing non-recurrent encoders and decoders like Transformer networks with non-recurrent self-attention (Vaswani et al., 2017; Devlin et al., 2018) to be substituted for the recurrent encoder and decoder.
Sketch-Fill-A-R uses a simple memory module to store words from personas, which act as context for generation. Weston et al. (2014) and Sukhbaatar et al. (2015) introduced learned Key-Value Memory Networks, while Kumar et al. (2016) introduced Dynamic Memory Nets for question-answering via an iterative attention over memory. Also, Sketch-Fill-A-R decodes responses using a re-ranking strategy based on language model scores, which complements strategies in Kulikov et al. (2018).

Sketch-Fill-A-R
Our key motivation is to generate human-like chit-chat responses that are conditioned on persona-relevant information. Sketch-Fill-A-R generates chit-chat using a persona-memory to dynamically generate sketches that capture conversational patterns, and inserting persona-relevant information.
To set notation: capitals $W, V, \ldots$ denote matrices, $i, j, k$ are vector-matrix indices and $x, y, \ldots$ denote vectors. The model input at time $t$ is $x_t$ and the output at time $u$ is $y_u$. We denote the conversation by $x^c_t$ and persona trait words by $x^p_t$. Both input and output words $x_t, y_u \in \{0, 1\}^V$ are 1-hot vectors, where $V$ denotes the vocabulary size. The vocabulary contains all unique words, punctuation and special symbols (e.g., EOS, @persona). $x_{0:T}$ denotes a sequence $(x_0, \ldots, x_T)$.
Formally, we aim to learn a response generation model that predicts words $y_u$ using a probability distribution $P(y_{0:U} | x_{0:T}; \theta)$ over sequences of $T$ words and $N$ persona traits with $R$ rare words. Here $U$ is the output sequence length and $\theta$ are the model weights. We use deep neural networks, a model class that has recently seen success in language generation tasks.
Sketch-Fill-A-R uses several components to generate sketch responses:
• An encoder $h^e_{0:T} = \mathrm{Enc}(x_{0:T}; \theta)$ that computes hidden representations $h^e_t$ of the input.
• A memory module that stores all rare words from persona traits (constructed by removing stop words).
• A language model $P^{LM}(x_{t+1} | x_{0:t}; \theta)$ that computes a distribution over next words.
• A sketch decoder that synthesizes both the encoded input and memory readouts, and predicts the next word $y_u$ in the sketch response.

Sketch Response Generation
Encoder. We instantiate both encoder and decoder using recurrent neural networks. In this work, we use LSTMs (Hochreiter and Schmidhuber, 1997), although other choices are possible (Elman, 1990). The encoder computes hidden states $h_{0:T} \in \mathbb{R}^{d_{hid}}$ auto-regressively from $e(x_t)$, the word-embedding representations of the raw input tokens.
Memory Module. Sketch-Fill-A-R selects a subset of rare words $x^p_r$ from the persona traits by removing stop-words, punctuation, and other symbols. After encoding the input dialogue, Sketch-Fill-A-R does a memory readout using the final conversation encoder hidden state $h^{conv}_T$ as a query. In the readout, $r$ is a vector index over the rare word memory, $\sigma$ is a softmax activation function creating attention weights $p_i \in \mathbb{R}^{d_{hid}}$, and $C^k$ are trainable embedding matrices where $C^k \in \mathbb{R}^{V \times d_{hid}}$.
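As an illustration of the memory readout, the following NumPy sketch attends over embedded rare words with the conversation state as the query. The readout equations are not reproduced in this text, so the multi-hop form used here (softmax attention on hop k, values from the next embedding matrix, query updated additively, in the style of end-to-end memory networks) is an assumption, as are the name `memory_readout`, the number of hops and the random toy embeddings. The attention weights here have one entry per rare word.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def memory_readout(query, rare_word_ids, C):
    """Assumed multi-hop readout over the rare-word memory.

    query:         (d_hid,) final conversation encoder state h^conv_T
    rare_word_ids: (R,) vocabulary ids of the rare persona words
    C:             list of K+1 embedding matrices, each of shape (V, d_hid)
    """
    q = query
    attention = []
    for k in range(len(C) - 1):
        keys = C[k][rare_word_ids]        # (R, d_hid) embeddings for hop k
        values = C[k + 1][rare_word_ids]  # (R, d_hid) embeddings for hop k+1
        p = softmax(keys @ q)             # (R,) attention over rare words
        q = q + p @ values                # additive query update
        attention.append(p)
    return q, attention

rng = np.random.default_rng(0)
V, d_hid, hops = 50, 8, 3
C = [rng.normal(size=(V, d_hid)) for _ in range(hops + 1)]
h_conv_T = rng.normal(size=d_hid)
rare_word_ids = np.array([3, 17, 42])
h_mem, attention = memory_readout(h_conv_T, rare_word_ids, C)
```

The returned `h_mem` plays the role of the memory readout that is later combined with the encoder state to initialize the decoder.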

Attention Decoder
The decoder is an LSTM which recursively computes hidden states $h^d_u$ that are mapped into a distribution over output words. At decoding time $u + 1$ the decoder computes the next hidden state $h^d_{u+1}$ using the previous predicted word $y_u$ and decoder hidden state $h^d_u$, in addition to attention over the context of the response (the previous utterances and the agent's persona traits). The decoder projects $[h^e_T, h^{mem}]$ down to size $d_{hid}$ and uses it as the initial hidden state of the decoder. $W_{emb} \in \mathbb{R}^{d_{hid} \times V}$ is the transpose of the encoding embedding matrix and is used to convert the decoding context to a word. The decoding context $c_u$ augments the decoder hidden state $h^d_u$ with attention vectors $c^{conv}_u$ over encoded conversation hidden states and $c^{pers}_u$ over encoded persona hidden states for additional information. Here $f$ is a tanh, $W_{ac} \in \mathbb{R}^{3 d_{hid} \times d_{hid}}$, $W_a \in \mathbb{R}^{d_{hid} \times d_{hid}}$ and $\sigma$ is the softmax activation function.
In Equations 9 and 10 the softmax is over the encoder time dimension and $\langle \cdot, \cdot \rangle$ is an inner product.
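The decoding context can be sketched in NumPy as follows. The shapes of `W_a`, `W_ac` and `W_emb` follow the text, but the exact scoring form (an inner product between each encoder state and `W_a @ h_dec`) and the function names are assumptions made for illustration; the weights are random toy values.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attend(h_dec, enc_states, W_a):
    """Attention vector over encoder states; softmax is over the time axis."""
    scores = enc_states @ (W_a @ h_dec)   # (T,) inner products <h_t, W_a h_dec>
    weights = softmax(scores)             # (T,)
    return weights @ enc_states           # (d_hid,) weighted sum of states

def decoding_context(h_dec, conv_states, pers_states, W_a, W_ac):
    """c_u = tanh([h_dec; c_conv; c_pers] W_ac), with W_ac of shape (3d, d)."""
    c_conv = attend(h_dec, conv_states, W_a)
    c_pers = attend(h_dec, pers_states, W_a)
    stacked = np.concatenate([h_dec, c_conv, c_pers])  # (3 * d_hid,)
    return np.tanh(stacked @ W_ac)                     # (d_hid,)

rng = np.random.default_rng(0)
d_hid, vocab = 4, 10
conv_states = rng.normal(size=(5, d_hid))   # 5 encoded conversation steps
pers_states = rng.normal(size=(3, d_hid))   # 3 encoded persona steps
h_dec = rng.normal(size=d_hid)
W_a = rng.normal(size=(d_hid, d_hid))
W_ac = rng.normal(size=(3 * d_hid, d_hid))
W_emb = rng.normal(size=(d_hid, vocab))

c_u = decoding_context(h_dec, conv_states, pers_states, W_a, W_ac)
word_probs = softmax(c_u @ W_emb)           # distribution over output words
```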

Inference Reranking Strategy
Sketch-Fill-A-R trains the sketch-decoder outputs (Equation 7) by minimizing cross-entropy loss with ground truths $y^*_u$. However, during inference, Sketch-Fill-A-R uses an iterative generate-and-score approach to produce the final response:
1. Generate sketch responses with beam search that may contain @persona tags.
2. For each sketch with tags, select the persona $i^*$ with the highest attention weight $w_{u^*, i^*}(h^c_T)$ from the first sketch tag location $u^*$, and construct $B$ candidate responses by filling each @persona slot with words selected from $i^*$.
3. Compute the perplexity $s_b$ of all $B$ candidate responses using a pre-trained language model.
4. Choose the response $b^* = \arg\min_b s_b$ with the lowest LM perplexity as the final response.
For Sketch-Fill variants that do not use reranking to fill slots, we follow the methodology of (Wu et al., 2019) in using a memory pointer network in order to fill slots. For detail, see the Appendix.
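The fill-and-rank steps above can be illustrated with a small, self-contained sketch. Everything here is hypothetical scaffolding: the `BigramLM` class is a toy add-one-smoothed bigram model standing in for the pre-trained language model, candidates are formed by trying every combination of persona words in the slots, and the corpus and sketch are invented. Only the overall shape (fill slots, score by perplexity, keep the minimum) follows the procedure described.

```python
import itertools
import math
from collections import Counter

SLOT = "@persona"

def fill_candidates(sketch, persona_words):
    """Fill every @persona slot with each combination of persona words."""
    tokens = sketch.split()
    n_slots = tokens.count(SLOT)
    candidates = []
    for choice in itertools.product(persona_words, repeat=n_slots):
        words = iter(choice)
        filled = [next(words) if t == SLOT else t for t in tokens]
        candidates.append(" ".join(filled))
    return candidates

class BigramLM:
    """Toy add-one-smoothed bigram model (a stand-in, not a pre-trained LM)."""

    def __init__(self, corpus):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in corpus:
            toks = ["<s>"] + sentence.split()
            self.unigrams.update(toks)
            self.bigrams.update(zip(toks, toks[1:]))
        self.vocab_size = len(self.unigrams)

    def perplexity(self, sentence):
        """exp of the average negative log-probability per predicted token."""
        toks = ["<s>"] + sentence.split()
        log_prob = 0.0
        for prev, cur in zip(toks, toks[1:]):
            p = (self.bigrams[(prev, cur)] + 1) / (self.unigrams[prev] + self.vocab_size)
            log_prob += math.log(p)
        return math.exp(-log_prob / (len(toks) - 1))

def rerank(sketch, persona_words, lm):
    """Return the filled candidate with the lowest LM perplexity."""
    return min(fill_candidates(sketch, persona_words), key=lm.perplexity)

corpus = ["i love to play soccer", "i love my dog", "we play soccer on weekends"]
lm = BigramLM(corpus)
best = rerank("i love to play @persona", ["soccer", "dog", "weekends"], lm)
```

In this toy setting the candidate ending in "soccer" wins because "play soccer" is the only continuation seen in the corpus; a real language model would supply far richer fluency scores.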

Empirical Validation
To validate Sketch-Fill-A-R, we first show that it achieves better supervised learning performance than baselines on a chit-chat dialogue dataset.

Persona-Chat Dataset
We trained Sketch-Fill-A-R to generate single-turn agent responses on the Persona-Chat dataset (Zhang et al., 2018), which contains 10,907 dialogues. Here, a dialogue consists of multiple turns: a single turn contains the utterance of a single agent. We processed this dataset into training examples that each consist of the conversation history $x^c_t$, the set of persona traits $x^p_t$ of the model, and the ground truth sketch response $y_u$. This process yielded 131,438 training examples. Rare words were identified by removing all punctuation and stop words from the set of persona traits (see Appendix for more information). Ground truth sketch responses were then constructed by replacing all rare word instances in ground truth responses with @persona tags.
Language Model Pre-training. Sketch-Fill-A-R uses a Transformer-based GPT (Radford et al., 2018) pre-trained on the Books text corpus (Zhu et al., 2015) to rank candidate responses with filled @persona slots according to their LM-perplexity scores. For model details, see the Appendix.
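The construction of ground truth sketches can be illustrated as follows. This is a minimal sketch under stated assumptions: the stop-word set is a tiny hand-written placeholder (not the list used in the experiments), tokenization is plain whitespace splitting, and the helper names `rare_words` and `to_sketch` are hypothetical.

```python
import string

# Placeholder stop-word list, for illustration only.
STOP_WORDS = {"i", "my", "a", "the", "to", "is", "and", "am", "in", "of", "have", "like"}
SLOT = "@persona"

def rare_words(persona_traits, stop_words=STOP_WORDS):
    """Collect persona words that remain after dropping punctuation and stop words."""
    rare = set()
    for trait in persona_traits:
        for tok in trait.lower().split():
            tok = tok.strip(string.punctuation)
            if tok and tok not in stop_words:
                rare.add(tok)
    return rare

def to_sketch(response, rare):
    """Replace every rare-word occurrence in a response with the @persona tag."""
    out = []
    for tok in response.lower().split():
        core = tok.strip(string.punctuation)
        out.append(SLOT if core in rare else tok)
    return " ".join(out)

persona = ["I like mountain biking.", "My dog is a beagle."]
rare = rare_words(persona)
sketch = to_sketch("i have a beagle and i go biking often", rare)
```

With this toy persona, the rare words are "mountain", "biking", "dog" and "beagle", so the example response becomes a sketch with two open slots.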

Experimental Setup. We compared 4 variations of Sketch-Fill-A-R with a strong baseline:
• Key-Value Memory Network (KVMemNet) (Zhang et al., 2018)
• Sketch-Fill (SF)
• Sketch-Fill-A: SF + attention
• Sketch-Fill-R: SF + reranking
• Sketch-Fill-A-R: SF + attention + reranking
Zhang et al. (2018) showed not only that models trained on Persona-Chat outperform models trained on other dialogue datasets (movies, Twitter) in engagingness but also that KVMemNet outperforms vanilla Seq2Seq on Persona-Chat. As a result we omit comparison with Seq2Seq. KVMemNet is the strongest of the few public baselines available to compare against on chit-chat with personas. All Sketch-Fill-A-R models use language model reranking (see Section 3.2). All input tokens $x^c_t$, $x^p_t$ were embedded using GloVe embeddings of size 300 (see the Appendix). All models were trained by minimizing loss on the ground truth sketch response $y^*_{0:U}$:
$$\mathcal{L}(\theta) = -\sum_{u=0}^{U} \langle y^*_u, \log P(y_u | x_{0:T}, y_{0:u-1}; \theta) \rangle. \quad (11)$$
For training details, see the Appendix. The results are shown in Table 1. Sketch-Fill models outperform KVMemNet on validation perplexity, while using significantly fewer weights than KVMemNet. This suggests the structure of Sketch-Fill models fits well with chit-chat dialogue.
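To make the training objective and the reported metric concrete, the following sketch computes a sequence cross-entropy as a sum of inner products between 1-hot targets and log-probabilities, and the corresponding perplexity as the exponential of the per-token average. The function names and the two-step toy distribution are assumptions for illustration only.

```python
import numpy as np

def sequence_cross_entropy(log_probs, targets):
    """Negative sum over steps of <one_hot(target_u), log P(y_u | ...)>.

    log_probs: (U, V) array of log-distributions over the vocabulary
    targets:   (U,) array of ground-truth word ids
    """
    one_hot = np.eye(log_probs.shape[1])[targets]   # (U, V) 1-hot targets
    return float(-np.sum(one_hot * log_probs))

def perplexity(log_probs, targets):
    """exp of the average per-token cross-entropy."""
    return float(np.exp(sequence_cross_entropy(log_probs, targets) / len(targets)))

# Toy example: two decoding steps over a three-word vocabulary.
probs = np.array([[0.5, 0.25, 0.25],
                  [0.25, 0.5, 0.25]])
targets = np.array([0, 1])
loss = sequence_cross_entropy(np.log(probs), targets)
ppl = perplexity(np.log(probs), targets)
```

Here each target word receives probability 0.5, so the loss is 2 log 2 and the perplexity is 2.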

User Study and Qualitative Analysis
Although Sketch-Fill models perform well quantitatively, a crucial test is to evaluate how well they perform when judged by human users on conversational quality, which is not explicitly captured by perplexity. We performed single and multi-turn dialogue user studies to assess the quality of Sketch-Fill-A-R, rated along several dimensions:
• Fluency: whether responses are grammatically correct and sound natural.
• Consistency: whether responses do not contradict the previous conversation.
• Engagingness: how well responses fit the previous conversation and how likely the conversation would continue.
Our definition of engagingness includes relevance, defined in pragmatics and relevance theory (Wilson and Sperber, 2002; Grice, 1991) as a statement leading to positive cognitive effect. However, an engaging statement may be ironic (Sperber and Wilson, 1981), humorous, or specific to individuals. We also explore which qualities of Sketch-Fill-A-R's outputs are correlated with human ratings and perplexity scores. Our results suggest that:
• Conditioning on persona-memory provides more consistency.
• Sketch-Fill-A-R poses more questions, which correlates with higher engagingness.
• Responses need to be fluent in order to be consistent or engaging. In addition, more consistent responses are more likely to be engaging.
• Perplexity is not correlated with high-quality responses.
Table 3: User study ratings of single-turn responses (score range where 1 is low and 5 is high). Each row shows ratings from a head-to-head experiment where responses from Sketch-Fill-A-R variants and KVMemNet over 100 different conversations were shown to 5 human raters. Sketch-Fill with reranking shows a small gain over KVMemNet on all qualitative metrics, but the variance in the ratings is high. Sketch-Fill variants without reranking perform much worse, due to their responses not being fluent, despite achieving low perplexity (see Figure 1).

Single-turn Experiments
The studies were conducted on 100 random examples sampled from the validation set, where each example was rated by 5 judges. Each example contained a conversation with multiple lines of history and a single KVMemNet or Sketch-Fill response. Judges came from English-speaking countries and were calibrated with examples of good/bad responses in all metrics before judging.
The study was executed in two settings: fine-grained, where the judges rated the responses on a scale from 1 (lowest) to 5 (highest) for each of the mentioned dimensions, and binary, where they chose which response best fit the conversation.
The results of the fine-grained survey are presented in Table 3, where each row corresponds to a separate head-to-head experiment in which the KVMemNet model was paired with one of the versions of Sketch-Fill-A-R. The study showed small gains on all metrics for all Sketch-Fill-A-R variations; however, the variance of results was high. We believe that this artifact could be caused by a number of factors, including subjective preferences of raters and potential ambiguities in the experiment description. We notice that Sketch-Fill and Sketch-Fill-A reach lower perplexity values than KVMemNet, but comparatively have lower evaluations across the board. Conversely, ranking models like Sketch-Fill-R and Sketch-Fill-A-R have higher scores on all metrics. We observe that the difference is due to the ranker giving more fluent outputs via better selection of persona words to use. Table 4 shows the results of the human study in a binary setting. In these experiments the base and attention-augmented versions of Sketch-Fill-A-R outperformed KVMemNet by a clear margin.
The following subsections present in-depth analyses of the human study. They focus on the Sketch-Fill-A-R model, since it yielded both the best perplexity and user study results.

Correlation between ratings
To study and better understand the reasoning behind the ratings assigned by annotators, we look at the correlation between the different dimensions in which responses were scored. Figure 5 shows Kernel-Density-Estimation plots of the data points and associated Pearson correlation coefficients ρ. The data shows weak (ρ = 0.397) to moderate (ρ = 0.462) correlation between fluency and consistency, and fluency and engagingness ratings respectively. The data shows a ρ value of 0.670 between engagingness and consistency ratings, suggesting strong correlation between those dimensions. See appendix for more detailed information. The numbers were obtained on human ratings of the Sketch-Fill-A-R model, but comparable numbers were also obtained for the KVMemNet model. The mentioned results follow intuition, as fluency of a response is a notion that can be easily defined and identified. On the other hand, consistency and engagingness are ambiguous, and (possibly) partially overlapping, concepts.
To associate quantitative metrics with human ratings, we computed the correlation between perplexity and the ratings across different dimensions. The study showed no correlation for fluency (ρ = -0.015), and weak correlations for consistency (ρ = -0.190) and engagingness (ρ = -0.147).
Table 8: Multi-turn user study ratings (score range 1 (lowest) to 5 (highest)). We collected 30 conversations with 20 turns between human raters and models. KVMemNet is more fluent, but Sketch-Fill-A-R is more engaging and significantly more consistent.
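For reference, a Pearson correlation coefficient between two rating dimensions can be computed as in the sketch below. The rating lists are invented placeholders, not data from the user study, and the helper name `pearson` is hypothetical.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Made-up 1-5 ratings for six responses (illustration only).
fluency = [5, 4, 4, 3, 5, 2]
consistency = [4, 4, 3, 3, 5, 1]
rho = pearson(fluency, consistency)
```

The same value is returned by `np.corrcoef(fluency, consistency)[0, 1]`, which may be more convenient when several dimensions are compared at once.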

Model vocabulary analysis
To assess the diversity of responses generated by the models, we calculated the percentage of unique n-grams and full responses present in the model outputs. Table 2 presents these values for KVMemNet and Sketch-Fill-A-R computed on the full validation set. The numbers show that the KVMemNet model clearly outperforms our model in terms of generating diverse and unique outputs by a factor of 3-4x. However, we hypothesize that this additional diversity may lead to lower engagingness scores.
Consistency over time. In order to evaluate the model's capacity to stay consistent with its previous statements, and thus implicitly its ability to utilize information present in the chat history, we compared how the consistency rating changed as the number of lines of the conversation increased. Figure 4 visualizes this metric both for our model and KVMemNet. In the case of both models, the consistency decreases as the chat history gets longer, indicating that models have problems keeping track of their previous statements. When analyzing the linear trend we noticed that the decrease in performance is slower for the Sketch-Fill-A-R model. We hypothesize that this effect can be partially caused by the high diversity of sequences generated by the KVMemNet, which in turn affects the model's ability to generate consistent conversation.
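One way to compute such diversity statistics is sketched below. The precise definition used for Table 2 is not given in the text, so this assumes "percentage unique" means the number of distinct items divided by the total number of items; the function names and the three toy responses are illustrative.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def percent_unique_ngrams(responses, n):
    """Distinct n-grams as a percentage of all n-grams in the responses."""
    all_ngrams = [g for r in responses for g in ngrams(r.split(), n)]
    if not all_ngrams:
        return 0.0
    return 100.0 * len(set(all_ngrams)) / len(all_ngrams)

def percent_unique_responses(responses):
    """Distinct full responses as a percentage of all responses."""
    return 100.0 * len(set(responses)) / len(responses)

responses = ["i like dogs", "i like cats", "i like dogs"]
unigram_pct = percent_unique_ngrams(responses, 1)
bigram_pct = percent_unique_ngrams(responses, 2)
response_pct = percent_unique_responses(responses)
```

In the toy example there are 3 distinct bigrams among 6, so `bigram_pct` is 50.0, and 2 of the 3 responses are distinct.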
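A linear trend of this kind can be estimated with a least-squares fit, as in the sketch below. The rating values are invented to illustrate a slower and a faster decline; they are not the study's measurements, and `linear_trend` is a hypothetical helper.

```python
import numpy as np

def linear_trend(history_lengths, ratings):
    """Least-squares line through (history length, mean rating) points."""
    slope, intercept = np.polyfit(history_lengths, ratings, deg=1)
    return float(slope), float(intercept)

# Made-up mean consistency ratings at increasing history lengths.
lengths = [1, 2, 3, 4, 5]
ratings_slow_decline = [4.5, 4.4, 4.3, 4.2, 4.1]
ratings_fast_decline = [4.5, 4.2, 3.9, 3.6, 3.3]

slope_slow, _ = linear_trend(lengths, ratings_slow_decline)
slope_fast, _ = linear_trend(lengths, ratings_fast_decline)
```

Comparing the two slopes (both negative, the first closer to zero) is the kind of check that supports a statement such as "the decrease is slower for one model".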

Effect of question responses
See et al. (2019) note that for a conversation to be engaging, responses in chit-chat dialogue should be a mix of statements and questions, where the model inquires about certain traits and information of the other agent. We expand on this by evaluating the effect that a question's presence in the response has on the ratings coming from the judges. The results are presented in Figure 4c. The study showed that there is a strong correlation between the model asking a question and the users rating the response as more engaging. Asking questions has a small but positive influence on engagingness and fluency.
Figure 4: ... Sketch-Fill-A-R (middle). As conversation length increases (more dialogue turns) both models become less consistent, but KVMemNet degrades faster than Sketch-Fill-A-R. Right: impact of a response containing a question on human ratings. Responses including questions tend to receive higher human ratings.
To further analyze this aspect, we measured the frequency of questions in the set of 100 responses coming from the Sketch-Fill-A-R and KVMemNet models. We found that our model produced 49 question responses out of which 25 had both a statement and a question. In the same setting the KVMemNet produced 15 questions out of which only 1 contained a statement and a question. This insight could explain the gains on the engagingness ratings found by our human study.
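Counts of this kind can be obtained with a simple heuristic such as the one below, which labels a response by whether its sentences end in a question mark. The sentence-splitting rule, the labels and the example responses are assumptions for illustration; the text does not describe how questions were identified.

```python
import re
from collections import Counter

def classify(response):
    """Label a response as 'question', 'statement', or 'both'."""
    # Split after sentence-final punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    has_question = any(s.endswith("?") for s in sentences)
    has_statement = any(not s.endswith("?") for s in sentences)
    if has_question and has_statement:
        return "both"
    return "question" if has_question else "statement"

# Invented responses in the space-separated punctuation style of the dataset.
responses = [
    "i love hiking . do you like the outdoors ?",
    "what do you do for work ?",
    "i am a teacher .",
]
counts = Counter(classify(r) for r in responses)
```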

Multi-turn User Study
To evaluate both models in the more challenging multi-turn setting, we collected 30 conversations that lasted 20 turns, between each model and human users. Users were asked to score their conversations with the models on a scale from 1 (lowest) to 5 (highest) across the same dimensions as in the single-turn experiments. Table 8 shows the human ratings for both Sketch-Fill-A-R and KVMemNet. Both were judged as less fluent (scores ≈ 3) than in the single-turn case (scores ≥ 4). This is likely due to the models having to respond to a range of conversation histories unseen during training.
Notably, Sketch-Fill-A-R beat KVMemNet on consistency by a significantly larger margin (3.72 vs 2.15) than in the single-turn setting. This suggests that Sketch-Fill-A-R benefits from conditioning response generation on its persona-memory thus adhering more closely to responses compatible with its persona.
Further, Sketch-Fill-A-R is more engaging. This suggests that in the multi-turn setting, there also is a positive correlation between engagingness and consistency as in the single-turn case (see Appendix): consistent models can be more engaging as well. Table 7 shows an example of KVMemNet's inconsistency. While every model utterance is fluent individually, KVMemNet noticeably contradicts itself in the context of previous utterances and frequently ignores the human responses (e.g., "i do not have any myself" after "my little girl"). We believe the lack of structure inherent in models built on vanilla Seq2Seq makes KVMemNet prone to this mistake. Table 6 shows Sketch-Fill-A-R conducts a more engaging conversation, with pertinent responses and questions. However, this structure can restrict Sketch-Fill-A-R, as sketches may be filled with incorrect persona traits (e.g., "i love papaya food."). See the Appendix for more examples.

Discussion and Future Work
In our study we have identified several paths for future work. First, our results reinforce that perplexity does not strongly correlate with human judgment of response quality. It is crucial to develop an automated metric that correlates well with human judgment as human evaluation is expensive, time consuming, and prone to inconsistencies. Secondly, despite outperforming other models in the multi-turn dialogue setting on consistency and engagement, our model has not reached human-like fluency. In order to demonstrate complex higher-level traits such as empathy, models must first master these lower-level abilities. Finally, correct use of rare words and proper nouns leads to higher human scores. Existing models are unable to deal with out-of-vocabulary tokens and rare words gracefully, and incorporation of commonsense via methods like external knowledge bases will be useful.

Ethical Implications
During experiments, we identified a number of ethical implications for future work. The Persona-Chat dataset was noted by some raters to contain potentially inappropriate statements (e.g., "my wife spends all my money") and is based in US culture (e.g., food, music, cars, names). It also lacked content to fail gracefully when it didn't have an appropriate response (e.g., "I'm sorry I don't understand," "I don't know"). As such, learned model responses were occasionally insensitive and confusing to human users.

Model Architecture and Training Parameters
In all models we used single-layer LSTMs with hidden sizes of 300 throughout, and used GloVe embeddings of size 300. All Sketch-and-Fill models were trained with Adam initialized with learning rate 0.0001. We used batch sizes of 32. In single-turn experiments we used beam sizes of 7, and in multi-turn experiments we used beam sizes of 10. Dropout was applied for all models with probability 0.4.

Persona Preprocessing
Persona traits were pre-processed to remove stop-words. The stop-word list was initialized with the defaults from NLTK and augmented with the most commonly seen words in persona traits.

Number of Persona Tags
Training: 124,298 words were converted to persona tags out of 1,505,395 words total. Validation: 8,307 words were converted to persona tags out of 92,586 words total.
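As a quick check on scale, the counts above can be turned into the fraction of words that become persona tags. The helper name `tag_fraction` is hypothetical; the four numbers are the ones reported above.

```python
def tag_fraction(tagged_words, total_words):
    """Percentage of words that were converted to persona tags."""
    return 100.0 * tagged_words / total_words

train_pct = tag_fraction(124_298, 1_505_395)
valid_pct = tag_fraction(8_307, 92_586)
print(f"training: {train_pct:.2f}%  validation: {valid_pct:.2f}%")
```

Both splits come out at roughly 8 to 9 percent, so slots make up a small but non-trivial share of the words the sketch decoder has to predict.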

Global-to-Local Memory Pointer Networks
Wu et al. (2019) construct a global memory distribution that acts as a mask over the memory and is concatenated with encoded dialogue history and memory information before being used to initialize the decoder's hidden state. They also construct a local memory pointer that identifies the word to retrieve. These auxiliary tasks are trained using cross-entropy loss. The global pointer label is defined as a vector $G^{label} = (g^l_0, \ldots, g^l_i)$, where $g^l_i$ is 1 if the word is expected in $y^*_t$ and 0 otherwise. The global pointer is computed using the same notation as in Section 3.1 and is used as a mask on the memory module before the decoding procedure: $e_i = e_i \times g_i$. The local pointer label is used at every time step to identify which memory index (and thus word) to point to. If at $y^*_t$ a persona trait is expected, $L^{label}_t$ holds the corresponding index, and is $m$ otherwise.
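The masking step can be illustrated with the NumPy sketch below. The formula for the global pointer itself is not reproduced in this text, so the form used here (a sigmoid of the inner product between a query vector and each memory embedding) is an assumption, as are the function names and the random toy values. Only the mask $e_i = e_i \times g_i$ and the use of a cross-entropy loss against 0/1 labels come from the description above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def global_pointer(query, memory):
    """Assumed form: one gate in (0, 1) per memory slot.

    query:  (d,) encoded dialogue state
    memory: (M, d) memory embeddings e_i
    """
    return sigmoid(memory @ query)

def binary_cross_entropy(g, labels, eps=1e-12):
    """Mean binary cross-entropy between gates and 0/1 global pointer labels."""
    return float(-np.mean(labels * np.log(g + eps) + (1 - labels) * np.log(1 - g + eps)))

rng = np.random.default_rng(0)
memory = rng.normal(size=(4, 6))          # 4 memory slots, dimension 6
query = rng.normal(size=6)
g = global_pointer(query, memory)         # (4,) global pointer
masked_memory = memory * g[:, None]       # e_i = e_i x g_i
labels = np.array([1.0, 0.0, 0.0, 1.0])   # 1 if the word is expected in the response
loss = binary_cross_entropy(g, labels)
```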

Language Model Pretraining
OpenAI GPT consists of a 12-layer Transformer and is pre-trained on the BooksCorpus dataset.

Visualizing Model Attention
We visualize the three sets of attention weights in our model: the context weights in Figure 7, and memory weights and persona trait weights in Figure 8. Figure 7's x-axis shows a conversation ending with a question reflected by the user about hobbies. The response has high attention weights on hobbies and the user's own garden hobby in the previous context. Figure 8 (right) shows that in response to this hobbies question, attention is first distributed over hobby-related personas before converging on the mountain biking persona trait over time. Finally, we observe in Figure 8 (left) that the memory attention is most heavily weighted on coffee, which may explain why the coffee persona begins with such high weights.