Like hiking? You probably enjoy nature: Persona-grounded Dialog with Commonsense Expansions

Existing persona-grounded dialog models often fail to capture simple implications of given persona descriptions, something which humans are able to do seamlessly. For example, state-of-the-art models cannot infer that interest in hiking might imply love for nature or longing for a break. In this paper, we propose to expand available persona sentences using existing commonsense knowledge bases and paraphrasing resources to imbue dialog models with access to an expanded and richer set of persona descriptions. Additionally, we introduce fine-grained grounding on personas by encouraging the model to make a discrete choice among persona sentences while synthesizing a dialog response. Since such a choice is not observed in the data, we model it using a discrete latent random variable and use variational learning to sample from hundreds of persona expansions. Our model outperforms competitive baselines on the PersonaChat dataset in terms of dialog quality and diversity while achieving persona-consistent and controllable dialog generation.


Introduction
Persona-grounded dialog generation is a 'chit-chat' dialog setup where a dialog agent is expected to communicate based on a given profile (Zhang et al., 2018). Many recent works have focused on a popular benchmark dataset for this task, PERSONA-CHAT (Zhang et al., 2018), which provides personas as sets of sentences along with each dialog (example in Figure 1). However, a careful analysis of state-of-the-art (SOTA) models reveals that they often struggle to respond to contexts that do not closely match the given persona sentences, even when the implications would be obvious to a human.
For example, in Figure 1, the user asks the bot an indirect question related to one of its persona sentences: I am an animal activist. SOTA1, which concatenates all persona sentences with the dialog history and finetunes a pre-trained generative model (e.g., GPT2) (Wolf et al., 2019), fails to infer the implied commonsense from the dialog context and conditions on an incorrect persona. SOTA2, which separately selects a persona sentence given the dialog history, manages to choose the correct persona but merely copies it as the final response. Neither approach is in general capable of responding to context that goes beyond what is explicitly mentioned in the available persona sentences, which limits consistent and interesting conversation. The goal of our model is to understand that being 'an animal activist' may imply that the person wants 'to make a difference' via their activism towards animals, and to synthesize a context-consistent and engaging response.
In this paper, we focus on making persona-grounded chatbots more consistent with their personas and the implicit dialog context. We present a framework to expand available persona sentences to their commonsense implications by using an existing commonsense knowledge base or paraphrasing resources (see Section 3). We endow our dialog model with these expansions directly, rather than requiring the model to learn them from scratch in order to be context-consistent. We find that expansions derived from a commonsense knowledge base provide more engaging contextual information than other expansion sources.
We further propose a Common Sense and Persona Aligned Chatbot (COMPAC) which models the choice over the expanded persona set via a discrete latent random variable (see Section 4), yielding fine-grained persona grounding. Even though it is tractable to marginalize over all expansions, doing so would require a forward pass through the dialog generator for each outcome, which is prohibitively slow during training. Instead, to accommodate hundreds of persona expansions, we train the model by optimizing a lower bound on the log-likelihood. We use amortized variational inference, approximating the true posterior with an inference network that provides a useful inductive bias. In particular, we show that our Bayesian formulation of fine-grained persona grounding is essential: simply providing the expanded knowledge does not help the model generate better responses.
We also outperform competitive baselines on all dialog quality metrics as well as in human evaluations, which find COMPAC to be engaging and coherent. We demonstrate that COMPAC learns to be consistent with the dialog context through accurate persona grounding, especially in the presence of commonsense expansions. Finally, we show that our model reflects a change in response generation when a grounding persona is modified, indicating the possibility of controllable generation.

Persona Grounded Dialog
We use a popular benchmark dataset, PERSONA-CHAT (Zhang et al., 2018), for our persona-grounded dialog generation task. It contains 10,907 dialogs between pairs of speakers where each speaker follows their own persona; 968 dialogs are used for validation and 1,000 for testing. Each speaker is described by 3-5 persona sentences (e.g., 'I love the beach', 'My mother is a medical doctor'). Out of 1,155 total unique personas, 100 are used for validation and 100 for testing.
The task of persona-grounded dialog generation is: given a dialog history H and grounding persona sentences S, we must predict the next utterance x (Summary of notations in Table 1). Hence a dialog model should maximize the likelihood p(x|H, S). From the PERSONA-CHAT dataset, we use 131,438 utterances for training the dialog model, 15,602 for validation, and 15,024 for testing.

Persona Expansion
Persona sentences used in persona-grounded dialogs are instances of world events that often imply real-world consequences or richer information. For example, 'I love surfing' naturally implies that the person might be 'adventurous' or 'loves the outdoors'. Similarly, it also means that the person wants 'to go to the beach' regularly. Inferring these expansions from the original fact is non-trivial without additional commonsense knowledge. Zhang et al. (2018) found evidence that human-written interpretations of a persona sentence, obtained via rephrasing, often provide novel information for persona grounding. Obtaining such expansions by manual rewriting is expensive, so here we explore two automatic ways to generate them at scale and evaluate each separately on the downstream dialog modeling task.

COMET
COMET (Bosselut et al., 2019) is a framework that generates rich and diverse commonsense expansions of a given world event. It is a pre-trained GPT2 (Radford, 2018) model finetuned on a pre-existing commonsense knowledge graph such as ATOMIC, and it can generate novel nodes (events) and edges (relations), as seen in Figure 2c. Specifically, ATOMIC provides tuples belonging to nine relation types spanning cause-effect interrelations between events: oEffect, oReact, oWant, xAttr, xEffect, xIntent, xNeed, xReact, and xWant, where the prefix 'x' indicates an effect or cause on the person and 'o' denotes the same on others. While we tried COMET finetuned on an alternative commonsense knowledge base (e.g., ConceptNet), not all of the resulting expansions were appropriate for describing a persona, mainly because persona sentences are event-like ('I love to go to the beach') as opposed to concepts such as 'beach'. For more details on COMET and ATOMIC, we refer the reader to the original papers.
We use the COMET framework to generate expansions for each persona sentence along the nine relation types that ATOMIC provides. We obtain multiple samples by decoding from COMET via beam search, yielding more diverse and unique expansions, as shown in Figure 2c. We preprocess these expansions by adding suitable prefixes to make them resemble the original persona sentences. For example, expansions for the xWant and xAttr relations are prefixed with 'I want' and 'I am' respectively. For each persona sentence, we generate 5 expansions per relation, i.e., 5 × 9 = 45 expansions per persona sentence in total.
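As an illustration, this prefixing step can be sketched as follows. The full relation-to-prefix mapping is our assumption beyond the two prefixes ('I want', 'I am') stated in the text, and the helper name and sample expansions are hypothetical:

```python
# Hypothetical prefixes for the nine ATOMIC relation types; only the xWant
# and xAttr entries are confirmed by the paper, the rest are illustrative.
RELATION_PREFIXES = {
    "xWant": "I want",
    "xAttr": "I am",
    "xIntent": "I intend",
    "xNeed": "I need",
    "xEffect": "I",
    "xReact": "I feel",
    "oWant": "Others want",
    "oEffect": "Others",
    "oReact": "Others feel",
}

def expand_persona(comet_outputs: dict) -> list:
    """comet_outputs maps a relation type to a list of decoded beam samples
    (5 per relation in the paper). Returns prefixed persona-like sentences."""
    expansions = []
    for relation, beams in comet_outputs.items():
        prefix = RELATION_PREFIXES[relation]
        for phrase in beams:
            expansions.append(f"{prefix} {phrase.strip()}.")
    return expansions

# e.g., for the persona 'I am an animal activist':
sample = {"xWant": ["to help animals", "to make a difference"],
          "xAttr": ["caring"]}
expanded = expand_persona(sample)
```

With 5 beams for each of the 9 relations, this yields the 45 expansions per persona sentence described above.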

Paraphrasing
To explore alternative sources of commonsense expansions beyond COMET, we consider paraphrasing persona sentences. Paraphrases of a sentence convey almost the same meaning to a listener as the original. Paraphrases often use synonymous phrases or manipulate the word-syntax of the original sentence, which implicitly involves both context comprehension and world knowledge (Zeng et al., 2019). We obtain them in two ways:

Paraphrase Network. To generate paraphrases at scale, we use an off-the-shelf paraphrasing system based on back-translation (Federmann et al., 2019) with pre-trained language translation models. We use En-Fr and Fr-En pre-trained translation models as the components for back-translation. While we tried other language pairs, the En-Fr pair proved the most satisfactory based on a qualitative analysis of 500 samples. We generate 5 paraphrases per persona sentence, which readily provides more lexical and syntactic variants, as shown in Figure 2b.

Manual Paraphrasing. To compare with other expansions, we reuse the manually written revised versions of persona sentences provided with PERSONA-CHAT (Zhang et al., 2018), though these are limited to one paraphrase per sentence. We call them revised for short (see Figure 2a).

Common sense and Persona Aligned Chatbot (COMPAC)
To infuse commonsense context into persona-grounded dialog generation, we imbue our dialog model with the expanded persona set instead of only the original personas S. These persona expansions, however, amount to hundreds of new sentences as opposed to only a few given persona sentences, which makes it infeasible to encode them with a single transformer, as was done in prior work (Wolf et al., 2019). Additionally, encoding all persona sentences as a single text input leads to a lack of interpretability, i.e., it is not clear which persona sentence was used by the model in generating a particular response. Instead, we propose COMPAC: Common Sense and Persona Aligned Chatbot, which makes a fine-grained choice of a persona sentence to generate the target response. Let C denote the list of expanded personas derived from S (including S itself). We further add a null persona ∅ to C, since some utterances condition purely on the dialog context. We are interested in modeling the conditional

p(x|H, C) = Σ_z p(z|H, C) p(x|z, H, C),

where z ∈ {1, 2, . . . , |C|} is a discrete latent random variable, unobserved in the data. Given the dialog history H, we first sample a particular persona sentence C_z from a prior network p_θ(z|H, C) (see Figure 3). Next, as depicted in Figure 3, the dialog response x is sampled from a generator network p_φ(x|H, C_z) by conditioning on the history H and the chosen persona sentence C_z. In the generative model described above, the latent variable z is a discrete random variable that points to a single persona sentence.

Table 1: Summary of notation used in the paper.
e(·): mean of RoBERTa subword embeddings as an encoder; t_k: expansion type for the k-th expansion; f_i: i-th feature function for the prior network, i ∈ {1, 2, 3}; θ: parameters of the prior network p_θ(z|H, C); φ: parameters of the generator network p_φ(x|H, C_z); α: parameters of the inference network q_α(z|x, H, C).
This decision (of conditioning on a single persona sentence) was based on the observation that most dialog responses in the datasets under consideration are relevant to only one persona sentence. It is possible to allow for multiple persona sentences by defining z to pick a subset of |C| persona sentences instead of picking a single sentence. We leave this as a possible future extension.

Persona Choice Prior
The dialog history H can hold cues about which persona sentence is applicable in the given context. For example, in Figure 3 the history suggests that 'following fashion trends' can be a consequence of 'being fashionable'. We encode both the dialog history H and each persona sentence C_k by averaging RoBERTa subword embeddings, giving e(H) and e(C_k). We use the Hugging Face implementation of RoBERTa with roberta-base as the pretrained model. We then parameterize the prior p_θ(z|H, C) as a log-linear model with the following features:

Dialog history. We obtain f_1(H, C_k), a scalar feature computed as a bilinear product e(H)^T W e(C_k), to align the persona sentences with the dialog history.

Expansion types. Each k-th persona expansion corresponds to an expansion type t_k. In the case of COMET, these types are the nine commonsense relations provided by ATOMIC (see Section 3.1). For paraphrased expansions, we annotate each as type paraphrase and the original persona sentences as original. We consider two additional features based on expansion types: (a) f_2(t_k), which represents a global preference over the relation type, embedded via a type embedding layer; and (b) f_3(t_k, H), which appends the expansion type embedding to the dialog history encoding e(H), followed by a linear layer, to obtain a real-valued score for a history-specific preference over the expansion type.
The dimension of the expansion type embedding was set to 5. Finally, the prior model can be represented concisely as

p_θ(z = k | H, C) ∝ exp( f_1(H, C_k) + f_2(t_k) + f_3(t_k, H) ).
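Numerically, the log-linear prior can be sketched as below. The encoders are stand-ins for mean-pooled RoBERTa embeddings, and the bilinear matrix `W`, the type-embedding table, and the `f3` projection `V` are illustrative random parameters, not the trained ones:

```python
import numpy as np

# Minimal sketch of the persona-choice prior
#   p_theta(z=k | H, C) ∝ exp(f1(H, C_k) + f2(t_k) + f3(t_k, H)).
rng = np.random.default_rng(0)
d, n_types, n_personas = 8, 11, 4        # embedding dim, expansion types, |C|

W = rng.normal(size=(d, d))              # bilinear weights for f1
type_bias = rng.normal(size=n_types)     # global type preference, f2
V = rng.normal(size=(n_types + d,))      # linear layer for f3 on [type emb; e(H)]
type_emb = np.eye(n_types)               # toy one-hot type embeddings

def prior(e_H, e_C, t):
    """e_H: (d,) history encoding; e_C: (n, d) persona encodings; t: (n,) type ids.
    Returns the categorical distribution over the n candidate personas."""
    f1 = e_C @ W @ e_H                   # bilinear history/persona alignment
    f2 = type_bias[t]                    # global preference per expansion type
    f3 = np.concatenate([type_emb[t], np.tile(e_H, (len(t), 1))], axis=1) @ V
    scores = f1 + f2 + f3
    scores -= scores.max()               # numerical stability before softmax
    p = np.exp(scores)
    return p / p.sum()

p = prior(rng.normal(size=d), rng.normal(size=(n_personas, d)),
          np.array([0, 3, 3, 10]))
```

The softmax normalization over all candidates is what makes the features combine log-linearly, as in the equation above.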

Generator Network
Following prior work (Wolf et al., 2019), we use pre-trained GPT2 (Radford, 2018) (a Transformer with 12 layers, hidden size 768, and 12 heads; gpt2-small) to generate dialog responses given the dialog history H, with the selected persona sentence C_z prepended to it. When C_z is the null persona, an empty string is prepended. We further append the target response x to the combined context (C_z; H) and feed the tokenized sequence to GPT2. To distinguish between persona tokens, history tokens, and target response tokens, we use segment indicators {Persona, Speaker1, Speaker2}, whose embeddings are learned via a separate segment embedding layer; the segment embedding is added to the corresponding token embedding in the model's input layer. To obtain the conditional likelihood p_φ(x|H, C_z), we compute the cross-entropy loss only over the target tokens. Wolf et al. (2019) also leveraged incorrect responses from PERSONA-CHAT as negative samples in an auxiliary loss that encourages the correct candidate to obtain the highest likelihood among the candidates. However, we did not find any improvement from this loss in COMPAC.
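A minimal sketch of this input layout, using a toy whitespace tokenizer in place of the GPT2 BPE tokenizer (the function name and the numeric segment ids are illustrative):

```python
# Lay out one training example for the generator: persona C_z, then the
# history H, then the target x, with a segment id per token and a mask
# marking which tokens enter the cross-entropy loss.
SEGMENTS = {"Persona": 0, "Speaker1": 1, "Speaker2": 2}

def build_input(persona, history, target):
    """persona: str (empty string for the null persona);
    history: list of (speaker, utterance) pairs;
    target: str spoken by Speaker2 (the bot).
    Returns (tokens, segment_ids, lm_mask)."""
    tokens, segments, lm_mask = [], [], []
    for w in persona.split():                       # prepended persona
        tokens.append(w); segments.append(SEGMENTS["Persona"]); lm_mask.append(0)
    for speaker, utt in history:                    # dialog history
        for w in utt.split():
            tokens.append(w); segments.append(SEGMENTS[speaker]); lm_mask.append(0)
    for w in target.split():                        # target: only these are scored
        tokens.append(w); segments.append(SEGMENTS["Speaker2"]); lm_mask.append(1)
    return tokens, segments, lm_mask

toks, segs, mask = build_input("i am an animal activist",
                               [("Speaker1", "do you volunteer ?")],
                               "yes , at the local shelter")
```

In the real model, each segment id is looked up in the learned segment embedding layer and added to the token embedding before the first Transformer layer.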

Learning and Inference
Our training data D consists of instances of dialog history H and ground truth dialog responses x. We train our model parameters θ and φ to maximize the likelihood of the target dialog response x given the dialog history: log p(x|H, C; θ, φ) totalled over D.
Since the discrete random variable z is unobserved in the training data, we must marginalize over z to compute the desired likelihood:

log p(x|H; θ, φ) = log Σ_z p_θ(z|H) p_φ(x|z, H),

where we drop C from the conditionals for simplicity.
Inference Network. The number of persona expansions is typically in the range 150-250, so it is computationally expensive to marginalize over the entire selection space of z during training. We instead optimize a variational lower bound (ELBO) of log p(x|H; θ, φ):

log p(x|H; θ, φ) ≥ E_{q_α(z|x,H)}[ log p_φ(x|z, H) ] − KL( q_α(z|x, H) || p_θ(z|H) ),

where the inference network q_α(z|x, H) computes the approximate posterior (Kingma and Welling, 2014). In our initial experiments, we observed that using an inference network leads to better perplexity than using samples from the prior.
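On toy numbers, one can verify that this ELBO indeed lower-bounds the exact log-marginal; all probabilities below are made up for illustration:

```python
import numpy as np

# ELBO = E_q[log p_phi(x|z,H)] - KL(q || p_theta)  vs.
# exact log-marginal = log sum_z p_theta(z|H) p_phi(x|z,H), for 3 personas.
prior_p = np.array([0.7, 0.2, 0.1])   # p_theta(z|H)
lik = np.array([0.01, 0.2, 0.05])     # p_phi(x|z,H) for each z
q = np.array([0.1, 0.8, 0.1])         # q_alpha(z|x,H), approximate posterior

log_marginal = np.log((prior_p * lik).sum())
elbo = (q * np.log(lik)).sum() - (q * np.log(q / prior_p)).sum()
```

The gap between the two quantities is exactly KL(q || true posterior), so the bound is tight when the inference network matches the true posterior.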
The architecture of the inference network is similar to that of the prior network: a log-linear model. Along with the features based on the dialog history and expansion types, we include one additional scalar feature, a bilinear product e(x)^T W' e(C_k) between the ground-truth response x and the persona sentence, both encoded with RoBERTa embeddings, to align the persona choice with the target utterance.
Optimization. The parameters of the generator network (φ) and the prior network (θ) can be trained directly via back-propagation. Since z is a discrete latent variable, we use REINFORCE (Williams, 1992) to train the inference network parameters α. The REINFORCE estimator, however, often suffers from high variance. To reduce the variance, we found it useful to (1) use a moving-average baseline (Zhao et al., 2011); and (2) regularize the prior network by penalizing the entropy of its output categorical distribution. To avoid collapse of the KL term, we use KL-annealing (Bowman et al., 2016), linearly increasing the weight of the KL term from 0 to 1 as training progresses.
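A single REINFORCE step with the moving-average baseline can be sketched as below; the reward (the generator log-likelihood of the sampled persona choice), the decay value, and all shapes are illustrative, not the paper's exact training loop:

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_grad(logits, reward, baseline, decay=0.99):
    """One-sample REINFORCE gradient w.r.t. the categorical logits of
    q_alpha(z|x,H). Returns (gradient, sampled z, updated baseline)."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    z = rng.choice(len(probs), p=probs)           # sample a persona choice
    advantage = reward - baseline                 # baseline reduces variance
    grad_logp = -probs
    grad_logp[z] += 1.0                           # d log q(z) / d logits
    baseline = decay * baseline + (1 - decay) * reward
    return advantage * grad_logp, z, baseline

grad, z, b = reinforce_grad(np.zeros(5), reward=-2.3, baseline=-2.5)
```

Because the score-function gradient sums to zero over the logits, only the relative advantage of the sampled choice moves the distribution, which is why a good baseline matters.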
Decoding. At decoding time, we first sample k from the prior p_θ(z|H, C) and then feed C_k to the generator network. Following previous work (Wolf et al., 2019), we use nucleus sampling (Holtzman et al., 2020) with p = 0.95 to decode the final response from the probabilities produced by the generator. We also found that high-temperature sampling from the prior often leads to more diverse generations.
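Nucleus sampling itself is straightforward to sketch: keep the smallest set of highest-probability tokens whose cumulative mass reaches p, renormalize, and sample from that set (the function name is ours):

```python
import numpy as np

def nucleus_sample(probs, p=0.95, rng=np.random.default_rng(0)):
    """Top-p (nucleus) sampling over a next-token distribution."""
    order = np.argsort(probs)[::-1]              # tokens by descending probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1         # smallest prefix covering mass p
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum() # renormalize within the nucleus
    return rng.choice(kept, p=kept_probs)

vocab_probs = np.array([0.5, 0.3, 0.15, 0.04, 0.01])
token = nucleus_sample(vocab_probs, p=0.9)       # low-probability tail is excluded
```

At p = 0.9, the last two tokens (mass 0.05) can never be sampled, which is how nucleus sampling trades off diversity against degenerate low-probability continuations.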

Experiments
We conduct our experiments around the following questions: (1) Do persona expansions help generate high-quality and diverse responses? (2) Does COMPAC achieve accurate persona grounding given a dialog context? (3) Does COMPAC enable persona-consistent and controllable generation? Hyperparameter details are in Appendix §A.

Baselines
To demonstrate the efficacy of COMPAC, we compare it with three competitive baselines on the PERSONA-CHAT dataset, including a finetuned GPT2 (Wolf et al., 2019), which obtained the best automatic metric in the ConvAI2 competition.
We also consider a minimal version of COMPAC, COMPAC-original, which uses only the original personas, for a direct comparison with other model architectures that do the same. Furthermore, to justify fine-grained persona grounding as the means of utilizing persona expansions, we consider versions of GPT2 trained with each of the expansion strategies: GPT2-revised, GPT2-paraphrase, and GPT2-COMET. To show that COMPAC can work with persona expansions from various sources, we compare versions of COMPAC trained with paraphrase-based expansions: COMPAC-revised and COMPAC-paraphrase. By default, COMPAC denotes the model trained with COMET expansions.

Comparison of Dialog Quality
We measure perplexity for language modeling performance, and BLEU-1 and BLEU-2 (Papineni et al., 2002) scores between generated and gold utterances to measure the fidelity of the generated responses. Given our goal of generating engaging responses with novel information, we also consider the diversity of the generated responses, measured by D-1 and D-2 (the percentage of distinct uni- and bi-grams, respectively) (Li et al., 2016). Table 2 shows that, when trained on the original personas, COMPAC outperforms the three competitive baselines on all quality metrics, indicating the efficacy of our architecture. Moreover, when combined with persona expansions, we observe a 3-8 point decrease in perplexity and a large improvement in both BLEU and diversity scores, confirming that COMPAC successfully leverages the persona expansions to improve dialog quality. COMPAC trained with COMET expansions achieves the best performance in terms of both fidelity and diversity, which shows that COMET expansions help the model respond to implicit context with commonsense and explore novel information. With revised personas, however, both COMPAC and GPT2 provide only marginal gains, mirroring the observation of Zhang et al. (2018). Finally, we observe a gradual degradation in performance when we trivially finetune GPT2 with paraphrase and COMET expansions. Note that GPT2 could, in principle, implicitly learn to focus on a single persona attribute; that COMPAC nevertheless performs better suggests that fine-grained persona grounding acts as a useful inductive bias for effectively utilizing larger expansion sets.
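The D-1/D-2 diversity metrics can be computed as follows (a standard implementation of distinct-n in the spirit of Li et al., 2016; the helper name is ours):

```python
def distinct_n(responses, n):
    """Fraction of distinct n-grams over all n-grams in a set of responses."""
    ngrams, total = set(), 0
    for r in responses:
        toks = r.split()
        grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
        ngrams.update(grams)
        total += len(grams)
    return len(ngrams) / total if total else 0.0

outs = ["i love nature", "i love hiking"]
d1 = distinct_n(outs, 1)   # 4 distinct unigrams out of 6 total
d2 = distinct_n(outs, 2)   # 3 distinct bigrams out of 4 total
```

Because repeated n-grams are counted once in the numerator but every time in the denominator, generic repetitive responses drive the score down.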

Human Evaluation for Dialog Generation
Automatic evaluation of dialog systems is notoriously unreliable (Liu et al., 2016; Novikova et al., 2017), so such systems should also be evaluated by human users. Hence, we perform pairwise comparisons between responses generated by our best system, COMPAC trained on COMET expansions, and responses generated by four strong baselines: GPT2, GPT2-COMET, COMPAC-original, and COMPAC-paraphrase (the best COMPAC model with paraphrase expansions). We also include the gold responses for comparison. We conduct a human evaluation on 100 test examples along three aspects critical for practical use: (1) Fluency, whether the generated output is fluent English; (2) Engagement, whether the generated response is engaging or interesting; and (3) Relevance, whether the generated output is relevant to the dialog history. More details of the evaluation are in Appendix §B. Table 4 shows that human annotators found responses generated by COMPAC trained with COMET expansions more engaging than responses from all the baselines, and even than the gold responses, by statistically significant margins. This confirms our hypothesis that COMET expansions help add novel content. Human judges also found that, despite a significant drop in perplexity, COMPAC was not significantly more fluent than COMPAC-original and COMPAC-paraphrase, indicating similar language modeling performance. The inter-annotator agreement, as measured by Cohen's kappa (Cohen, 1960), was 0.62, 0.71, and 0.73 for fluency, engagement, and relevance, respectively.

Fine-grained Persona Grounding
Next, we investigate the extent of COMPAC's ability to ground response generation in a fine-grained persona choice via a probing experiment. Specifically, we want to measure whether our model can choose a coherent persona from the available persona sentences given the dialog context. Note that in persona-grounded chit-chat, not all utterances are tied to a persona; we therefore use the Dialogue Natural Language Inference (DNLI) dataset and collect persona-utterance pairs that belong to an entailment relation. This results in a subset of 4,613 utterances with associated ground-truth persona sentences in our test set. Next, we obtain a persona sentence by taking the argmax over the prior p_θ(z|H, C) as well as over the inference network q_α(z|x, H, C) of our COMPAC models, and calculate accuracy against the ground-truth persona. For models that use expanded personas, we trace the retrieved expansion back to its original persona for the accuracy calculation. Table 5 shows that COMPAC with COMET achieves the most accurate persona grounding, suggesting that the inference network approximates the true posterior better when a commonsense persona is available for grounding. For the prior, an entailment accuracy better than random chance (1/5) supports our choice of a history-conditioned prior network over a uniform prior.

Human Evaluation. Since DNLI does not cover expanded personas, we conduct a human evaluation to judge the relevance of a chosen persona expansion sampled from the inference network. Specifically, we ask: Is this knowledge relevant to the given dialog history?, with options 'Yes', 'No', and 'Uncertain'. The inter-annotator agreement, as measured by Cohen's kappa, was 0.76.

Table 6: Conditional generation performance on the PERSONA-CHAT test set, showing the similarity between generated responses and grounding persona sentences. We omit GPT2-based models since they do not select a particular persona sentence for grounding.
Again, Table 5 shows that models with COMET expansions choose the most relevant persona sentences, corroborating our claim from the persona entailment experiments. On average, COMPAC with COMET expansions chooses an expanded persona 87% of the time among all non-null persona choices. This drops to 62% for COMPAC-paraphrase. In contrast, COMPAC-revised tends to select an original persona over an expansion more often.

Controllable Generation
Controllable generation of persona-grounded dialog can help the dialog agent generalize to new persona details simply by changing the grounding in the conditional generator. While controllable text generation with a desired attribute has gained interest recently (Dathathri et al., 2020; Kong et al., 2019), we investigate controlling generation with a desired persona and measure the performance of the conditional generator. For this, we report a set of knowledge-overlap metrics (the unigram recall/precision/F1 scores) from Dinan et al. (2019b) and the BERT score (Zhang et al., 2020) for semantic similarity between the generated responses and the retrieved persona. Table 6 shows that conditional generation is strongest when COMPAC is trained with COMET, suggesting that commonsense expansions are more appropriate to the dialog context in influencing response generation. Next, we create a diagnostic dataset of 100 examples where we manually edit the persona, by changing an entity in a persona sentence or swapping the selected persona expansion with another relevant one (see examples in Table 7), to directly test whether the generated response changes accordingly. The examples in Table 7 suggest that COMPAC supports controllable generation with contextually modified personas.

Qualitative analysis. Table 3 shows responses from different models for a sample dialog context. Qualitatively, we find that COMPAC with COMET expansions responds to the context with commonsense, using novel content from a commonsense expansion (being Hindu → to learn about Hinduism), where other responses remain generic or incoherent. In Table 8, we illustrate responses generated by COMPAC along with the underlying persona choice sampled from the prior network. These cases show that COMPAC successfully chooses an original or an expanded persona sentence, as appropriate, but can also default to the null persona (∅), which leads to a bland response.
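The unigram overlap metrics can be sketched as follows; stopword filtering, which such metrics typically apply, is omitted for brevity, and the helper name is ours:

```python
def unigram_overlap(response, persona):
    """Unigram precision/recall/F1 between a generated response and its
    grounding persona sentence, over sets of whitespace tokens."""
    r, p = set(response.split()), set(persona.split())
    if not r or not p:
        return 0.0, 0.0, 0.0
    overlap = r & p
    precision = len(overlap) / len(r)
    recall = len(overlap) / len(p)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

prec, rec, f1 = unigram_overlap("i want to protect animals",
                                "i am an animal activist")
```

A higher overlap indicates that the generator is actually drawing content from the persona it conditioned on, rather than producing a generic response.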

Related Work
Building personalized dialog agents has become a popular task, thanks largely to Zhang et al. (2018), who studied it extensively with the new PERSONA-CHAT dataset, later run as a challenge (Dinan et al., 2019a). In this setting, the dialog agent is seeded with a predefined persona in the form of several sentences of textual description, mirroring casual human conversation, which often draws on individual personal experiences and facts. Recent works focus on improving persona-grounded dialog generation performance (Wolf et al., 2019; Mazaré et al., 2018; Bao et al., 2019) as well as persona consistency in generated dialog (Song et al., 2019a). Bao et al. (2019) proposed a reinforcement-learning-based framework that promotes informativeness and persona consistency via personal knowledge exchange. Xu et al. (2020), possibly the work closest to ours, used plausible topical keywords related to the available persona facts, obtained with a neural topic model, to explore beyond the given knowledge. We instead focus on obtaining commonsense implications of the given persona as text snippets, which are more expressive than topical keywords.

Persona-grounded dialog generation is a special case of knowledge-grounded dialog generation. Knowledge grounding in dialog has many real-world applications that are well studied in the recent literature (Zhou et al., 2018; Ghazvininejad et al., 2018; Dinan et al., 2019b). In this work we use fine-grained grounding/selection over personas, which performed better than encoding the entire persona for each response. Such fine-grained selection has been found useful in prior work on text generation, such as dialog and image captioning (Jhamtani and Berg-Kirkpatrick, 2018). For dialog generation, contextual knowledge selection has been successfully applied in prior work (Parthasarathi and Pineau, 2018). Specifically, Zhao et al. (2017) and later Song et al. (2019b) proposed a conditional-VAE framework that learns a latent context from the dialog history to guide knowledge selection.
Finally, a few recent works have focused on augmenting grounding with commonsense knowledge, with successful applications in open-domain topical dialog generation (Ghazvininejad et al., 2018; Moon et al., 2019), story generation (Mao et al., 2019), and sarcasm generation (Chakrabarty et al., 2020). In this work, we extend this effort to persona-grounded dialog generation by augmenting the grounding personas with commonsense knowledge.

Conclusion
In this work, we showed that expanding persona sentences with commonsense helps a dialog model generate high-quality and diverse persona-grounded responses. Moreover, we found that fine-grained persona grounding is crucial for effectively conditioning on a large pool of commonsense persona expansions, and it further provides additional controllability in conditional generation.
While our expansions are limited by the performance of the COMET or paraphrase systems, we envision future work that trains the dialog model end-to-end along with the expansion generation. We would also like to extend the prior network to sample more than one persona sentence, by expanding the sample space of the discrete random variable, to generate more interesting responses.

A Implementation Details
We obtain the PERSONA-CHAT dataset from the ParlAI repository. For COMET expansions, we use the code released by the authors of COMET (Bosselut et al., 2019). We performed BPE tokenization with the GPT2Tokenizer.
Network architectures. For the generator network, we use GPT2 (a Transformer with 12 layers, hidden size 768, and 12 heads; gpt2-small), following the state-of-the-art model (Wolf et al., 2019) from the ConvAI2 competition. Wolf et al. (2019) also leveraged incorrect responses from PERSONA-CHAT as negative samples in an auxiliary loss that encourages the correct candidate to obtain the highest likelihood among the candidates; however, we did not find any improvement from this loss in COMPAC. COMPAC has a total of 164 million parameters, whereas the GPT2-based baseline has 124 million.
Hyperparameters. Following Wolf et al. (2019), we use a history size of 2 (i.e., 4 previous utterances). We use the AdamW optimizer (Loshchilov and Hutter, 2017) with a learning rate of 6.25e-5, decayed linearly by a factor of 10^-1 per epoch. The baseline in REINFORCE is a discounted moving average with a decay ratio of 0.99. The REINFORCE loss coefficient was set to 0.8 and the language modeling loss coefficient to 1.0.
Training. Each model converged in 3 epochs on average with batch size 4 on a TITAN X (Pascal) GPU, taking 12 hours in total. During training, we monitor only the perplexity on the validation set as an early-stopping criterion.

B Evaluation
Automatic Evaluation. For dialog quality evaluation, perplexity is measured by adapting the official evaluation protocol of the ConvAI2 challenge.
To assess persona grounding, we use the Dialogue Natural Language Inference (DNLI) dataset, which contains persona-utterance pairs under three classes: entailment, neutral, and contradiction. We gather all the entailment pairs across all splits, resulting in 44,000 persona-utterance pairs. We then match these against the PERSONA-CHAT test set to obtain 4,613 utterances associated with a ground-truth persona.
For assessing conditional generation performance, we use the BERT score from its publicly available repository.

Human Evaluation. For human evaluation, we hired two Anglophone annotators (Lifetime HIT acceptance > 80%) for every human-evaluated test generation. Figure 4 shows a sample question for a human judge in the pairwise comparison of a response generated by COMPAC and a response generated by a baseline, along the three aspects: fluency, engagement, and coherence.
For persona grounding, we used a similar setup: we provided a dialog history and a sampled expansion and asked 'Is this knowledge relevant to the given dialog history?', with three options, 'Yes', 'No', and 'Uncertain' (see Figure 5). As in the previous human evaluation study, we hired two Anglophone annotators (Lifetime HIT acceptance > 80%) for each question. The inter-annotator agreement, as measured by Cohen's kappa, was 0.76.

C Generation Examples
Tables 9 to 12 present generations from COMPAC for sampled dialog histories, with comparisons across baselines.