Generate, Delete and Rewrite: A Three-Stage Framework for Improving Persona Consistency of Dialogue Generation

Maintaining a consistent personality in conversations is quite natural for human beings, but is still a non-trivial task for machines. The persona-based dialogue generation task is thus introduced to tackle the personality-inconsistency problem by incorporating explicit persona text into dialogue generation models. Despite the success of existing persona-based models in generating human-like responses, their one-stage decoding framework can hardly avoid generating inconsistent persona words. In this work, we introduce a three-stage framework that employs a generate-delete-rewrite mechanism to delete inconsistent words from a generated response prototype and further rewrite it into a personality-consistent one. We carry out evaluations with both human and automatic metrics. Experiments on the Persona-Chat dataset show that our approach achieves good performance.


Introduction
In an open-domain conversation scenario, two speakers conduct open-ended chit-chat from the initial greetings and usually come to focus on their characteristics, such as hobbies, pets, and occupations, in the course of the conversation. Humans can easily carry out conversations according to their personalities (Song et al., 2019a), but fulfilling this task is still a challenge for recent neural dialogue models (Welleck et al., 2019).
One main issue is that these models are typically trained over millions of dialogues from different speakers, and neural dialogue models have a propensity to mimic the response with the maximum likelihood in the training corpus (Li et al., 2016b), which results in frequent inconsistency in responses (Zhang et al., 2018). Another issue is the user-sparsity problem (Qian et al., 2017) in conventional dialogue corpora (Serban et al., 2015): some users have very little dialogue data, which makes it difficult for neural models to learn meaningful user representations (Li et al., 2016b).

Figure 1: A common problem for persona-based dialogue models is that they can hardly avoid the generation of inconsistent persona words. Although the model generates a response that looks good, it is an inconsistent one. With further rewriting, the model can focus more on improving persona consistency.
To alleviate the above issues, Zhang et al. (2018) introduced the Persona-Chat dataset to build more consistent dialogue models. Different from conventional dialogue corpora, this dataset endows dialogue models with predefined personas, in the form of textually described profiles (as shown in the first line of Figure 1). The persona-based dialogue models also adopt an encoder-decoder architecture and are enhanced with persona encoding components, such as memory networks (Sukhbaatar et al., 2015) and latent variables (Kingma and Welling, 2013). These models turn out to produce more consistent responses than the persona-free ones (Zhang et al., 2018; Song et al., 2019a).
Despite the successful application of the encoder-decoder framework in persona-based dialogue models, one concern is that they lack extra attention to the key persona information. The model will learn to minimize the overall loss of every decoded word, but this may lead to the neglect of the key personas: changing one persona-related word may not significantly affect the overall loss, but could turn a good response into a totally inconsistent one. As shown in Stage 1 of Figure 1, a single improper word, "Colorado", renders the response inconsistent.
A desirable solution should be able to capture personas and automatically learn to avoid and refine inconsistent words before the response is delivered. In this paper, we present a Generate-Delete-Rewrite framework, GDR, to mitigate the generation of inconsistent personas. We design three stages specifically for the goal of generating persona-consistent dialogues: the first Generate stage adopts a transformer-based generator to produce a persona-based response prototype; the second Delete stage employs a consistency matching model to identify inconsistencies and delete (by masking) the inconsistent words from the prototype; finally, in the Rewrite stage, a rewriter polishes the masked prototype to be more persona-consistent. To examine the effectiveness of our GDR model, we carried out experiments on the publicly available Persona-Chat dataset (Zhang et al., 2018).
We summarize the main contributions as follows: • A three-stage end-to-end generative framework, GDR, was proposed for the generation of persona consistent dialogues.
• A matching model was integrated into the generation framework to detect and delete inconsistent words in the response prototype.
• Experimental results show the proposed approach outperforms competitive baselines on both human and automatic metrics.

Related Work
End-to-end dialogue generation approaches are a class of models for building open-domain dialogue systems, which have seen growing interest in recent years (Vinyals and Le, 2015; Shang et al., 2015; Serban et al., 2016; Li et al., 2016c; Zhao et al., 2017; Li et al., 2017). These dialogue models adopted recurrent units in a sequence-to-sequence (seq2seq) fashion (Sutskever et al., 2014). Since the transformer has been shown to be on par with or superior to recurrent units (Vaswani et al., 2017), some dialogue models began to take advantage of this architecture for better dialogue modeling (Dinan et al., 2018; Su et al., 2019).
Besides the advancements in dialogue models, the emergence of new dialogue corpora has also contributed to the research field. Zhang et al. (2018) introduced the Persona-Chat dataset, which attaches explicit persona texts to each dialogue. Based on a seq2seq model and a memory network, they further proposed a model named Generative Profile Memory Network for this dataset. Following this line, Yavuz et al. (2019) designed the DeepCopy model, which leverages a copy mechanism to incorporate persona texts. Song et al. (2019a) integrated persona texts into the Per-CVAE model for generating diverse responses. However, persona-based models still face the inconsistency issue (Welleck et al., 2019). To model persona consistency, Welleck et al. (2019) annotated the Persona-Chat dataset and introduced the Dialogue Natural Language Inference (DNLI) dataset, which converts the detection of dialogue consistency into a natural language inference task (Bowman et al., 2015).
Personalized dialogue generation is an active research field (Li et al., 2016b; Qian et al., 2017; Zhang et al., 2018; Zheng et al., 2019a,b; Zhang et al., 2019). In parallel with this work, Song et al. (2019b) leveraged adversarial training to enhance the quality of personalized responses. Liu et al. (2020) incorporated mutual persona perception to build a more explainable (Liu et al., 2019) dialogue agent. Other relevant work lies in the area of multi-stage dialogue models (Lei et al., 2020). Some retrieval-guided dialogue models (Weston et al., 2018; Wu et al., 2019; Cai et al., 2019a,b; Su et al., 2020) also adopted a multi-stage framework, but the difference from our work is obvious: we generate the prototype rather than retrieve one. A high-quality retrieved response is not always available, especially under the persona-based setting.

Overview
In this work, we consider learning a generative dialogue model to ground the response with explicit persona.We focus on the persona consistency of single-turn responses, and we leave the modeling of multi-turn persona consistency as future work.
Formally, we use uppercase letters to represent sentences and lowercase letters to represent words. Let Q = q_1, q_2, ..., q_n denote the input query with n words, and let P = {P^(1), P^(2), ..., P^(k)} be the k different persona texts, where P^(i) is the i-th persona text with m_i words. Our goal is to learn a dialogue model M to generate a response Ŷ = ŷ_1, ŷ_2, ..., ŷ_l that is consistent with the persona, based on both the query Q and the persona P. In abbreviation, Ŷ = M(Q, P).
More concretely, as shown in Figure 2, the proposed model M consists of three parts: 1) Prototype generator G. This component takes persona texts and the query as input and generates a response prototype for further editing. It adopts an encoder-decoder architecture (Sutskever et al., 2014), with the transformer (Vaswani et al., 2017) applied in both the encoder and the decoder.
2) Consistency matching model D. This model is designed to detect and delete those words in the prototype that could lead to inconsistency. We train this model in a natural language inference fashion on the DNLI dataset (Welleck et al., 2019).
3) Masked prototype rewriter R. The rewriter learns to rewrite the response prototype into a more consistent one. It is also a transformer decoder, adopting a similar architecture to the decoder of G. The difference lies in that it takes the masked prototype, rather than the query, as input.

Generate: Prototype Generator
We apply the encoder-decoder structure to build our prototype generator G. For the encoder, we use the self-attentive encoder of the transformer. For the decoder, built upon the transformer decoder, we propose a tuple-interaction mechanism to model the relations among persona, query, and response.

Self-Attentive Encoder
As the persona P is composed of several sentences, we unfold all the words in P into a single sequence. Then we use the self-attentive encoder (Vaswani et al., 2017) to compute the representations of the persona texts and the query separately. The multi-head attention is defined as MultiHead(Q, K, V), where Q, K, V are the query, key, and value, respectively. The encoder is composed of a stack of N_G identical layers. Take the first-layer encoding of P for example:

    V_p^(1) = MultiHead(I(P), I(P), I(P)),
    O_p^(1) = FFN(V_p^(1)),

where V_p^(1) is the first-layer result of the multi-head self-attention and I(·) is the embedding function of the input. The input embedding for a word w_i is the sum of its word embedding and position embedding. O_p^(1) denotes the output of the first-layer feed-forward network (FFN). For the other layers:

    V_p^(n) = MultiHead(O_p^(n-1), O_p^(n-1), O_p^(n-1)),
    O_p^(n) = FFN(V_p^(n)),

where n = 2, ..., N_G. We apply layer normalization to each sublayer by LayerNorm(x + Sublayer(x)).
Q is encoded in the same way. After N_G identical layers, we obtain the final representations O_p and O_q, which are the encoded persona and the encoded query, respectively.
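As a concrete illustration, the scaled dot-product attention at the heart of MultiHead(Q, K, V) can be sketched in plain Python. This is a minimal sketch for one head and one query position; the learned per-head projection matrices and the head split/concatenation of Vaswani et al. (2017) are omitted:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector and one head.
    MultiHead(Q, K, V) splits the hidden dimension into heads, runs
    this per head with learned projections, and concatenates results."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)  # one weight per key position
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

# Self-attention over three embedded persona tokens I(P):
P = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(P[0], P, P)  # one row of V_p in the first encoder layer
```

In self-attention, as in the encoder above, the query, keys, and values all come from the same sequence; in the decoder's P-Attn and Q-Attn, the keys and values come from the encoded persona O_p and the encoded query O_q instead.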

Tuple-Interaction Decoder
In the decoding phase, there are three types of information, persona P, query Q, and response Y, which make up a tuple (P, Q, Y). Accordingly, three inter-sentence relations need to be considered: (1) The alignment between Q and Y is beneficial for yielding better results (Bahdanau et al., 2014). (2) As the persona is composed of several sentences and describes different aspects, we need to find the most relevant persona information according to the relations between P and Y. (3) We also want to know whether the query needs to be answered with the given persona. Thus we should take the relations between P and Q into account.
Considering the above factors, we design a two-layer tuple-interaction mechanism in the decoder, as shown in the first part of Figure 2. There are three attentions in two layers: query attention (Q-Attn) and persona attention (P-Attn) in the first layer, and persona-query attention (PQ-Attn) in the second layer. The decoder is composed of N_G such identical layers. For the first layer:

    E^(1) = MultiHead(I(Y), O_p, O_p),
    F^(1) = MultiHead(I(Y), O_q, O_q),
    T^(1) = MultiHead(F^(1), O_p, O_p),
    O_dec^(1) = FFN(E^(1) + T^(1)),

where E^(1) and F^(1) are the results of the first-layer P-Attn and Q-Attn, T^(1) is the result of the first-layer PQ-Attn, and O_dec^(1) denotes the first-layer output. Note that Y here is masked to ensure that decoding depends only on the known words (Vaswani et al., 2017). Repeatedly, for the other layers:

    E^(n) = MultiHead(O_dec^(n-1), O_p, O_p),
    F^(n) = MultiHead(O_dec^(n-1), O_q, O_q),
    T^(n) = MultiHead(F^(n), O_p, O_p),
    O_dec^(n) = FFN(E^(n) + T^(n)),

where n = 2, ..., N_G. After N_G layers, the decoder output O_dec is projected from the hidden size to the vocabulary size and followed by a softmax function to get the word probabilities:

    Prob^(1) = softmax(O_dec · W_3 + b_3),

where W_3 is a hidden size × vocabulary size weight matrix and b_3 is the bias term with vocabulary-size dimension. Prob^(1) denotes the output distribution of the first stage, from which we obtain the response prototype Ŷ^(1).

Delete: Consistency Matching Model
The goal of the consistency matching model D is to reveal word-level consistency between the persona texts and the response prototype, so that inappropriate words can be deleted from the prototype. This model is trained to estimate the sentence-level entailment category (Bowman et al., 2015) of a response for the given persona texts, which includes entailment, neutral, and contradiction. The key idea is that if the category is not entailment, we can delete the most contributing words by replacing them with a special mask token, thus giving the model a chance to rephrase. The attention weights can measure each word's contribution.
The architecture of our consistency matching model is shown in Figure 3. From bottom to top are the self-attentive encoding layer, cross attention layer, and consistency matching layer.
As described in Section 3.2, the self-attentive encoder (SAE(·)) performs self-attention over the input to get sentence representations. Because the task of consistency matching is quite different from dialogue generation, we do not share the encoders between the generator G and the matching model D:

    Ā = SAE_D(P),
    B̄ = SAE_D(Ŷ^(1)),

where Ā is a hidden size × n matrix, Ā = [ā_1, ā_2, ..., ā_n] and B̄ = [b̄_1, b̄_2, ..., b̄_m]. Here n and m are the numbers of words in the persona P and the response prototype Ŷ^(1), respectively. We apply an average pooling strategy (Liu et al., 2016; Chen et al., 2017) to get the summary representations:

    ā = AvgPool(Ā),
    b̄ = AvgPool(B̄),

and we get the response attention weights and the attentive response representation by:

    W_b = softmax(ā^T B̄),
    B = B̄ W_b^T,

where W_b is the attention weights and B is the response representation. Similarly, we get W_a and A.
Once A and B are generated, three matching methods (Chen et al., 2017) are applied to extract relations: concatenation, element-wise product, and element-wise difference. The results of these matching methods are concatenated and fed into a multi-layer perceptron, which has three layers with tanh activations in between. The output is followed by a softmax function to produce probabilities.
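The three matching methods can be sketched as follows (plain Python over small list vectors; in the model they operate on the hidden representations A and B):

```python
def matching_features(a, b):
    """Combine two sentence vectors via the three matching methods of
    Chen et al. (2017): concatenation, element-wise product, and
    element-wise difference. The result is fed to the 3-layer MLP."""
    prod = [x * y for x, y in zip(a, b)]
    diff = [x - y for x, y in zip(a, b)]
    return a + b + prod + diff  # [a; b; a*b; a-b]

feats = matching_features([1.0, 2.0], [3.0, -1.0])
# feats == [1.0, 2.0, 3.0, -1.0, 3.0, -2.0, -2.0, 3.0]
```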
In the inference process, as shown in Figure 3, the response attention weights W_b are leveraged to identify the inconsistent words, which will be deleted. In practice, we use a simple heuristic rule for deleting words: as long as the category is not entailment, we delete the 10% of words (at least one word) with the highest attention weights in the prototype Ŷ^(1). In this way, we get the masked prototype Ŷ^(2).
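The deleting heuristic can be sketched as follows. This is a minimal sketch; whether the 10% is rounded up or down is not specified in the text, so we round up here, and the function name is ours:

```python
import math

def delete_words(prototype, attn_weights, category, ratio=0.10, mask="<mask>"):
    """If the predicted category is not 'entailment', replace the 10%
    of words with the highest attention weight (at least one word)
    with a mask token; otherwise keep the prototype unchanged."""
    if category == "entailment":
        return list(prototype)
    k = max(1, math.ceil(ratio * len(prototype)))
    top = set(sorted(range(len(prototype)),
                     key=lambda i: attn_weights[i], reverse=True)[:k])
    return [mask if i in top else w for i, w in enumerate(prototype)]

tokens = ["i", "wish", "i", "could", "be", "a", "nurse", "."]
weights = [0.05, 0.05, 0.05, 0.05, 0.10, 0.10, 0.50, 0.10]
masked = delete_words(tokens, weights, "neutral")
# masked == ['i', 'wish', 'i', 'could', 'be', 'a', '<mask>', '.']
```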

Rewrite: Masked Prototype Rewriter
The rewriter R takes the masked prototype and the persona texts as input and outputs the final response.
R is also a transformer decoder, similar to the decoder of G in Section 3.2, but with a minor difference: the masked prototype is already close to the target response, so direct attention between the prototype and the target response is needless. The architecture of R can be seen in the third part of Figure 2, which can be formalized as:

    E_r^(1) = MultiHead(I(Ŷ^(2)), I(Ŷ^(2)), I(Ŷ^(2))),
    T_r^(1) = MultiHead(E_r^(1), O_p, O_p),
    O_rw^(1) = FFN(T_r^(1)),

where O_p is the encoded persona. After N_R identical layers, the same generation process as in G is applied to O_rw, and we get the final response Ŷ^(3).

Training
The consistency matching model D is trained separately from the prototype generator G and the rewriter R. As aforementioned, the matching model D is trained in a natural language inference fashion on the DNLI dataset (Welleck et al., 2019), a task that has been well defined by previous studies (Bowman et al., 2015; Chen et al., 2017; Gong et al., 2018). We minimize the cross-entropy loss between the outputs of D and the ground-truth labels.
G and R share the same training targets. We train them by standard maximum likelihood estimation. Notice that there are two different deleting strategies in training: (1) In the warm-up phase, because G can hardly generate high-quality prototypes at this period, we randomly delete each word, with a 10% probability, from the prototype.
(2) After that, the trained consistency matching model D is leveraged to delete words.
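The warm-up strategy amounts to independent random masking of prototype words, which might be sketched as follows (the function name is ours):

```python
import random

def warmup_delete(prototype, p=0.10, mask="<mask>", seed=None):
    """Warm-up deleting: each word of the prototype is independently
    replaced by the mask token with probability p (10% in the text)."""
    rng = random.Random(seed)
    return [mask if rng.random() < p else w for w in prototype]
```

Once the matching model D is trained well enough, this random masking is swapped for the attention-based deleting of the Delete stage.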

Datasets
We carried out the persona-based dialogue generation experiments on the publicly available Persona-Chat dataset (Zhang et al., 2018). Furthermore, we trained the consistency matching model on the recently released Dialogue Natural Language Inference (DNLI) dataset (Welleck et al., 2019).
We show the statistics of the Persona-Chat dataset in Table 1. The DNLI dataset (Welleck et al., 2019) is an enhancement to Persona-Chat: it is composed of persona-utterance pairs from Persona-Chat, and these pairs are further labeled as entailment, neutral, and contradiction. Some statistics of this dataset are given in Table 2.

Compared Models
To the best of our knowledge, this is an early work in modeling explicit persona consistency. To show the effectiveness of our model, we mainly compare it with persona-based dialogue models: • S2SA This is an RNN-based attentive seq2seq model (Bahdanau et al., 2014). It only takes the query as input.
• Per-S2SA This is a seq2seq model that prepends all persona texts to the query as input (Zhang et al., 2018).
• GPMN Generative Profile Memory Network is an RNN-based model that encodes persona texts as individual memory representations in a memory network (Zhang et al., 2018).
• Per-CVAE This is a memory augmented CVAE model to exploit persona texts for diverse response generation (Song et al., 2019a).
• Transformer Different from the RNN-based models, the transformer is a self-attention-based sequence transduction model (Vaswani et al., 2017). The persona texts are concatenated to the query to serve as its input.

Experimental Settings
All the RNN-based baseline models are implemented as two-layer LSTM networks with a hidden size of 512. For the Transformer, the hidden size is also set to 512, and both the encoder and the decoder have 3 layers. The number of heads in multi-head attention is 8, and the inner-layer size of the feed-forward network is 2048. The word embeddings are randomly initialized, and the embedding dimension of all models is set to 512. Our model applies the same parameter settings as the Transformer. The numbers of layers N_G = N_D = N_R = 3. G and R share the word embeddings, but the matching model D uses independent embeddings. We use token-level batching with a size of 4096. Adam is used for optimization, and the warm-up steps are set to 10,000. We implemented the model in OpenNMT-py (Klein et al., 2017).
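Adam with warm-up steps is commonly paired, in OpenNMT-py, with the inverse-square-root ("Noam") schedule of Vaswani et al. (2017). The text does not name its exact schedule, so the following is an assumption-labeled sketch of that common choice:

```python
def noam_lr(step, d_model=512, warmup=10000, factor=1.0):
    """Noam schedule (assumption: the paper does not state its exact
    schedule). The learning rate rises linearly for `warmup` steps,
    then decays proportionally to step**-0.5, scaled by d_model**-0.5."""
    step = max(step, 1)
    return factor * d_model ** -0.5 * min(step ** -0.5,
                                          step * warmup ** -1.5)
```

With warmup=10,000, the rate peaks exactly at step 10,000 and decays afterwards.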

Evaluation Metrics
In the evaluation, there are two essential factors to consider: persona consistency and response quality.We apply both human evaluations and automatic metrics on these two aspects to compare different models.

Human Evaluation
We recruited five professional annotators from a third-party company. These annotators have high-level language skills but know nothing about the models. We sampled 200 persona-query-response tuples per model for evaluation. Duplicated queries (such as greetings that appear more than once) were not sampled twice.
First, we evaluate the persona consistency of a response.The annotators are asked to decide whether the response is consistent with the given persona.0 indicates irrelevant or contradictory and 1 indicates consistent (Const.).
Second, we evaluate the quality of a response on three conventional criteria: fluency (Fluc.), relevance (Relv.), and informativeness (Info.). Each aspect is rated on a five-point scale, where 1, 3, and 5 indicate unacceptable, moderate, and excellent performance, respectively; 2 and 4 are used when unsure.

Dziri et al. (2019) have shown that the natural-language-inference-based entailment ratio can be used as an indicator of dialogue consistency. Here we trained two well-performing NLI models, DIIN (Gong et al., 2018) and BERT (Devlin et al., 2019), to automatically predict the category of persona-response pairs, and we calculated the ratio of entailment as an additional reference for persona consistency. In our experiments, DIIN and BERT achieved 88.78% and 89.19% accuracy on the DNLI test set, respectively, compared with the previous best result of 88.20%. The trained models are then leveraged for calculating entailment ratios. The two model-based entailment ratios are abbreviated as Ent_diin and Ent_bert.
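Given the labels predicted by a trained NLI model over all persona-response pairs of the test set, the entailment ratio reduces to a simple count:

```python
def entailment_ratio(predicted_labels):
    """Fraction of persona-response pairs that the NLI model labels
    'entailment', used as an automatic persona-consistency indicator
    (Dziri et al., 2019)."""
    if not predicted_labels:
        return 0.0
    hits = sum(1 for label in predicted_labels if label == "entailment")
    return hits / len(predicted_labels)

ratio = entailment_ratio(["entailment", "neutral",
                          "entailment", "contradiction"])
# ratio == 0.5
```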

Automatic Metrics
For dialogue quality, we follow Zhang et al. (2018) in using perplexity (PPL) to measure the fluency of responses; lower perplexity means better fluency. Besides, we use Dist-1 / Dist-2 (Li et al., 2016a), the ratio of distinct uni-grams / bi-grams, to examine the model's ability to generate diverse responses.
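Dist-n can be computed as follows (a sketch over tokenized responses, pooling n-grams across all generated responses as in Li et al. (2016a)):

```python
def distinct_n(responses, n):
    """Dist-n: number of distinct n-grams divided by the total number
    of n-grams across all generated responses."""
    total, seen = 0, set()
    for tokens in responses:
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        seen.update(grams)
    return len(seen) / total if total else 0.0

resps = [["i", "love", "dogs"], ["i", "love", "cats"]]
d1 = distinct_n(resps, 1)  # 4 distinct unigrams / 6 total
d2 = distinct_n(resps, 2)  # 3 distinct bigrams / 4 total
```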

Main Results
We report the main evaluation results in Table 3.
Compared with the baseline methods, our GDR model obtains the highest consistency score of 49.2% in human evaluation, which is significantly better than the other methods. The target responses in the sampled data are also annotated, and 65.4% of them express persona information. Moreover, the two model-based entailment ratios, which are calculated on the whole test set, also prove the effectiveness of our GDR model: although the two NLI models differ in their results, our GDR model ranks first under the evaluation of both DIIN and BERT. For dialogue quality, our proposed model has a remarkably lower perplexity of 16.7 than all the baseline methods; an analysis can be found in Section 4.6. Besides, our Dist-2 metric is significantly better even than the Per-CVAE model, which is designed to generate diverse responses.
Additionally, we carried out pairwise response comparisons to assess the dialogue quality gains. We report the results in Table 4. While the GDR model significantly improves persona consistency, it still generates responses whose quality matches the Transformer and GPMN.

More Analysis
As the proposed model achieves better performance than the baseline methods, we turn to ablation tests to further quantify the contributions made by different components. We ablated our model in several different ways: • GR It removes the matching model D, i.e., it generates a prototype and rewrites it directly.
• GRdR This approach replaces the matching model D with 10% random deleting (Rd), to see whether the masked prototype extracted by our matching model D is beneficial.
• G Our model's generator, without further consistency matching and rewriting.
• T A transformer generator that removes the tuple-interaction mechanism of Section 3.2 and directly concatenates the persona texts to the query. This model is equivalent to the vanilla transformer.
We report the results in Table 5. First, we look into which components contribute to consistency. As seen, from T, G, GR to GDR, every step yields an observable improvement in Const., indicating the effectiveness of our model's design. Both the tuple-interaction in G and the rewriting process in R contribute to the improvement of persona consistency. The GRdR approach, identical to GDR except for its random deleting strategy, serves as a foil to our GDR model, which indicates that a well-learned consistency matching model is of great benefit to our three-stage framework for generating persona-consistent dialogues.

Second, we investigated the improvement in perplexity. As we can see, the one-stage transformer approaches G and T have a perplexity higher than 26. In contrast, after we add the rewriter R, the perplexity of all approaches declines significantly, no matter whether there is a matching model D. Lower perplexity means lower cross-entropy, which indicates that the models assign higher probability to the ground-truth words. To some extent, perplexity corroborates the human evaluation results on consistency. One reason for this improvement could be that the rewriter works like a denoising autoencoder (Vincent et al., 2008): it can focus on reconstructing the missing information of the sequence itself, rather than learning to map a sequence to an entirely different one.
We observed that the relevance scores of GR, GRdR, and G are slightly inferior to that of T; even the GDR model is not significantly better than T on the relevance score. One plausible explanation is that all these models are specially designed for integrating persona information: although they obtain much better consistency scores, this may come at the cost of relevance.
Moreover, we compared GDR's response quality with the three ablated models and report the results in Table 6. As we can see, the deleting and rewriting stages, which are designed for improving consistency, also have a positive effect on dialogue quality.
At last, we present some generated examples in Table 7, together with the visualization of attention weights from the matching model D. In the first case, although the generated prototype is neutral regarding the persona, the word "nurse" is still masked according to our strategy, and after the rewriting stage the final response expresses persona. In the second case, the prototype is potentially contradictory to the persona, and the keyword is successfully deleted by the matching model D. In the third case, the prototype is consistent with the persona, and no word is deleted; as a result, the final output response is the same as the output of the no-deletion model GR. In these cases, both consistency and quality are improved after the final rewriting.
Conclusion

In this paper, we presented a three-stage framework, Generate-Delete-Rewrite, for persona-consistent dialogue generation. Our method adopts the transformer architecture and integrates a matching model to delete inconsistent words. Experiments are carried out on publicly available datasets. Both human evaluations and automatic metrics show that our method achieves remarkably good performance. In the future, we plan to extend our approach to improving the consistency of multi-turn dialogues.

Figure 1 (example content): Persona: I'm a recording engineer; I live in California. Query: Hi, Kevin here. I love Mexican food. Prototype: Hi I am Tom. I am in Colorado. Where do you live? Masked prototype: Hi I am Tom. <mask> <mask> <mask> <mask>. Where do you live? Rewritten response: Hi I am Tom. I'm an engineer in California. Where do you live?

Figure 3: The architecture of our consistency matching model. "•" and "−" denote element-wise product and difference, respectively. The dotted line shows the inference process, including consistency matching and word deleting.
Table 7 (example responses):

Case 1 — Persona: i. My mother is a dentist. ii. I'm currently looking for a job. Query: I want to become a physical therapist.
  w/o Delete: I want to become a nurse like you.
  Transfmr: I hear you. I am a nurse and a mom.
  Per-S2S: What kind of business?
  GPMN: I am a professional painter, what do you do?
  DeepCopy: What do you do for a living?
  Per-CVAE: I guess I would buy you one

Case 2 — Persona: i. Gymnastics is my favorite sport. ii. I'm a gold medalist olympian. Query: I am a mini van driver and I drive around.
  Gen&Del: That's awesome. I also love to ski.
  Rewrite: That's cool. I am a champion gymnast.
  w/o Delete: That's awesome. I am going to ski.
  Transfmr: Oh. I am still in high school.
  Per-S2S: I drive a big van, but around the house.
  GPMN: I drive a FIAT. I love it.
  DeepCopy: What do you do for a living?
  Per-CVAE: That is why you like dogs? Tell me.

Case 3 — Persona: i. I like to grill outdoors. ii. I have been retired for years. Query: Old enough to work in construction. you?
  Gen&Del: I'm retired. I spend time outdoors.
  Rewrite: I'm retired. I spend most of my time outdoors.
  w/o Delete: I'm retired. I spend most of my time outdoors.
  Transfmr: I am a boy and work in the construction. you?
  Per-S2S: No, I am a student.
  GPMN: I am a teacher, I love to be a teacher.
  DeepCopy: I work in the construction industry.
  Per-CVAE: I am a retired officer I love my bike ride.
Rewrite stage). The italics denote the inputs of each stage, and the boldfaces denote the outputs. All the attentions (attn) here refer to multi-head attention. For the sake of brevity, we omitted some layers of the transformer in this figure.

Table 1: Some statistics of the Persona-Chat dataset. Valid denotes Validation and Q-R denotes Query-Response.

Table 2: Key statistics of the DNLI dataset.

Table 3: Results of human evaluations (on the left) and automatic metrics (on the right). The Dist-1 & 2 are scaled by 10^-2. Significance tests (t-test) are performed, and our method is significantly better than all methods on most metrics (p-value < 0.05), with the exceptions marked by †. We also present two model-based ratios, Ent_diin and Ent_bert, as an additional reference for persona consistency assessments. Note that the automatic metrics are calculated on the whole test set. *The sampling process in CVAE leads to very unstable PPL.

Table 4: GDR response quality gains over other baseline methods in a pairwise human judgment.

Table 5: Results of the ablation study. GDR is significantly better than the ablated approaches, with one exception marked by ‡.

Table 6: Pairwise human judgment on response quality.

Table 7: Example responses from different models, with a visualization of the consistency matching weights. Strikethrough words are the masked words in the Delete stage. The w/o Delete is the ablated model GR in Section 4.6, and Transfmr is short for Transformer.