PEDNet: A Persona Enhanced Dual Alternating Learning Network for Conversational Response Generation

Endowing a chatbot with a personality is essential to delivering more realistic conversations. Various persona-based dialogue models have been proposed to generate personalized and diverse responses by utilizing predefined persona information. However, generating personalized responses remains challenging because predefined persona information is often leveraged insufficiently. To alleviate this problem, we propose a novel Persona Enhanced Dual Alternating Learning Network (PEDNet) that aims to produce more personalized responses in various open-domain conversation scenarios. PEDNet consists of a Context-Dominated Network (CDNet) and a Persona-Dominated Network (PDNet), which are built upon a common encoder-decoder backbone. CDNet learns to select a proper persona as well as to ensure the contextual relevance of the predicted response, while PDNet learns to enhance the utilization of persona information when generating the response by weakening the disturbance of specific content in the conversation context. CDNet and PDNet are trained alternately using a multi-task training approach so that PEDNet acquires both capabilities. Both automatic and human evaluations on the recently released Persona-chat dialogue dataset demonstrate that our method delivers more personalized responses than baseline methods.


Introduction
Generating human-like conversations with chatbots has been a long-term goal in artificial intelligence. End-to-end neural generative models have attracted much attention as potential solutions for open-domain dialogue systems, which aim to conduct smooth and natural conversations and gain trust from users. Sequence-to-Sequence (Seq2Seq) models (Sutskever et al., 2014; Vinyals and Le, 2015; Shang et al., 2015) have achieved success in generating semantically and syntactically well-formed responses for a given conversation context. Nevertheless, Seq2Seq tends to generate short and generic responses such as "I don't know" and "I'm fine" because it predicts the response using a maximum-likelihood estimation (MLE) objective (Li et al., 2016a). Moreover, the generated responses often lack a consistent personality, since Seq2Seq is trained on conversations collected from different speakers (Li et al., 2016b; Zhang et al., 2018).
To tackle these issues, various methods have been proposed for diverse and consistent response generation by endowing an open-domain chatbot with a coherent persona. For example, Zhang et al. (2018) introduced persona profiles into response generation, where the persona information is used to enrich responses; their experiments demonstrated improvements over models that do not leverage persona information. Lian et al. (2019) further proposed a persona selection method, which focuses on selecting the optimal persona to generate personalized responses efficiently. These efforts have attracted considerable attention and proven that leveraging additional persona information can promote user engagement in open-domain dialogue systems. Table 1 gives several examples of persona-based conversation. The generated responses are supposed to be consistent with the predefined persona profile, which specifies the personality that the chatbot needs to present in the conversation. However, expressing proper and rich persona information in responses is usually not easy in practice. The model needs to carefully consider the specific content (e.g., "football games" in context X1, "the street band performing" in context X3) in the context to select which persona to use as the character of the current conversation. On the other hand, Table 1 shows that responses are mainly dominated by the persona of the speaker in many conversation scenarios, which means that once the persona is determined, the model needs to focus more on how to express this persona effectively. In that case, the specific information in the conversation context becomes less significant for response generation.
In this paper, we propose a novel Persona Enhanced Dual Alternating Learning Network (PEDNet) to achieve appropriate and rich persona expression in responses. PEDNet is composed of a Context-Dominated Network (CDNet) and a Persona-Dominated Network (PDNet), built upon a common encoder-decoder backbone with shared parameters. Specifically, CDNet is a general persona-based dialogue model equipped with a Memory Network (Sukhbaatar et al., 2015) that is capable of reading and conditioning on the persona profile. It is reinforced with a persona-aware representation to select the persona properly. Different from CDNet, PDNet generates responses by conditioning on the optimal persona pre-labeled by BERT (Devlin et al., 2019) and a general representation of the conversation context that contains no specific content (e.g., "Do you see...today?" of context X1 in Table 1). PDNet is specially designed to emphasize the contribution of the persona in dialogue modeling by ignoring the specific content of the context, while CDNet ensures the semantic relevance between the context and the generated response. PEDNet learns the capabilities of both sub-networks by training CDNet and PDNet alternately with a multi-task training approach.
The main contributions of this work are summarized as follows: • We propose a novel Persona Enhanced Dual Alternating Learning Network (PEDNet) to learn an appropriate and rich persona expression in the generated response, by explicitly distinguishing the effects of the specific content in conversation context on the persona selection and persona incorporation respectively.
• The proposed PEDNet consists of two backbone-shared sub-networks with different learning purposes: CDNet learns to select the persona properly, while PDNet focuses on enhancing the utilization of the persona information. CDNet and PDNet are trained alternately with a multi-task training approach so that PEDNet benefits from both capabilities.
• Experimental results on the Persona-chat dataset show that our model can generate more personalized responses than the baselines. As a side contribution, we will release our code 1 to benefit the research community.

Related Work
The success of Seq2Seq has motivated research on improving the quality of response generation. However, the tendency to generate generic or inconsistent responses remains a problem in this domain. One promising approach to alleviating this challenge is to find a better objective function (Li et al., 2016a). Besides, learning inherent speaker attributes is another way to improve the diversity of responses and ensure contextual consistency. Li et al. (2016b) first proposed a persona-based model using user embeddings, which projects each user into a vector and feeds the vector to the decoder at each decoding step. Zhang et al. (2017) employed a two-phase training approach, which initializes the model using large-scale data and then fine-tunes it to generate personalized responses. Kottur et al. (2017) explored a response generation model conditioned on both speakers and context history. Nevertheless, these models cannot explain how the personality expressed in the response is captured, since all the information about the user is encoded as a dense vector for decoding.
To maintain a coherent personality, Qian et al. (2018) designed a model that explicitly expresses a profile value in the response according to its pre-specified persona profile. Zhang et al. (2018) created the Persona-chat dataset and proposed two generative models to incorporate persona information into responses. Yavuz et al. (2019) used a copy mechanism that allows the decoder to hierarchically attend to and copy from the external persona profile in addition to the conversation context. A persona selection mechanism was proposed by Lian et al. (2019), where both prior and posterior distributions over personas are used to facilitate persona selection. Song et al. (2019) proposed a memory-augmented architecture to exploit persona information from the context, combined with a conditional variational autoencoder to generate diverse responses.
So far, these studies all generate responses conditioned on both the context and the persona without distinction or emphasis, which often results in insufficient utilization of persona information. Different from previous works, we design a dual alternating learning network that explicitly distinguishes the different effects of specific content in the context on the two aspects of persona selection and persona incorporation, so as to realize an effective persona expression.

Task Definition
We follow the definition of persona in Zhang et al. (2018) and work on the explicit textual persona. The task can be formulated as follows: given a context X = (x_1, ..., x_n) with n words and a persona profile formed of M persona texts P = {p_1, p_2, ..., p_M}, the system is supposed to generate a response Y = (y_1, ..., y_m) with m words that is coherent with this persona profile. The response generation process can be briefly stated as:

Ŷ = argmax_Y P(Y | X, P).

Encoder-Decoder Backbone
Our model is implemented with a GRU-based (Chung et al., 2014) Seq2Seq backbone, which consists of two components: an encoder and a decoder. The encoder encodes a context X into a sequence of hidden states. We define u^(X), h^(X) = Encode(X), where u^(X) denotes the final state of the GRU and h^(X) = (h_1, ..., h_n) denotes the outputs of the GRU cell at all steps. The decoder takes as input a context vector c_t and the embedding of the previously decoded word y_{t-1} to update its hidden state s_t with another GRU. The context vector c_t is an attention-weighted combination of h^(X) that dynamically attends to context information during decoding (Bahdanau et al., 2015). We define this operation as c_t = Attention(s_{t-1}, h^(X)). Once the state vector s_t is obtained, the decoder generates a token according to the output probability distribution over the vocabulary:

P(y_t | y_{<t}, X) = softmax(W_o [s_t; c_t] + b_o),

where [·; ·] represents a vector concatenation.
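As an illustration, the Attention operation above can be sketched as additive (Bahdanau-style) attention. The minimal NumPy version below is a sketch under assumed dimensions; the parameter names Wa, Ua, and va are hypothetical, not from the paper:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(s_prev, h, Wa, Ua, va):
    """Additive attention: score_i = va^T tanh(Wa s_prev + Ua h_i).

    s_prev: previous decoder state (d,); h: encoder outputs (n, d).
    Returns the context vector c_t and the attention weights.
    """
    scores = np.array([va @ np.tanh(Wa @ s_prev + Ua @ h_i) for h_i in h])
    alpha = softmax(scores)                  # attention weights over encoder steps
    c_t = (alpha[:, None] * h).sum(axis=0)   # weighted sum of encoder outputs
    return c_t, alpha
```

In the backbone, c_t would then be concatenated with s_t before the output projection, as in the vocabulary distribution above.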

Persona Enhanced Dual Alternating Learning Network
Figure 1: The architecture of the proposed PEDNet. CDNet and PDNet share a common encoder-decoder backbone and are trained alternately using a multi-task training approach. X is the conversation context, {X̃_i}_{i=1}^K is the congeneric contexts set retrieved from the corpus, and p* is the optimal persona pre-labeled by BERT.

PEDNet is proposed to improve the persona expression in responses and consists of two sub-networks: PDNet and CDNet. The architecture overview of PEDNet is given in Figure 1. First, CDNet is used to
select a persona and ensure that the generated response is semantically relevant to the current context. Besides, in order to prevent the specific content in the context from dispersing the model's attention during persona incorporation, PDNet takes as input a pre-given optimal persona and a general representation of the conversation context without the specific content. PEDNet benefits from both PDNet and CDNet to learn an effective persona expression through a multi-task training approach during the training phase. In the testing phase, PEDNet generates responses directly based on CDNet.

Specific Content Extraction
As mentioned in Mou et al. (2016), Pointwise Mutual Information (PMI) (Church and Hanks, 1990) is an appropriate statistic for specific-word prediction, which is also adopted in our work to extract the context words that have high mutual information against the persona profile. Suppose that we have an input sample (X, Y, P). Given a word w_x in X and a word w_p in P, the PMI(w_x, P) for the word w_x against the persona profile P is calculated as:

PMI(w_x, w_p) = log [ p(w_x, w_p) / (p(w_x) p(w_p)) ],    PMI(w_x, P) = max_{w_p ∈ P} PMI(w_x, w_p).

Finally, we choose the specific words Q by setting a threshold 2 to filter out words with low PMI scores, and limit the maximum number of specific words to 5.
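A rough sketch of this extraction step, assuming unigram and co-occurrence counts estimated from the corpus; the estimator, the zero threshold, and the max-aggregation over persona words are assumptions of this sketch, not the paper's exact recipe:

```python
import math
from collections import Counter

def pmi_specific_words(context_words, persona_words, corpus, cooccur,
                       top_k=5, threshold=0.0):
    """Pick context words with high PMI against the persona profile.

    corpus: Counter of unigram counts; cooccur: Counter of (w_x, w_p)
    co-occurrence counts. A word's score against the whole profile is
    taken as its max PMI over persona words (an assumption here).
    """
    n = sum(corpus.values())
    n_pairs = sum(cooccur.values())

    def pmi(wx, wp):
        joint = cooccur.get((wx, wp), 0) / n_pairs if n_pairs else 0.0
        if joint == 0.0:
            return float("-inf")  # never co-occurred: lowest possible score
        return math.log(joint / ((corpus[wx] / n) * (corpus[wp] / n)))

    scored = [(max(pmi(wx, wp) for wp in persona_words), wx)
              for wx in set(context_words)]
    keep = [w for s, w in sorted(scored, reverse=True) if s > threshold]
    return keep[:top_k]
```

The selected words form the set Q used both for the persona-aware representation in CDNet and for the `<KEY>` replacement in PDNet.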

Context-Dominated Network
This network generates responses conditioned on both the context X and the persona profile P = {p_1, p_2, ..., p_M}. Figure 1 depicts the details of CDNet. Specifically, CDNet needs to select a persona p_i and then generate a response by integrating p_i. The details are as follows. We convert X and each persona p_i into vector representations using a context encoder and a persona encoder respectively:

u^(X), h^(X) = Encode_ctx(X),    u^(p_i) = Encode_per(p_i).

Then we compute a specific content representation q = Σ_{w∈Q} Ψ(w) by summing the word embeddings in Q, where Ψ(·) maps a word to its embedding. After that, we obtain a persona-aware representation z by concatenating u^(X) and q. We select a persona with a multi-layer Memory Network:

prob_i^k = softmax(m^k · u^(p_i)),    o^k = Σ_i prob_i^k u^(p_i),    m^{k+1} = m^k + o^k,

where k denotes the layer number and m^1 = z. In this work, a 3-layer Memory Network is used 3. We select the persona with the highest probability in the last layer of the memory network and fit it to the optimal persona: p̂ = p_j, where j = argmax_i(prob_i^3). Finally, the decoder generates a response conditioned on the selected persona p̂. The hidden state of the decoder at time t is:

s_t = GRU(s_{t-1}, [e(y_{t-1}); c_t; u^(p̂)]),

where e(y_{t-1}) is the word embedding of y_{t-1}.
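The multi-hop persona selection can be sketched as a standard end-to-end memory network read (Sukhbaatar et al., 2015); the hop-update rule below is a generic MemN2N sketch, not necessarily the paper's exact parameterization, and the dimensions are placeholders:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def select_persona(z, persona_vecs, hops=3):
    """Multi-hop memory read over persona representations (a sketch).

    z: persona-aware query vector (concatenation of the context state
    and the specific-content embedding in the paper, here pre-projected
    to the memory dimension). persona_vecs: (M, d) matrix of persona
    encodings. Returns the index of the most probable persona in the
    last hop, plus the final probability distribution.
    """
    m = z
    for _ in range(hops):
        prob = softmax(persona_vecs @ m)  # attention over the M personas
        o = persona_vecs.T @ prob         # weighted read from memory
        m = m + o                         # query update for the next hop
    return int(np.argmax(prob)), prob
```

Using a single matrix for both addressing and reading is a simplification; MemN2N typically uses separate key and value embeddings per hop.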

Persona-Dominated Network
Compared with CDNet, the inputs of PDNet are adjusted. Specifically, we remove the specific content in X by replacing the specific words Q with the special symbol "<KEY>", turning X into a revised form X̃. Since a single context may not provide sufficient general content, we further retrieve K−1 extra revised contexts similar to X̃ from the corpus using Gensim 4, to construct a congeneric contexts set {X̃_i}_{i=1}^K. Meanwhile, we use a pre-trained BERT model to calculate the cosine similarity between Y and each persona text in P, so as to select the text p* with the highest similarity score as the optimal persona for the current conversation. In calculating cosine similarity, the sentence representations of Y and the persona texts are obtained by averaging the outputs of the second-to-last hidden layer of BERT. Thereafter, we obtain a new model input ({X̃_i}_{i=1}^K, Y, p*), where p* represents the optimal persona of the current conversation turn. The details of PDNet are also given in Figure 1. Similar to CDNet, {X̃_i}_{i=1}^K and p* are first encoded respectively:

u^(X̃_i) = Encode_ctx(X̃_i),    u^(p*) = Encode_per(p*).

These vectors {u^(X̃_i)}_{i=1}^K are then transformed into a general representation of the revised context X̃ through an average pooling operation: s_0 = (1/K) Σ_{i=1}^K u^(X̃_i). The vector s_0 is used as the initial hidden state of the decoder.
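The persona pre-labeling and the pooled initial decoder state can be sketched as below; `response_vec` and `persona_vecs` stand in for the averaged BERT second-to-last-layer sentence representations, which this sketch takes as given rather than computing:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two sentence vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def label_optimal_persona(response_vec, persona_vecs):
    """Pick the persona text whose sentence vector is most similar to
    the gold response Y; its index is used as the pre-label p*."""
    sims = [cosine(response_vec, p) for p in persona_vecs]
    return int(np.argmax(sims))

def general_context_state(context_vecs):
    """Initial decoder state s_0: average pooling over the K congeneric
    revised-context encodings {u^(X~_i)}."""
    return np.mean(context_vecs, axis=0)
```

Because the labeling uses the gold response Y, it is available only at training time, which is why PEDNet decodes through CDNet at test time.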
In the decoding process, the decoder generates the response word by word by incorporating the optimal persona p*. The hidden state of the decoder at time t is:

s_t = GRU(s_{t-1}, [e(y_{t-1}); c̃_t; u^(p*)]),

where c̃_t is computed by a series of sub-attentions, and each sub-attention is responsible for attending to the details of one of the congeneric contexts.

Multi-Task Training
As mentioned in Luong et al. (2016), multi-task learning for Seq2Seq can improve the performance of a task by using other related tasks. In our work, PDNet and CDNet are tied together in the training phase through a multi-task training approach. We use two independent tasks to train these two networks alternately, so that the shared encoder-decoder backbone benefits from both: • The Context-Dominated task: we expose CDNet to {X, Y, P} training examples.
• The Persona-Dominated task: we expose PDNet to {{X̃_i}_{i=1}^K, Y, p*} training examples. In each batch, all training data is sampled from one task only: we randomly select one of the two tasks, choosing the Persona-Dominated task with probability γ and the Context-Dominated task with probability 1 − γ.
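The alternating batch schedule can be sketched as a simple Bernoulli draw per batch; the training-loop bodies are elided since they depend on the model code:

```python
import random

def sample_task(gamma, rng=random):
    """Choose which task supplies the next batch: the Persona-Dominated
    task with probability gamma, otherwise the Context-Dominated task."""
    return "persona-dominated" if rng.random() < gamma else "context-dominated"

def training_schedule(num_batches, gamma, seed=0):
    """Draw the per-batch task sequence for alternating training.
    All examples within a batch come from the chosen task only."""
    rng = random.Random(seed)
    return [sample_task(gamma, rng) for _ in range(num_batches)]
```

Because the two sub-networks share the encoder-decoder parameters, every batch updates the common backbone regardless of which task was drawn.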
The loss function is the cross-entropy error between the predicted token distribution and the gold distribution in the training corpus. As mentioned in Section 3.3.3, the optimal persona text is labeled in advance, so an additional cross-entropy loss is added to ensure the accuracy of persona selection in CDNet.

Dataset
We conduct our experiments on a recently created dataset, the Persona-chat dataset (Zhang et al., 2018). In Persona-chat, each conversation was constructed by crowd-workers who were randomly paired. To produce personalized conversations, each worker was asked to chat with their partner conditioned on a given persona profile. This dataset contains 164,356 utterances over 10,981 dialogues and has 1,155 persona profiles, each consisting of a set of persona texts. We adapt this dataset to a single-turn conversation paradigm, where each turn corresponds to a context-response pair and a persona profile. We divide the dataset into 65,490/7,800/7,512 turns for training/validation/testing.

Models for Comparison
We compare the proposed PEDNet with the following baseline models: • Seq2Seq: a vanilla Seq2Seq model with attention, which does not incorporate external persona profiles (Vinyals and Le, 2015).
• Generative Profile Memory Network (GPMN): a persona-based dialogue model, where the persona is stored as memory representations in a Memory Network and fed into the decoder with attention (Zhang et al., 2018). Among these baselines, Seq2Seq is compared to demonstrate the effect of integrating persona profiles in response generation, while the other models, which incorporate persona profiles, are compared to verify the effectiveness of our PEDNet.

Implementation Details
We use PyTorch to implement the proposed model. We apply 2-layer GRUs with 800 hidden units for both the encoder and decoder, with a dropout rate of 0.2. We set the word embedding size to 300 and initialize the embeddings with GloVe (Pennington et al., 2014). The input contexts, generated responses, and persona texts share the same word embedding layer. The vocabulary size is limited to 20,000. The Adam optimizer is used to update the gradients, and gradients are clipped at 5.0. We train the model with a mini-batch size of 128 and an initial learning rate of 0.0005. We choose the model parameters that yield the highest score on each metric on our validation set.

Table 2: The automatic evaluation on the Persona-chat dataset (columns: Distinct-1/2, Persona-R/P/F1, and P-Cover for each model). K is the number of contexts in the congeneric contexts set.

Metrics:
We adopt several automatic metrics to evaluate the models, and the results are summarized in Table 2. Since the goal of the proposed model is to generate personalized responses, we use Distinct-1/2 (Li et al., 2016a) to evaluate the word-level diversity of the generated responses. Besides, we also use Persona Recall/Precision/F1 and Persona Coverage, adapted from previous work and Song et al. (2019) respectively, to evaluate how well the selected persona is expressed. Specifically, given the set of non-stop words W_Y in a generated response Y and the set of non-stop words W_{p_i} in each persona text p_i:

P-Precision = |W_Y ∩ W_{p_i}| / |W_Y|,    P-Recall = |W_Y ∩ W_{p_i}| / |W_{p_i}|,    P-F1 = 2 · P-Precision · P-Recall / (P-Precision + P-Recall),

where |W_Y| and |W_{p_i}| denote the number of words in W_Y and W_{p_i} respectively, and W_Y ∩ W_{p_i} is the set of words shared by W_Y and W_{p_i}. The Persona Coverage (P-Cover) is computed over the N test examples as:

P-Cover = (1/N) Σ max_i Σ_{w_j ∈ W_Y ∩ W_{p_i}} idf_j,

where idf_j is the inverse document frequency of w_j ∈ W_Y ∩ W_{p_i}: idf_j = 1/(1 + log(1 + tf_j)), and tf_j is taken from the GloVe index via Zipf's law (Zhang et al., 2018).

Results: We test several variants of PEDNet to explore the effect of different values of K on model performance. As shown in Table 2, even when K = 1, our model outperforms all baselines in persona expression while also performing well in diversity. When K = 15, our model obtains the best persona expression performance among all models. For instance, comparing PEDNet against PostKS, Persona F1 increases from 0.084 (PostKS) to 0.109 (PEDNet), and Persona Coverage also improves from 0.037 (PostKS) to 0.045 (PEDNet), which demonstrates our model's ability to enrich persona information in responses. In our experiments, when the value of K exceeds 20, the performance of the model begins to decline. This is probably because the congeneric contexts set then includes many contexts without enough similarity to the current context, leading to information scattering in the general representation. To better evaluate our model, the following experiments are all based on K = 15.
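For reference, Distinct-n as used above is the ratio of distinct n-grams to total n-grams over all generated responses; a minimal sketch (whitespace tokenization is an assumption of this sketch):

```python
def distinct_n(responses, n):
    """Distinct-n: number of distinct n-grams divided by the total number
    of n-grams across all generated responses (Li et al., 2016a)."""
    total, distinct = 0, set()
    for resp in responses:
        tokens = resp.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        distinct.update(ngrams)
    return len(distinct) / total if total else 0.0
```

Higher values indicate more diverse output; generic responses such as "I don't know" repeated across the test set drive the score down.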

Table 3: Human evaluation results of each model on Relevance (Rel.) and Persona Expression.

In order to further evaluate the performance of our method, we also perform a human evaluation study. Given a context and a persona profile, we generate responses from all the baseline models and our model. These responses are presented to three human annotators along with the context and the persona profile.
Metrics: Three annotators were asked to score the quality of the responses in terms of Relevance and Persona Expression. The rating ranges from 0 to 2, where 0 means not good, 1 means acceptable, and 2 means excellent. Relevance measures whether the response is fluent and relevant to its context. Persona Expression measures whether the response displays sufficient persona information.
Annotation Statistics: We randomly sample 200 responses from each model, resulting in 800 responses in total for human annotation. We calculate Fleiss' kappa (Fleiss, 1971) to measure inter-rater consistency. Fleiss' kappa for Relevance and Persona Expression is 0.496 and 0.648, indicating "moderate agreement" and "substantial agreement" respectively.
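Fleiss' kappa can be computed directly from its definition given a table of per-item rating counts; a small self-contained sketch:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for inter-rater agreement.

    ratings: list of per-item category counts, e.g. [[0, 0, 3], [1, 2, 0]]
    means item 1 received score "2" from all 3 raters. Every item must
    have the same number of raters.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # overall proportion of assignments falling in each category
    totals = [sum(item[j] for item in ratings) for j in range(n_cats)]
    p_j = [t / (n_items * n_raters) for t in totals]
    p_e = sum(p * p for p in p_j)  # expected chance agreement
    # per-item observed agreement among rater pairs
    p_i = [(sum(c * c for c in item) - n_raters) / (n_raters * (n_raters - 1))
           for item in ratings]
    p_bar = sum(p_i) / n_items
    return (p_bar - p_e) / (1 - p_e)
```

Values above 0.4 are conventionally read as moderate agreement and above 0.6 as substantial agreement, matching the interpretation given above.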
Results: The results are shown in Table 3. Our model outperforms all the baselines on both metrics. Specifically, the Persona Expression score improves from 0.867 (PostKS) to 0.983 (PEDNet), indicating that PEDNet can further enrich the persona information of generated responses. Besides, the Relevance score of PEDNet is also better than that of the other baselines, which shows that PEDNet also ensures that the generated response is semantically relevant to the current context. To investigate the influence of PDNet on persona expression, we also conduct an ablation test in which the training probability of the Persona-Dominated task is varied from 0.1 to 0.9 (refer to γ in Section 3.4).

Ablation Test
The results are shown in Figure 2. As we can see, when increasing the training probability of the Persona-Dominated task, the Persona F1 and Persona Coverage scores reach their best values at a probability of 0.5; as this training probability continues to increase, they begin to drop dramatically. We analyze the reasons as follows. If the training probability of the Persona-Dominated task is very small, the model is mainly trained by CDNet; without the assistance of PDNet, it performs comparatively worse in persona expression, which proves the effectiveness of PDNet. On the contrary, if the training probability of the Persona-Dominated task reaches a high level, the model is mainly trained by PDNet, so it cannot see the complete context during training; as a result, it cannot learn to choose the appropriate persona or generate appropriate responses.

Case Study

We further conduct a case study to analyze the quality of the generated responses; some sampled cases are shown in Table 4. As we can see, the responses predicted by PEDNet contain richer persona information than those of the other baselines. In the first example, the context asks whether the chatbot has sons. There is no persona directly related to "sons" in the given persona profile, except that the first persona "I have a father and a brother" can be regarded as an indirectly related persona on the same conversation topic. Compared with the other baselines, our PEDNet generates a response that is not only semantically appropriate to its context ("I do not have a child") but also displays adequate persona content (i.e., "but I have a brother"). The second example also shows the superiority of PEDNet in persona expression.

Conclusion and Future Work
In this paper, we focus on the task of personalized response generation and propose the Persona Enhanced Dual Alternating Learning Network (PEDNet) to improve the explicit expression of personality. Two networks trained through a multi-task approach are designed to select the appropriate persona according to the conversation context and to enhance the utilization of persona information in response generation. Experimental results on both automatic and human evaluations verify the effectiveness and usefulness of our model. For future work, we plan to incorporate long-distance conversation history to improve persona selection and to extend our persona integration method to multi-turn conversations.