Modeling Personalization in Continuous Space for Response Generation via Augmented Wasserstein Autoencoders

Variational autoencoders (VAEs) and Wasserstein autoencoders (WAEs) have achieved noticeable progress in open-domain response generation. By introducing latent variables in continuous space, these models can capture utterance-level semantics, e.g., topic and syntactic properties, and thus generate informative and diversified responses. In this work, we improve the WAE for response generation. In addition to utterance-level information, we also model user-level information in the continuous latent space. Specifically, we embed user-level and utterance-level information into two multimodal distributions and combine them into a mixed distribution. This mixed distribution serves as the prior distribution of the WAE in our proposed model, named PersonaWAE. Experimental results on a large-scale real-world dataset confirm the superiority of our model in generating informative and personalized responses: it outperforms state-of-the-art models in both automatic and human evaluations.


Introduction
Over the past decade, a myriad of conversational systems have been proposed in the field of artificial intelligence and have achieved remarkable success in various industry scenarios, such as e-commerce assistants and the chit-chat machine XiaoIce (Shum et al., 2018). Based on the domains involved, existing work can be categorized into two groups, i.e., vertical-domain (Glas et al., 2015) and open-domain, where the former aims to complete a specific task with limited domain knowledge while the latter involves a wide range of topics in conversation. In this work, we focus on the latter and aim to generate a natural and meaningful response for a given conversation context. Most recent works build upon the sequence-to-sequence model (Bahdanau et al., 2014) and can generate fluent responses. However, they suffer from the notorious "universal response" issue, i.e., generating safe and uninformative responses (e.g., "I don't know") (Li et al., 2015).
* Equal contribution. Ordering is decided by a coin flip. † Corresponding author.
To address the aforementioned shortcoming, advanced conversational systems capture and incorporate extra information at two different levels, i.e., the utterance level and the user level. For utterance-level information modeling, previous works mainly build models upon variational autoencoders (VAEs) (Kingma and Welling, 2014). Introducing a latent variable that models utterance-level information such as topic and syntactic structure (Bowman et al., 2015) allows responses with diverse and informative words to be generated. Conditional variational autoencoders (CVAEs) (Serban et al., 2017) have been shown to be effective at addressing the "universal response" issue in various open-domain response generation settings. For user-level information modeling, existing models either implicitly learn user information from training data, e.g., by learning user embeddings (Li et al., 2015), or explicitly collect user profiles as an accurate form of personalization (Zhang et al., 2017). Although user profiles are more effective and accurate than user embeddings, obtaining them is time-consuming and costly, or even impossible when user privacy must be protected.
We propose the PersonaWAE model, a novel conversational system that simultaneously captures user-level personalization and utterance-level information as extra hints for generating better responses. Our model is motivated by the following two points: 1) existing embedding-based persona modeling methods cannot discover properties shared among users and train the embedding of each user independently, which is equivalent to learning a very high-dimensional persona embedding and thus leads to low data efficiency or requires a large amount of training data for each user; 2) thanks to the semantic modeling ability of WAEs, rich persona information can be captured in the continuous space (Li et al., 2019).
To this end, we build our model upon the state-of-the-art conversation model WAE (Gu et al., 2019) to model both utterance-level and user-level information. In this setting, user embeddings are used as the condition of the prior distribution of the latent variable, yielding a conditional WAE prior. Meanwhile, to further model and fuse the utterance-level and user-level information, we extend these simple prior distributions to Gaussian mixture distributions (GMDs; the reasons are detailed in Section 2.2). After obtaining the two GMDs, we combine them into a mixed distribution and use it as the prior distribution of PersonaWAE. To evaluate the effectiveness of our proposed personalized conversational system, we collect a large dataset with user identifications. Experimental results on both automatic and human evaluation demonstrate that our proposed model outperforms several strong methods and generates personalized responses for different users.
In a nutshell, our contributions can be summarized as follows:
• We propose a novel personalized Wasserstein autoencoder (PersonaWAE) for open-domain response generation, which incorporates both utterance-level and user-level information;
• We propose to mix two different types of Gaussian mixture distributions as the prior distribution of our model, scaling up the capability of the latent variable;
• Experiments performed on a large dataset demonstrate the effectiveness of our proposed model, which achieves new state-of-the-art results.

VAE and WAE
Conditional VAE (CVAE) is a popular framework for dialogue generation (Shen et al., 2017). CVAE, as an extension of VAE, supervises the generation process under an extra condition c. To train a CVAE model, the log-likelihood objective $\log p_\theta(x \mid c)$ is maximized by pushing up its variational lower bound:

$$\log p_\theta(x \mid c) \ge \mathbb{E}_{q_\phi(z \mid x, c)}\left[\log p_\theta(x \mid z, c)\right] - \mathrm{KL}\left(q_\phi(z \mid x, c) \,\|\, p_\theta(z \mid c)\right) \quad (1)$$

where $q_\phi(z \mid x, c)$ and $p_\theta(z \mid c)$ represent the approximated conditional posterior and the conditional prior, respectively, and $\log p_\theta(x \mid z, c)$ represents the probability of reconstructing x conditioned on both z and c. Here KL(·) is the KL-divergence term, which serves as a regularizer that encourages the approximated posterior to approach the prior, i.e., a standard Gaussian distribution. E[·] is the reconstruction term, reflecting how well the decoder performs. The KL-divergence can be replaced by the Wasserstein distance, which many experiments have shown to be superior to the KL-divergence. The conditional VAE based on the Wasserstein distance is called the conditional WAE.
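As a rough numerical illustration of the two terms in the lower bound, the sketch below computes the closed-form KL-divergence between two diagonal Gaussians and subtracts it from a reconstruction log-probability. The function names and the plain-list representation are assumptions made for this example; they are not taken from the paper, and the reconstruction term is passed in as a precomputed number.

```python
import math

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL(q || p) for two diagonal Gaussians.

    Each argument is a list with one entry per latent dimension;
    variances are given as log-variances.
    """
    kl = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        kl += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return kl

def lower_bound(recon_log_prob, mu_q, logvar_q, mu_p, logvar_p):
    """Single-sample estimate of the bound: reconstruction term minus KL term."""
    return recon_log_prob - kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)

# Posterior N(1, 1) against a standard Gaussian prior N(0, 1) in one dimension.
print(kl_diag_gaussians([1.0], [0.0], [0.0], [0.0]))
print(lower_bound(-2.0, [1.0], [0.0], [0.0], [0.0]))
```

The KL term vanishes only when posterior and prior coincide, which is why it acts as a regularizer on the recognition network.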

Gaussian Mixture Model
In VAE/WAE frameworks, a variable in latent space is introduced to model the information in a dataset. As demonstrated in Figure 1, each gray point represents a data sample while colorized points refer to noise data. If the latent variable obeys a single Gaussian distribution, noise samples (colorized points) will result in inferior responses. Alternatively, a Gaussian mixture distribution can model the dataset more accurately. Conventional VAE and WAE models usually set the prior distribution of the latent variable to a multivariate Gaussian distribution, formulated as

$$p(z) = \mathcal{N}(z; \mu, \sigma^2 I)$$

where $\mu$ and $\sigma^2$ represent the mean and variance of $\mathcal{N}$. In our model, we use a Gaussian mixture distribution as the prior distribution of the latent variable z, written as

$$p(z) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(z; \mu_k, \sigma_k^2 I)$$

where $\mu_k$ and $\sigma_k^2$ are the parameters of the k-th Gaussian component of this multimodal distribution and $\pi_k$ is its weight.
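To make the difference between the two priors concrete, the following minimal sketch evaluates and samples an isotropic Gaussian mixture using only the Python standard library. All function names are hypothetical, and the mixture parameters in the usage lines are toy values chosen for illustration rather than anything reported in the paper.

```python
import math
import random

def gaussian_logpdf(z, mu, var):
    """Log density of the isotropic Gaussian N(mu, var * I) at point z."""
    d = len(z)
    sq = sum((zi - mi) ** 2 for zi, mi in zip(z, mu))
    return -0.5 * (d * math.log(2 * math.pi * var) + sq / var)

def gmm_logpdf(z, weights, mus, variances):
    """Log density of sum_k pi_k N(z; mu_k, var_k * I), computed with log-sum-exp."""
    terms = [math.log(w) + gaussian_logpdf(z, m, v)
             for w, m, v in zip(weights, mus, variances)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def sample_gmm(weights, mus, variances, rng):
    """Pick a component according to its weight, then sample from that Gaussian."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    std = math.sqrt(variances[k])
    return [rng.gauss(m, std) for m in mus[k]]

# Toy two-mode prior in two dimensions.
weights, mus, variances = [0.5, 0.5], [[-3.0, 0.0], [3.0, 0.0]], [1.0, 1.0]
rng = random.Random(0)
print(sample_gmm(weights, mus, variances, rng))
# A point at the second mode is far more likely under the mixture
# than under a single standard Gaussian centered at the origin.
print(gmm_logpdf([3.0, 0.0], weights, mus, variances),
      gaussian_logpdf([3.0, 0.0], [0.0, 0.0], 1.0))
```

With a single component and weight 1, `gmm_logpdf` reduces to `gaussian_logpdf`, matching the K = 1 special case.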

Problem Formulation
We follow conventional personalized conversation generation research (Li et al., 2016) and formulate the response generation task with the following notation. A dataset with user dialogue history is first given:

$$\mathcal{D} = \{(c_i, r_i, m_i)\}_{i=1}^{N}$$

where $c_i$, $r_i$, and $m_i$ represent the dialogue context, the response candidate, and the user-specific dialogue utterance of the i-th of N samples, respectively. Note that we use the user dialogue utterance to extract personalization information in multi-turn response generation. The context is formulated as $c_i = (s_1, s_2, \cdots, s_j, \cdots, s_{n_i})$, where $s_j$ represents the utterance in the j-th turn of the dialogue context and there are $n_i$ utterances in the context. The response is $r_i = (r_{i,1}, r_{i,2}, \cdots, r_{i,n_r})$, where $n_r$ is the length of the target response $r_i$. Our task is then defined as learning a mapping function f(·) from the given dataset that yields a personalized response according to the given dialogue context and the user dialogue history.

Proposed Model
As shown in Figure 2, our proposed personalized Wasserstein autoencoder (PersonaWAE) consists of a user personalization modeling component and a WAE, which are elaborated as follows.

User Personalization Modeling
Personalization Gaussian Mixture Distribution. To model the user-level information in the continuous space, we build the Personalization Gaussian Mixture Distribution (Personalization GMD).
We train vector representations of users (Li et al., 2016) as the user personalizations to facilitate personalized response generation. We denote the trained user embeddings as $U = \{u_1, u_2, \ldots, u_i\}$, where $u_i$ is the vector representation of the i-th user (User i).
Based on the user embeddings U, we exploit the learned user personalizations in the latent space. Specifically, the conditional prior distribution of the WAE is a Gaussian mixture distribution (GMD) conditioned on the learned user embeddings, namely the personalization GMD. We formulate the conditional prior as:

$$p(z \mid u_i) = \sum_{k=1}^{K} v_k \, \mathcal{N}(z; \mu_k, \sigma_k^2 I)$$

where $\{\pi_k, \mu_k, \sigma_k^2 I\}_{k=1}^{K}$ parameterize the GMD (which degenerates to a single Gaussian distribution when K = 1) and the parameters of the k-th Gaussian component are $\{\mu_k, \sigma_k^2\}$. $v_k$ is a component indicator with class probabilities $\pi_1, \pi_2, \ldots, \pi_K$, where $\pi_k$ is the mixture coefficient of the k-th component of the GMD. Following Gu et al. (2019), these parameters are computed from the user embedding. To obtain $v_k$, we replace exact sampling with the Gumbel-Softmax reparametrization, where $b_i$ is a sample from U(0, 1) and $\tau$ is the softmax temperature that controls the sampling process.
Fusion of Personalization in Decoder. We also incorporate the user embeddings into the decoder. Concretely, the user personalization is used as an input at each update step so that the decoder obtains user-specific information for generating personalized responses. Meanwhile, $u_i$ is updated by back-propagating the loss signal during training. As user personalizations are high-level representations, we further introduce a gating strategy to dynamically balance the user personalization and the current conversation information.
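The Gumbel-Softmax step used to obtain the component indicator can be sketched as follows. This is the standard formulation of the trick, in which uniform samples b are turned into Gumbel noise via -log(-log b); it is assumed here, not confirmed by the text, that the paper's computation has this form. The function name and the `logits` argument (unnormalized log mixture weights) are hypothetical.

```python
import math
import random

def gumbel_softmax(logits, tau, rng):
    """Draw a relaxed one-hot vector over mixture components.

    logits: unnormalized log mixture weights, one per component.
    tau:    softmax temperature; smaller values give sharper samples.
    """
    noisy = []
    for a in logits:
        b = rng.random()                         # b ~ U(0, 1)
        b = min(max(b, 1e-12), 1.0 - 1e-12)      # keep both logarithms finite
        g = -math.log(-math.log(b))              # Gumbel(0, 1) noise
        noisy.append((a + g) / tau)
    top = max(noisy)                             # subtract the max for stability
    exps = [math.exp(x - top) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
v = gumbel_softmax([0.0, 1.0, 2.0], tau=0.1, rng=rng)
print(v)
```

A low temperature such as 0.1 pushes the output toward a one-hot vector, while a high temperature spreads it toward uniform weights; in both cases the result stays differentiable with respect to the logits when implemented in an autodiff framework.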

Personalized Wasserstein Autoencoder
Our proposed PersonaWAE consists of an encoder, prior and recognition networks, and a decoder.
Encoder. The encoder encodes a given context with a bidirectional RNN with GRU cells. Through the encoder, the context $c_i = (s_1, s_2, \cdots, s_j, \cdots, s_{n_i})$ is represented as a sequence $V_c$ of concatenated forward and backward vectors. Similarly, the target response $r_i$ is represented by the concatenation of states from another bidirectional RNN with GRU cells, denoted as $v_{r,i}$. The vector sequence $V_c$ is further processed by an RNN, which yields a vector representation $v_{c,i}$. Note that $v_{c,i}$ corresponds to c in Equation 1 while $v_{r,i}$ corresponds to x.
Recognition and Prior Networks. We use a recognition network to learn the posterior $q_\phi(z \mid x, c)$. We hypothesize that the approximated variational posterior follows an isotropic multivariate Gaussian distribution $\mathcal{N}(\mu, \sigma^2 I)$, where I is the identity matrix, so that the covariance is diagonal. Modeling $q_\phi(z \mid x, c)$ thus reduces to learning $\mu$ and $\log \sigma^2$, which are computed from x and c by the recognition network shown in Figure 2, with trainable parameters $W_o$ and b.
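One simple way such a recognition network could look is sketched below, purely as an assumption: a single linear map over the concatenation of x and c whose output is split into the mean and the log-variance, followed by the usual reparameterized sample. The source does not give the exact form of the network, so the linear layer, the split, and the function names are hypothetical.

```python
import math
import random

def linear(weights, bias, inputs):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def recognition_sample(x_vec, c_vec, W_o, b, rng):
    """Map [x; c] to (mu, log sigma^2) and draw z = mu + sigma * eps."""
    joint = x_vec + c_vec                        # list concatenation [x; c]
    out = linear(W_o, b, joint)
    d = len(out) // 2
    mu, logvar = out[:d], out[d:]                # first half: mean, second half: log-variance
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
         for m, lv in zip(mu, logvar)]
    return z, mu, logvar

# Toy usage: one latent dimension, so W_o has two output rows.
z, mu, logvar = recognition_sample([0.2], [0.3],
                                   [[0.5, -0.5], [0.0, 0.0]], [0.0, 0.0],
                                   random.Random(0))
print(z, mu, logvar)
```

Predicting the log-variance instead of the variance keeps the standard deviation positive without any explicit constraint.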
To construct the prior distribution, we superpose two conditional GMDs: the first is the personalization GMD described above, while the second, called the context GMD, is conditioned on the context c. Similar to the personalization GMD, the parameters of the context GMD $p_\phi(z_c \mid c)$ are defined as $a_k$, $\mu_k$, and $\log \sigma_k^2$, which are learned from the context representation.
Fusion of two GMDs. To fuse the personalization GMD and the context GMD, we superpose the two distributions by weighted addition, where $W_f$ is a trainable parameter. The resulting distribution, which is also a GMD, is the prior distribution of PersonaWAE.
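The claim that the weighted addition of two GMDs is again a GMD can be checked with a small sketch. Here each mixture is a list of (weight, mean, variance) triples and the fusion weight is a single scalar w in [0, 1]; both choices are assumptions for illustration, since the source only states that the fusion uses a trainable parameter.

```python
def fuse_gmds(user_gmd, context_gmd, w):
    """Return the mixture w * p_user + (1 - w) * p_context.

    Each GMD is a list of (weight, mean, variance) triples whose weights sum to 1.
    The result keeps every component and rescales its weight, so it is a GMD too.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("fusion weight must lie in [0, 1]")
    fused = [(w * pi, mu, var) for pi, mu, var in user_gmd]
    fused += [((1.0 - w) * pi, mu, var) for pi, mu, var in context_gmd]
    return fused

user_gmd = [(0.3, [0.0], 1.0), (0.7, [1.0], 1.0)]
context_gmd = [(1.0, [5.0], 2.0)]
prior = fuse_gmds(user_gmd, context_gmd, w=0.4)
print(prior)
print(sum(pi for pi, _, _ in prior))  # the fused weights still sum to 1
```

In a trainable setting, w could for instance be produced by passing a learned scalar through a sigmoid so that it stays in the valid range.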
Decoder. The decoder is a one-layer GRU network that outputs the generated sentence, as shown on the right-hand side of Figure 2. Taking the generation of response $r_i$ as an example, the initial state of the decoder is obtained through a trainable matrix $W_d$ that performs dimension transformation. To facilitate the combination of the user personalization $u_i$ and the decoder hidden states, we incorporate a gate module (Tu et al., 2018) in our model, where f is the sigmoid function and $o_t$ refers to the decoder output at time step t. After $o_t$ is processed by the softmax operation, the response $r_i$ is generated.
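A generic sigmoid gate that balances a decoder hidden state against a user vector is sketched below. The exact gate equation is not recoverable from the text, so the element-wise interpolation, the parameter names `gate_weights` and `gate_bias`, and the requirement that both vectors share one dimension are all assumptions of this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(hidden, user, gate_weights, gate_bias):
    """Element-wise gate: out = g * hidden + (1 - g) * user.

    The gate g is computed per dimension from the concatenation [hidden; user].
    hidden and user must have the same length in this sketch.
    """
    joint = hidden + user                        # list concatenation
    out = []
    for h, u, row, b in zip(hidden, user, gate_weights, gate_bias):
        g = sigmoid(sum(w * x for w, x in zip(row, joint)) + b)
        out.append(g * h + (1.0 - g) * u)
    return out

# With zero weights and bias the gate is 0.5, so the output is the average.
print(gated_fusion([2.0], [0.0], [[0.0, 0.0]], [0.0]))
# A large positive bias opens the gate toward the hidden state.
print(gated_fusion([2.0], [0.0], [[0.0, 0.0]], [50.0]))
```

Because the gate depends on the current hidden state, the contribution of the user vector can vary from one decoding step to the next.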

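Training
To train our proposed model, we optimize the following objective, which simultaneously minimizes the Wasserstein distance between $p_\theta(z \mid c, u_i)$ and $q_\phi(z \mid x, c)$ and maximizes the probability of reconstructing x:

$$\min_{\theta, \phi} \; -\mathbb{E}_{q_\phi(z \mid x, c)}\left[\log p_\theta(x \mid z, c)\right] + W\left(q_\phi(z \mid x, c) \,\|\, p_\theta(z \mid c, u_i)\right)$$

Experiments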

Dataset
To evaluate the effectiveness of our proposed personalized WAE model (PersonaWAE), we collect a dataset from an open online chatting forum, Weibo, which contains massive multi-turn conversation sessions and user identification information. Overall, there are 31,128,520 utterances in the raw dataset with corresponding user identifications. To construct the personalized conversation systems, we retrieve users with more than 14 utterances from the raw Weibo corpus. We also filter out conversation sessions with fewer than 2 turns, since we train multi-turn conversation systems. We use a sliding window of size 3 to construct the dialogue sessions, so each session contains 3 utterances. By doing so, there are 336,342 conversation sessions in the cleaned corpus. We remove emojis from utterances and use NLTK for tokenization. Then, we randomly split the Weibo corpus into 335,342/5,000/5,000 sessions as training/validation/testing sets. For each session, the last utterance is the target response for generation while the other utterances are treated as context.
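The session construction described above can be sketched as follows. The input format (a list of conversations, each a list of `(user_id, text)` turns), the function name, and the choice to apply the utterance-count filter to the user who wrote the target response are assumptions of this example; the paper does not specify them.

```python
from collections import Counter

def build_sessions(conversations, min_user_utterances=15, window=3):
    """Slide a fixed-size window over each conversation.

    The last utterance of each window is the target response and the
    preceding ones form the context. A window is kept only if the
    responding user has at least `min_user_utterances` utterances overall
    (15 corresponds to "more than 14").
    """
    counts = Counter(uid for conv in conversations for uid, _ in conv)
    sessions = []
    for conv in conversations:
        # Conversations shorter than the window produce no sessions.
        for start in range(len(conv) - window + 1):
            chunk = conv[start:start + window]
            responder, response = chunk[-1]
            if counts[responder] < min_user_utterances:
                continue
            context = [text for _, text in chunk[:-1]]
            sessions.append({"user": responder,
                             "context": context,
                             "response": response})
    return sessions

# Toy conversation with a lowered threshold so that both users pass the filter.
toy = [[("a", "hi"), ("b", "hello"), ("a", "how are you"), ("b", "fine")]]
for session in build_sessions(toy, min_user_utterances=2):
    print(session)
```

A four-turn conversation yields two overlapping sessions, which is how a sliding window multiplies the number of training examples relative to the number of raw conversations.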

Baselines
Table 1 (criteria for human evaluation):
Informativeness: Does the response contain informative words?
Personalization: Does the response resemble any utterance in the user history?

In our experiments, we compare our proposed method with the following highly related and strong baselines.
Seq2Seq, the vanilla sequence-to-sequence model with attention mechanism (Bahdanau et al., 2014), which is widely used in various generation tasks.
Persona, a typical and recent neural personalized conversation system, which incorporates user-level representations in the generation process (Li et al., 2016).
Adaptation, the domain adaptation solution for building personalized conversation systems (Zhang et al., 2017). We adapt the model to our scenario and use tf-idf to obtain personal words as the user information.
CVAE, the conventional CVAE model trained with KL-divergence. We change our model to use KL-divergence as the training loss.
RL-Persona, a personalized conversational system that takes advantage of deep reinforcement learning. We apply the method to our scenario in the same way as Adaptation.
DiaWAE-GMD, the state-of-the-art model for open-domain conversation generation (Gu et al., 2019), which employs a Gaussian mixture prior in the WAE.

Settings
The dimension of word embeddings is set to 200, and the embeddings are initialized with pre-trained word2vec vectors. The vocabulary comprises the 31,000 most frequent words. The sentence encoder and the context encoder in our PersonaWAE model are two bi-directional RNNs with GRU cells. The decoder is a one-layer RNN with GRU cells. The hidden state sizes of both the GRU encoder and decoder are set to 256. Each user is allocated a user-level vector representation of dimension 512. We set the mini-batch size to 100. The SGD optimizer is used to train the autoencoder module with an initial learning rate of 1.0, and a learning rate decay strategy is employed. We use the RMSprop optimizer (Hinton et al., 2012) to update the parameters of the generator and the discriminator, where the initial learning rates are set to 5e-5 and 1e-5, respectively. A gradient penalty is used for training the discriminator (Gulrajani et al., 2017). The value of τ in the Gumbel-Softmax is set to 0.1.

Evaluation Metrics
To evaluate the generated responses, we adopt the following metrics, which are widely used in existing research.
Embedding Metrics. To capture the degree of semantic matching between generated responses and the ground truth, we perform evaluations in embedding space. Consistent with a previous study (Gu et al., 2019), we compute the similarity between the bag-of-words (BOW) embedding representations of the generated results and the reference. In particular, we calculate three metrics: 1) Greedy (BOW-Greedy), i.e., greedily matching words in two utterances based on their cosine similarities, with the total score averaged across all words (Rus and Lintean, 2012); 2) Average (BOW-Average), the cosine similarity between the averaged word embeddings of the two utterances (Mitchell and Lapata, 2008); 3) Extrema (BOW-Extrema), the cosine similarity between the largest extreme values among the word embeddings of the two utterances (Forgues et al., 2014). We report the maximum BOW embedding scores of the 10 sampled responses for each testing context.
Overlap-based Metric. We use the BLEU score (Papineni et al., 2002) to measure n-gram overlap between the ground truth and the generated response. Specifically, we follow the conventional setting in previous work (Gu et al., 2019) and compute BLEU scores using a smoothing technique (smoothing 7). For each testing context, we sample 10 responses from the models and compute their BLEU scores, i.e., n-gram precision (BLEU-Precision), n-gram recall (BLEU-Recall), and n-gram F1 (BLEU-F1).
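The three embedding metrics described above can be sketched in a few lines, assuming each utterance has already been converted to a list of word vectors. The function names are hypothetical, the greedy score is computed in its common symmetric form (averaging both matching directions), and the toy vectors stand in for real pre-trained embeddings.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def average_score(hyp, ref):
    """Cosine similarity between the mean word vectors of two utterances."""
    mean = lambda vecs: [sum(col) / len(vecs) for col in zip(*vecs)]
    return cosine(mean(hyp), mean(ref))

def extrema_score(hyp, ref):
    """Cosine similarity between per-dimension extreme values."""
    extrema = lambda vecs: [max(col, key=abs) for col in zip(*vecs)]
    return cosine(extrema(hyp), extrema(ref))

def greedy_score(hyp, ref):
    """Match each word to its most similar word in the other utterance."""
    def one_way(src, tgt):
        return sum(max(cosine(s, t) for t in tgt) for s in src) / len(src)
    return 0.5 * (one_way(hyp, ref) + one_way(ref, hyp))

def best_of_samples(metric, samples, ref):
    """Report the maximum score over several sampled responses."""
    return max(metric(sample, ref) for sample in samples)

reference = [[1.0, 0.0], [0.0, 1.0]]
samples = [[[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]]
for metric in (average_score, extrema_score, greedy_score):
    print(metric.__name__, best_of_samples(metric, samples, reference))
```

Taking the maximum over sampled responses rewards a model when at least one of its samples is close to the reference, which suits latent-variable models that are expected to produce varied outputs.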
Human Evaluation. We also employ human evaluation to assess the responses generated by our model and the baselines. Three well-educated annotators are hired to evaluate the quality of the generated responses, and the evaluation is conducted in a double-blind fashion. In total, 200 randomly sampled responses generated by each model are rated by each annotator on three aspects, i.e., readability, informativeness, and personalization. Details of the criteria are given in Table 1. Note that it is very difficult to judge whether a generated response resembles the style of the corresponding user history utterances, and thus we rate personalization on {0, 1}, representing bad or normal. The other criteria are scored from 1 to 3, i.e., bad, normal, and good.

Results and Analysis
In this section, we perform automatic evaluations and human evaluation to measure the quality of the generated responses quantitatively. Meanwhile, we also conduct a qualitative study to intuitively analyze the generated results. Table 2 presents the results of automatic and human evaluation.

The Effect of WAE
WAE can effectively improve the quality of responses but fails to capture personalization. As mentioned before, we intend to improve response quality by using WAE. Unsurprisingly, Seq2Seq achieves the worst performance. Comparing DiaWAE-GMD with CVAE, we observe that DiaWAE-GMD significantly improves BLEU scores and BOW scores over CVAE, as shown in Table 2. Such results indicate that the Wasserstein distance and adversarial training enhance model learning and alleviate the KL-vanishing issue in VAEs, leading to better generated responses, which is consistent with previous research (Gu et al., 2019). Besides, the human evaluation results in Table 4 further illustrate that DiaWAE-GMD fails to model the personalization of different users because it lacks user-level information learning.
The number of components in the conditional Gaussian mixture distribution significantly alters model performance. Table 3 presents the ablation results on the influence of the value of K in the personalization Gaussian mixture distribution. When K ≤ 3, model performance improves as K increases, which suggests that more components in the GMD are helpful for modeling user personalization. However, for K > 3, model performance drops slightly as K increases. A potential reason is that a GMD with three components is already effective enough for modeling personalization, while a more complex GMD may suffer from insufficient training data.

The Influence of Personalization Modeling
User embeddings substantially improve the quality of generated responses and introduce personalization for different users. Comparing PersonaWAE with DiaWAE-GMD, we find that incorporating user personalization in the decoding step substantially enhances the personalization score in human evaluation, which means that user embeddings and their combination in the decoder have a positive influence on response quality.
Incorporating personalization in the conditional GMD prior is more effective than combining personalization in the decoder. As shown in Table 2, the Persona model only achieves results comparable with Seq2Seq in terms of BLEU scores and BOW scores. Comparing PersonaWAE with DiaWAE-GMD, incorporating personalization in both the decoder and the latent space yields a performance improvement. PersonaWAE does not outperform DiaWAE-GMD on BLEU-Recall; a possible explanation is that PersonaWAE models personalization information, so its generation may be more constrained.

Discussion
Overall, PersonaWAE outperforms all other baselines on both automatic and human evaluations. Especially for personalization modeling, PersonaWAE achieves a noticeable improvement over the strong baselines DiaWAE-GMD and RL-Persona. These results support that our proposed PersonaWAE is effective in generating personalized responses. We also observe that fusing the personalization GMD and the context GMD as the conditional prior is useful, as indicated by the results shown in Table 4. Table 5 illustrates the responses generated by different models for a given context. We observe that our proposed model can generate readable responses that carry personalization information. Table 6 shows a few example responses generated by altering the user personalization information. With different user representations, the generated responses change, which supports that the personalization representation introduced in our model helps learn user-level information. Although it is difficult to evaluate the personalization of generated responses and there is still a gap between generated responses and human-written ones, an improvement in response quality is achieved. Moreover, we observe that our proposed model might generate overly long and repetitive sentences in extreme cases. A potential reason is the relatively short dialog history of each user. Besides, explicitly extracting knowledge and user personalization from conversation history is also promising. These observations point out directions for future work.

Related Work
Constructing an automatic conversation system is an attractive and prevalent task within the artificial intelligence community. Previous studies mainly focus on vertical domains by applying rule- and template-based models (Pieraccini et al., 2009). Later on, with the explosive growth of data, open-domain conversation models became promising, whereas conventional methods designed for vertical domains are difficult to scale to the open domain. Given this, various data-driven approaches have been proposed for modeling open-domain conversation, including retrieval-based methods (Yan et al., 2016; Tao et al., 2019), statistical machine translation models (Ritter et al., 2011), and neural networks (Serban et al., 2015). Recently, building personalized conversation systems has attracted increasing attention, e.g., implicitly learning user personalizations from dialog history (Li et al., 2015) or explicitly collecting and modeling user profiles as personalizations for generating personalized responses (Zhang et al., 2017). To improve wording diversity, CVAE models (Serban et al., 2017) are well investigated for open-domain response generation. As an extension of CVAE, the Wasserstein autoencoder (Gu et al., 2019) is also used for open-domain response generation to address the issues of posterior collapse and vanishing latent variables. We build our model upon the advantages of both WAE and personalization modeling for personalized response generation.

Conclusion
Open-domain response generation is a challenging task, which involves automatically composing a response with informative words and personalization. Although promising progress has been made in wording informativeness, there still exists a noticeable gap between generated responses and those created by humans, especially in personalization modeling. To fill this gap, we propose a personalized Wasserstein autoencoder (PersonaWAE) for response generation, where the WAE module improves informativeness by using a continuous latent variable with a GMD prior, and user vector representations learned from dialog history are used to introduce personalization information. Experimental results on a large dataset indicate that our proposed model generates better responses and outperforms existing models under both automatic and human evaluations.