PLATO: Pre-trained Dialogue Generation Model with Discrete Latent Variable

Pre-trained models have proven effective for a wide range of natural language processing tasks. Inspired by this, we propose a novel dialogue generation pre-training framework to support various kinds of conversations, including chit-chat, knowledge grounded dialogues, and conversational question answering. In this framework, we adopt flexible attention mechanisms to fully leverage the bi-directional context and the uni-directional characteristic of language generation. We also introduce discrete latent variables to tackle the inherent one-to-many mapping problem in response generation. Two reciprocal tasks of response generation and latent act recognition are designed and carried out simultaneously within a shared network. Comprehensive experiments on three publicly available datasets verify the effectiveness and superiority of the proposed framework.


Introduction
Dialogue generation is a challenging task due to the limited corpus of human conversations, complex background knowledge, and diverse relationships between utterances. Recently, pre-trained large-scale language models, such as BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019), have achieved prominent success in natural language processing. Such models are usually constructed based on a massive scale of general text corpora, like English Wikipedia or BooksCorpus (Zhu et al., 2015), where distributed representations can be learned automatically from the raw text. By further fine-tuning these representations, breakthroughs have been continuously reported for various downstream tasks, especially those of natural language understanding, such as question answering, natural language inference and so on. This pre-training and fine-tuning paradigm also sheds interesting light on tasks of natural language generation, like dialogue generation. However, previous study demonstrates that directly fine-tuning BERT on small conversation datasets yields unsatisfactory performance (Rashkin et al., 2019), where possible reasons might be three-fold: 1) the underlying linguistic patterns in human conversations can be highly different from those in general text, resulting in a large gap of knowledge or data distributions; 2) the training mode of uni-directional dialogue generation is also distinct from that of bi-directional natural language understanding as applied in BERT; 3) unlike most general NLP tasks, there is a one-to-many relationship in dialogue generation, where a piece of context often has multiple appropriate replies.
In this paper, we propose a new method to tackle the above challenges, aiming to obtain a high-quality pre-training model for dialogue generation. First of all, to reduce the gap between data distributions, large-scale Reddit and Twitter conversations are further utilized to pre-train the generation model (upon the basis of language models pre-trained with general text). Secondly, to mitigate the difference of training modes, a flexible paradigm integrating uni- and bi-directional processing is employed in this work, which is inspired by the unified language modeling of UniLM (Dong et al., 2019). Thirdly, a discrete latent variable is introduced to model the one-to-many relationship among utterances in conversations.
Each value of the latent variable corresponds to the particular conversational intent of one response, denoted as latent speech act. Distinct from controllable dialogue generation based on explicit labels, including emotion, keywords, domain codes and so on (Keskar et al., 2019), our latent variable requires no human annotations and can be learned automatically from the corpus in an unsupervised way. To pre-train the model for dialogue generation, two tasks are introduced in this work - response generation and latent act recognition. Both tasks are carried out simultaneously under a unified network architecture with shared parameters. Conditioned on the context and latent variable, the generation task tries to maximize the likelihood of the target response. At the same time, the recognition task aims to estimate the latent variable w.r.t. the given context and target response. Clearly, accurate estimation of the latent variable is a key factor to boost the quality of response generation.
We conducted experiments on three different kinds of conversation tasks: chit-chat, knowledge grounded conversation, and conversational question answering. Experimental results verify the effectiveness and superiority of our pre-trained model as compared with other state-of-the-art methods. Our pre-trained models and source code have been released on GitHub, hoping to facilitate further research progress in dialogue generation.


Dialogue Generation Pre-training

Given a piece of context, there exist multiple appropriate responses, leading to diverse conversation flows. It is widely recognized that the capability of modeling the one-to-many relationship is crucial for dialogue generation systems (Zhao et al., 2017). To this end, we propose to encode discrete latent variables into transformer blocks for one-to-many relationship modeling, where two reciprocal tasks of response generation and latent act recognition are collaboratively carried out.

Model Architecture
In our model, there are the following three elements: dialogue context c, response r and latent variable z.
• The dialogue context c consists of several history utterances. (For knowledge grounded conversation, background knowledge is conventionally concatenated into the context as well.)
• The response r is one piece of appropriate reply towards the given context.
• The latent variable z is a K-way categorical variable z ∈ [1, K], with each value corresponding to a particular latent speech act in the response.
The probabilistic relationships among these elements are elaborated as follows (graphical illustration shown in Figure 1). Given a context c, there are multiple appropriate speech acts for replies (represented by the latent variable z). Conditioned on the context and one chosen latent speech act, the response is produced as p(r|c, z) (gray lines). Given a pair of context and response, the latent speech act behind them can be estimated as p(z|c, r) (dashed blue lines). As such, our pre-training of dialogue generation contains the following two tasks - response generation and latent act recognition.
We propose a unified infrastructure for the joint learning of both tasks, shown as Figure 2. The backbone of our infrastructure is inspired by the transformer blocks of UniLM (Dong et al., 2019), which support both bi-directional encoding and uni-directional decoding flexibly via specific self-attention masks. Both tasks of response generation and latent act recognition are carried out under the unified network with shared parameters. Their detailed implementations are discussed as follows.
Given the context c and a specific speech act z, the response generation can be estimated as

p(r|c, z) = ∏_{t=1}^{T} p(r_t | c, z, r_{<t}),   (1)

where T is the length of the target response r and r_{<t} denotes the previously generated words. Since response generation is a uni-directional decoding process, each token in the response can only attend to those ahead of it, shown as dashed orange lines in Figure 2. The task of latent act recognition is included to identify the corresponding value of z for the given context and target response in the training data. Latent act recognition shares network parameters with response generation, but has a separate self-attention mask for bi-directional encoding. As shown in Figure 2, with a special mask symbol [M] as input, it keeps collecting information from the context and target response (red lines). In this way, the corresponding speech act for the target response can be recognized as z ∼ p(z|c, r), where p(z|c, r) is the estimated posterior distribution over discrete latent values.
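To make the mask design concrete, the following sketch (an illustrative assumption, not the released implementation) shows how the two self-attention masks could be built: a causal mask over the response for generation, and a fully bi-directional mask for latent act recognition. Token counts here are arbitrary.

```python
# Minimal sketch of the two self-attention masks described above.
import torch

def generation_mask(n_ctx: int, n_resp: int) -> torch.Tensor:
    """Return a (n_ctx+n_resp) x (n_ctx+n_resp) boolean mask (True = may attend).
    Context tokens (including the latent token) attend to each other
    bi-directionally; response tokens attend to the full context and only
    to earlier response tokens (causal decoding)."""
    n = n_ctx + n_resp
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_ctx, :n_ctx] = True                                  # context <-> context
    mask[n_ctx:, :n_ctx] = True                                  # response -> context
    mask[n_ctx:, n_ctx:] = torch.tril(
        torch.ones(n_resp, n_resp, dtype=torch.bool))            # response -> earlier response
    return mask

def recognition_mask(n_ctx: int, n_resp: int) -> torch.Tensor:
    """Fully bi-directional mask used when encoding [M] + context + response."""
    n = n_ctx + n_resp
    return torch.ones(n, n, dtype=torch.bool)

if __name__ == "__main__":
    print(generation_mask(4, 3).int())
```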

Input Representation
For multi-turn conversation modeling, elaborate designs have been made on the input representation in this work. The network input includes the latent variable, the dialogue context and the response. Following the pre-processing of BERT (Devlin et al., 2019), the input text is tokenized with WordPiece (Wu et al., 2016). For each token, its input embedding is the sum of the corresponding token, role, turn and position embeddings. A visual example is shown in Figure 3 and the embeddings are described as follows:
• The input is the concatenation of the latent variable, dialogue context and response. A special end-of-sentence [EOS] token is appended to the end of each utterance for separation. Another begin-of-sentence [BOS] token is added at the beginning of the response, whose final hidden state (i.e., output of the last transformer block) is used to predict the next token during generation.
• Given that z is a K-way categorical variable, its token embedding E_[z] is mapped from the latent embedding space E_z ∈ R^{K×D}. The rest of the tokens in the vocabulary are warm-started with BERT's WordPiece embeddings.
• Role embeddings are employed to differentiate the characters involved in the conversation. The role embedding E_A is added for the response, as well as for dialogue utterances generated by the same character in the context, and role embedding E_B is used for the other character. (For knowledge grounded conversation, E_C is used as the role embedding of background knowledge.)
• For interactive conversations with multi-turn utterances, we employ relative order in the assignment of turn embeddings. The turn embedding of the response is set to E_[0], the turn embedding of its last context utterance is E_[-1], and so on. Using relative turn embeddings instead of absolute ones allows the model to assign turn embedding E_[0] to the response consistently and keeps response generation free from the disturbance of its absolute turn number within the dialogue.
• Position embeddings are added according to the token position within each utterance. Note that for the special latent variable token, the corresponding role, turn and position embeddings are all set to empty.
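As a rough illustration of the embedding composition described above, the following sketch sums token, role, turn and position embeddings; all module names, index conventions and sizes are assumptions for clarity, not the released configuration (the "empty" embedding is approximated with a zeroed padding index).

```python
# Minimal sketch of the input representation: sum of token, role, turn and
# position embeddings, with extra rows appended for the K latent values.
import torch.nn as nn

class PlatoInputEmbedding(nn.Module):
    def __init__(self, vocab_size=30522, n_latent=20, n_roles=4, n_turns=32,
                 max_pos=512, dim=768):
        super().__init__()
        # latent values get their own rows appended after the word-piece vocabulary
        self.token = nn.Embedding(vocab_size + n_latent, dim)
        self.role = nn.Embedding(n_roles, dim, padding_idx=0)  # 0 = empty, 1 = E_A, 2 = E_B, 3 = E_C
        self.turn = nn.Embedding(n_turns, dim, padding_idx=0)  # 0 = empty, 1 = E_[0], 2 = E_[-1], ...
        self.pos = nn.Embedding(max_pos, dim, padding_idx=0)   # 0 = empty, otherwise position within utterance

    def forward(self, token_ids, role_ids, turn_ids, pos_ids):
        # each id tensor has shape (batch, sequence_length)
        return (self.token(token_ids) + self.role(role_ids)
                + self.turn(turn_ids) + self.pos(pos_ids))
```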

Pre-training Objectives
We design three kinds of loss functions for dialogue generation pre-training: negative log-likelihood (NLL) loss, bag-of-words (BOW) loss and response selection (RS) loss. A brief illustration is shown in the last column of Figure 2 and detailed descriptions are provided in this section.

Response Generation
In our model, the response is generated conditioned on the latent variable and the context. The widely adopted NLL loss is employed in the pre-training:

L_NLL = - E_{z ∼ p(z|c,r)} Σ_{t=1}^{T} log p(r_t | c, z, r_{<t}),   (2)

where z is the latent speech act of this training pair (c, r), sampled from the probability distribution p(z|c, r). The posterior distribution over latent values is estimated through the task of latent act recognition:

p(z|c, r) = softmax(W_1 h_{[M]} + b_1) ∈ R^K,   (3)

where h_{[M]} ∈ R^D is the final hidden state of the special mask token, and W_1 ∈ R^{K×D} and b_1 ∈ R^K denote the weight matrix and bias of one fully-connected layer.
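A minimal sketch of Equations (2)-(3) under assumed tensor shapes: the posterior over the K latent values is a softmax over the final hidden state of [M], a latent value is sampled from it, and the NLL loss is computed on the target response.

```python
# Minimal sketch of the latent posterior (Eq. 3) and the NLL loss (Eq. 2).
import torch
import torch.nn.functional as F

def latent_posterior(h_mask, W1, b1):
    """h_mask: (batch, D); W1: (K, D); b1: (K,). Returns p(z|c, r), shape (batch, K)."""
    return F.softmax(h_mask @ W1.t() + b1, dim=-1)

def nll_loss(decoder_logits, target_ids, pad_id=0):
    """decoder_logits: (batch, T, |V|), produced conditioned on the context and
    the sampled latent value; target_ids: (batch, T)."""
    return F.cross_entropy(decoder_logits.transpose(1, 2), target_ids,
                           ignore_index=pad_id)

# usage sketch
# posterior = latent_posterior(h_mask, W1, b1)                 # p(z|c, r)
# z = torch.multinomial(posterior, num_samples=1)              # z ~ p(z|c, r)
# loss_nll = nll_loss(logits_given_z, response_ids)
```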
Besides the classical NLL loss, the bag-of-words loss (Zhao et al., 2017) is also employed to facilitate the training process of the discrete latent variable:

L_BOW = - E_{z ∼ p(z|c,r)} Σ_{t=1}^{T} log f_{r_t},   (4)

where V refers to the whole vocabulary and f is a function that tries to predict the words within the target response in a non-autoregressive way:

f = softmax(W_2 h_z + b_2) ∈ R^{|V|},   (5)

where h_z is the final hidden state of the latent variable and |V| is the vocabulary size. f_{r_t} denotes the estimated probability of word r_t. As compared with the NLL loss, the BOW loss discards the order of words and forces the latent variable to capture the global information of the target response.
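The BOW loss of Equations (4)-(5) can be sketched as follows, with shapes and padding handling as assumptions: a single softmax over the vocabulary, computed from h_z, scores every word of the target response regardless of order.

```python
# Minimal sketch of the bag-of-words loss (Eqs. 4-5).
import torch.nn.functional as F

def bow_loss(h_z, W2, b2, target_ids, pad_id=0):
    """h_z: (batch, D); W2: (|V|, D); b2: (|V|,); target_ids: (batch, T)."""
    log_probs = F.log_softmax(h_z @ W2.t() + b2, dim=-1)   # (batch, |V|), non-autoregressive
    token_log_probs = log_probs.gather(1, target_ids)      # log f_{r_t} for each target word
    mask = (target_ids != pad_id).float()                  # ignore padding positions
    return -(token_log_probs * mask).sum(dim=1).mean()
```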

Response Selection
Response selection helps distinguish whether the response is relevant to the dialogue context and consistent with the background knowledge. Meanwhile, its score can be regarded as an indicator of coherence during dialogue generation, helping to select the most coherent response from multiple candidates. In particular, the training of response selection is carried out together with the bi-directional encoding network of latent act recognition. The positive training samples come from the dialogue context and the corresponding target response (c, r), with label l_r = 1. The negative samples are created by randomly selecting responses from the corpus (c, r^-), with label l_{r^-} = 0. The binary cross-entropy loss of response selection is defined as follows:

L_RS = - log p(l_r = 1 | c, r) - log p(l_{r^-} = 0 | c, r^-)   (6)

The probability is estimated through one fully-connected layer, with the final hidden state of the special mask token fed as input:

p(l_r = 1 | c, r) = sigmoid(W_3 h_{[M]} + b_3)   (7)

To sum up, the total objective of our pre-training model is to minimize the integrated loss:

L = L_NLL + L_BOW + L_RS   (8)
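Putting Equations (6)-(8) together, the following sketch (with W_3 and b_3 as assumed parameter names) computes the response selection loss on one positive and one negative pair and sums the three losses into the integrated objective.

```python
# Minimal sketch of the response selection loss (Eqs. 6-7) and the total objective (Eq. 8).
import torch
import torch.nn.functional as F

def rs_loss(h_mask_pos, h_mask_neg, W3, b3):
    """h_mask_*: (batch, D) final hidden states of [M] for the true pair (c, r)
    and a randomly sampled negative pair (c, r-); W3: (1, D); b3: (1,)."""
    logits_pos = h_mask_pos @ W3.t() + b3   # pre-sigmoid score for p(l_r = 1 | c, r)
    logits_neg = h_mask_neg @ W3.t() + b3
    loss_pos = F.binary_cross_entropy_with_logits(logits_pos, torch.ones_like(logits_pos))
    loss_neg = F.binary_cross_entropy_with_logits(logits_neg, torch.zeros_like(logits_neg))
    return loss_pos + loss_neg

# total pre-training objective, given the three losses
# loss = loss_nll + loss_bow + loss_rs
```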

Pre-training Procedure
Our pre-training model contains 12 transformer blocks, with its network parameters initialized using BERT_BASE. Large-scale conversation datasets - Twitter (Cho et al., 2014) and Reddit (Zhou et al., 2018; Galley et al., 2019) - are employed for pre-training, resulting in 8.3 million training samples in total. Each training sample of context and target response (c, r) passes through the network twice to accomplish the tasks of latent act recognition and response generation. The pre-training steps are summarized as follows:
1) Latent Act Recognition - Given a pair of context and target response, estimate the posterior distribution p(z|c, r); randomly select a negative response r^- and calculate L_RS.
2) Response Generation - With the sampled latent value z ∼ p(z|c, r), calculate L_NLL and L_BOW.
3) Optimization - Sum up to obtain L, and update network parameters with back-propagation.
The hyper-parameters used in pre-training are listed as follows. The maximum sequence lengths of context and response are set to 256 and 50, respectively. The number of transformer blocks L is 12 and the hidden embedding dimension D is 768. The batch size is set to 64 and K is set to 20 for the discrete latent variable. The Adam optimizer (Kingma and Ba, 2015) is employed with a learning rate of 5e-5. The pre-training of dialogue generation was carried out on 8 Nvidia Tesla V100 32GB GPU cards for 3.5M steps, taking approximately two weeks to reach convergence.
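The three pre-training steps above can be sketched as a single training iteration; the helper methods on `model` are hypothetical placeholders for the recognition and generation passes described in this section, not an actual released API.

```python
# Minimal sketch of one pre-training step: two passes through the shared network
# per (context, response) pair, followed by optimization of the summed loss.
import torch

def train_step(model, batch, optimizer):
    # 1) latent act recognition: bi-directional pass over [M], context, response
    h_mask_pos, h_mask_neg = model.encode_recognition(batch)     # hypothetical helpers
    posterior = model.latent_posterior(h_mask_pos)                # p(z|c, r)
    loss_rs = model.rs_loss(h_mask_pos, h_mask_neg)

    # 2) response generation conditioned on a sampled latent value
    z = torch.multinomial(posterior, num_samples=1).squeeze(-1)   # z ~ p(z|c, r)
    logits, h_z = model.decode_generation(batch, z)
    loss_nll = model.nll_loss(logits, batch["response_ids"])
    loss_bow = model.bow_loss(h_z, batch["response_ids"])

    # 3) optimization over the summed objective L = L_NLL + L_BOW + L_RS
    loss = loss_nll + loss_bow + loss_rs
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```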

Fine-tuning and Inference
Our pre-trained model is flexible enough to support various kinds of dialogues, including chit-chat, knowledge grounded conversations, conversational question answering and so on. Fine-tuning on small conversation datasets can be carried out following the training objectives defined in Equation (8). Once the fine-tuning process reaches convergence, the response towards a given context can be obtained through the following inference procedure:
1) Candidate Response Generation - Conditioned on each latent value z ∈ [1, K], generate the corresponding candidate response r.
2) Response Selection - Calculate the probability p(l_r = 1|c, r) for each candidate and select the one with the highest value as the final response.
It is worth noting that the above fine-tuning and inference procedures are set up for dialogue generation without any specific objectives. If there exists a specific objective within the conversation, such as letting both participants know more about each other (Bao et al., 2019), the fine-tuning can proceed to maximize the pre-defined rewards with reinforcement learning (RL). Under such circumstances, our discrete latent variable can be naturally treated as the action within RL, and thus the response selection can be straightforwardly solved by selecting the action that results in the maximum reward.
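The inference procedure can be sketched as the following loop over latent values; `generate` and `coherence_score` are hypothetical helper names standing in for decoding conditioned on z and for the response selection probability p(l_r = 1|c, r).

```python
# Minimal sketch of inference: one candidate per latent value, then pick the
# candidate with the highest coherence score from response selection.
def infer(model, context, K=20):
    candidates = []
    for z in range(K):
        r = model.generate(context, latent=z)         # candidate response for this latent value
        score = model.coherence_score(context, r)     # p(l_r = 1 | c, r)
        candidates.append((score, r))
    best_score, best_response = max(candidates, key=lambda x: x[0])
    return best_response
```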

Datasets
To evaluate the performance of our proposed method, comprehensive experiments have been carried out on three publicly available datasets: Persona-Chat, Daily Dialog and DSTC7-AVSD. Persona-Chat is a knowledge grounded conversation dataset, where persona profiles serve as background knowledge; Daily Dialog is a chit-chat dataset of daily conversations. DSTC7-AVSD, short for Audio Visual Scene-aware Dialog of the DSTC7 challenge, is a conversational question answering dataset, where the system needs to generate an answer given the dialogue context and background knowledge. There are two available options of knowledge utilization: 1) using single-modal information of text only, including the video's caption and summary; 2) relying on multi-modal information, including text, audio and visual features. The single-modal option is adopted by our method in the experiments. The descriptions and statistics of these datasets are summarized in Table 1.

Compared Methods
The following models have been compared in the experiments.
Baseline. Sequence-to-sequence with attention (Seq2Seq) (Vinyals and Le, 2015) is employed as the baseline for the experiments on Persona-Chat and Daily Dialog. DSTC7-AVSD provides a baseline system, which is built upon hierarchical recurrent encoders with multi-modal features. State of the art. The Persona-Chat dataset is also utilized in the ConvAI2 challenge (Dinan et al., 2019a), where the team Lost in Conversation (LIC) (Golovanov et al., 2019) obtains the best performance. LIC is also a transformer-based generation method, fine-tuned upon the pre-trained GPT model (Radford et al., 2018). For Daily Dialog, the best results are reported by the recently developed method iVAE_MI (Fang et al., 2019), which generates diverse responses with sample-based latent representations. In DSTC7-AVSD, the team of CMU (Sanabria et al., 2019) obtains the best performance across all evaluation metrics.
Our method. To better analyze the effect of the discrete latent variable in our method, we also compare to the version without the latent variable (Our w/o Latent), under the same training settings: its network parameters are also first initialized with BERT_BASE, pre-training is then carried out on Reddit and Twitter with the objective of minimizing the NLL loss, and fine-tuning on the downstream datasets follows the same objective.

Evaluation Metrics
Both automatic and human evaluations are employed to assess the performance of the compared methods. In automatic evaluation, the following metrics are included:
• BLEU (Chen and Cherry, 2014) measures the n-gram overlap between the generated response and the target response.
• Distinct-1/2 (Li et al., 2016) measures the generation diversity, defined as the number of distinct uni- or bi-grams divided by the total number of generated words.
• Knowledge R/P/F1 (Dinan et al., 2019b) measures the degree of informativeness w.r.t. background knowledge, defined as:

Recall = |W_G ∩ W_K| / |W_K|,  Precision = |W_G ∩ W_K| / |W_G|,  F1 = 2 · Recall · Precision / (Recall + Precision),

where W_G and W_K refer to the sets of non-stop words in the generated responses and the background knowledge, respectively.
• In DSTC7-AVSD, the MSCOCO platform (Chen et al., 2015) is employed for evaluation. It compares the generated response with six ground truth responses, using the metrics of BLEU, METEOR, ROUGE-L and CIDEr.
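For reference, Distinct-n and Knowledge R/P/F1 can be computed as below, following the definitions above; tokenization and the stop-word list are left as assumptions.

```python
# Minimal sketch of the Distinct-n and Knowledge R/P/F1 metrics.
def distinct_n(responses, n):
    """Number of distinct n-grams divided by the total number of generated words.
    `responses` is a list of token lists."""
    ngrams, total_words = set(), 0
    for tokens in responses:
        total_words += len(tokens)
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(ngrams) / max(total_words, 1)

def knowledge_rpf1(generated_tokens, knowledge_tokens, stop_words=frozenset()):
    """Recall/Precision/F1 over the non-stop words shared with the background knowledge."""
    w_g = set(generated_tokens) - stop_words
    w_k = set(knowledge_tokens) - stop_words
    overlap = len(w_g & w_k)
    recall = overlap / max(len(w_k), 1)
    precision = overlap / max(len(w_g), 1)
    f1 = 2 * recall * precision / max(recall + precision, 1e-8)
    return recall, precision, f1
```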
In human evaluation, we randomly select 100 dialogue contexts and generate responses with the compared methods. Three crowd-sourcing workers are asked to score the response quality on a scale of [0, 1, 2] from four aspects - fluency, coherence, informativeness and overall. The higher the score, the better. Details about the criteria are given as follows.
• Fluency measures whether the generated sentence is smooth and grammatically correct.
• Coherence evaluates whether the generated response is relevant to the context and consistent with the expressed information or background knowledge.
• Informativeness assesses the amount of information contained in the generated response.
• Overall represents the general evaluation, where 0 indicates a bad response, 1 corresponds to a normal response and 2 stands for a good response.
After collecting the assessments from the three crowd-sourcing workers, the final score of each response is determined via majority voting. The average Fleiss's kappa (Fleiss and Cohen, 1973) on Persona-Chat and Daily Dialog is 0.515 and 0.480 respectively, indicating that annotators have reached moderate agreement.

Experimental Results
The experimental results on Persona-Chat and Daily Dialog with automatic and human evaluations are summarized in Table 2. In automatic evaluation, BLEU-1/2 measures the overlap between the generated response and the ground truth, Distinct-1/2 assesses the diversity of words in generation, and Knowledge R/P/F1 evaluates the information expression w.r.t. background knowledge. However, the results demonstrate that no method can consistently outperform the others under automatic evaluation. As shown in the empirical study (Liu et al., 2016), there is a weak correlation between automatic metrics and human judgments in open-domain dialogue generation. As such, it is suggested to treat these automatic evaluations as a reference and to put emphasis on human evaluations.
In human evaluations, our method obtains consistently better performance across all the metrics on Persona-Chat and Daily Dialog. The scores of fluency almost approach the upper bound, revealing that our generated responses are very fluent. The informativeness assessments indicate that the information in our generated responses is significantly richer than that of the baseline methods. Our responses are coherent with the context and favored most by the crowd-sourcing workers. The ablation study with our method and Our w/o Latent also suggests that, through the incorporation of discrete latent variables, remarkable improvements can be achieved for dialogue generation. Besides, it can be observed that the generation quality of transformer-based approaches (LIC and our method) is significantly better than that of RNN-based methods (Seq2Seq and iVAE_MI).
The experimental results on DSTC7-AVSD with automatic evaluation are provided in Table 3. Distinct from the above chit-chat datasets, there are six ground truth responses in DSTC7-AVSD, which makes the automatic evaluation more effective and better aligned with human judgments. In the experiments, our response selection is strengthened with an extra ranking step, which ranks the candidates according to the automatic scores and selects the top one as the final answer. The results in Table 3 demonstrate that our method brings a new breakthrough for DSTC7-AVSD. Additionally, the upper bound of our method is also reported, under the ideal scenario that the optimal candidate answer can be selected. These remarkable results validate the great potential of our approach.

Case Analysis
To further dissect the quality of our pre-trained model, several examples of generated responses are provided in Table 4. For each piece of context, our model can produce multiple responses by assigning distinct values to the latent variable and five candidate responses are selected for display in the table. It shows that our pre-trained model is able to generate diverse and appropriate responses. Interestingly, as the training corpus includes conversations from Reddit threads, some URLs may interweave with dialogue utterances. It seems that this pattern has been captured by the latent variable and sometimes our model generates related Wikipedia links as the reply.
Table 5 provides cases of our method and the compared approaches on Persona-Chat, where two participants chat with each other according to their personas. As shown in the example, participant P2 needs to produce a response towards the given dialogue context, conditioned on his/her persona profile. The baseline Seq2Seq tends to generate common replies with low informativeness and poor coherence. LIC and Our w/o Latent are able to produce some coherent responses, but remain deficient in informativeness. In comparison, the response by our method is not only coherent with the context, but also expressive of the background personas.

Comparison of Pre-trained Models
To further analyze the effectiveness of our pre-trained model, ablation studies have been conducted on Daily Dialog. The compared methods include the baseline Seq2Seq, direct fine-tuning of BERT, GPT-2 (Radford et al., 2019) and our pre-trained model. Three different sizes of training dialogues are used: 1k, 5k and 11k (the total training data). The experimental results measured with perplexity are summarized in Table 6. These results demonstrate that our method consistently outperforms the baseline and the other pre-trained models, with lower perplexity across different training sets. Even with the low-resource setting of 1k conversations, our model can still obtain prominent performance.
Several interesting conclusions can also be drawn from these results. Firstly, the comparison between BERT and GPT-2 fine-tuning indicates that uni-directional pre-trained models are more suitable for dialogue generation. Secondly, our method obtains better performance than GPT-2, which may result from three aspects: 1) our pre-training is carried out with the datasets of Reddit and Twitter, which are closer to human conversations than general text; 2) in the pre-training, we adopt more flexible attention mechanisms to fully leverage the bi-directional and uni-directional information within the context and response; 3) our model effectively models the one-to-many relationship with the discrete latent variable, whose effect has been verified in Table 2.

Related Work
Our related work involves pre-trained language models and one-to-many modeling in dialogue generation.
Pre-trained Language Models. Pre-trained language models, which are trained with large-scale general text, have brought many breakthroughs on various NLP tasks. These models can be roughly divided into two categories according to their attention mechanisms. GPT (Radford et al., 2018) and GPT-2 (Radford et al., 2019) are representative uni-directional language models, where one token is only allowed to attend to its previous tokens and the objective is to maximize left-to-right generation likelihood. BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019) are bi-directional language models, where bi-directional context attention is enabled for token prediction. The unified language model UniLM (Dong et al., 2019) is able to support both uni- and bi-directional attention with flexible self-attention mask designs. Recently, some attempts (Golovanov et al., 2019) have been made to adapt the generative language models GPT or GPT-2 for dialogue generation. However, the special issues of conversations, such as the impact of background knowledge and the one-to-many relationship problem, are not fully considered or tackled in these adaptations.
One-to-many Modeling. Given one piece of context, there exist multiple appropriate responses, which is known as the one-to-many mapping problem. To model this one-to-many relationship, CVAE (Zhao et al., 2017) employs a Gaussian distribution to capture the discourse-level variations of responses. To alleviate the issue of posterior collapse in VAE, some extension approaches are further developed, including the conditional Wasserstein auto-encoder DialogWAE (Gu et al., 2019) and the implicit feature learning of iVAE_MI (Fang et al., 2019). Besides the continuous representations in VAE, discrete categorical variables are also utilized for interpretable generation. Additionally, multiple mapping modules as latent mechanisms have been introduced for diverse generation, where accurate optimization is carried out via posterior mapping selection. However, due to the small scale of annotated conversation data and the limited capacity of generation networks, it remains challenging for these methods to balance diversity and fluency during response generation.

Conclusion
A novel pre-training model for dialogue generation is introduced in this paper, incorporating latent discrete variables for one-to-many relationship modeling. To pre-train our model, two reciprocal tasks of response generation and latent act recognition are carried out simultaneously on large-scale conversation datasets. Our pre-trained model is flexible enough to handle various downstream tasks of dialogue generation. Extensive experiments have been carried out on three publicly available datasets, and the results demonstrate that our model obtains significant improvements over the other state-of-the-art methods.
Our work can potentially be improved with more fine-grained latent variables. We will also explore boosting the latent selection policy with reinforcement learning and extending our pre-training to support dialogue generation in other languages.