DIALOGPT : Large-Scale Generative Pre-training for Conversational Response Generation

We present a large, tunable neural conversational response generation model, DIALOGPT (dialogue generative pre-trained transformer). Trained on 147M conversation-like exchanges extracted from Reddit comment chains over a period spanning from 2005 through 2017, DialoGPT extends the Hugging Face PyTorch transformer to attain a performance close to human both in terms of automatic and human evaluation in single-turn dialogue settings. We show that conversational systems that leverage DialoGPT generate more relevant, contentful and context-consistent responses than strong baseline systems. The pre-trained model and training pipeline are publicly released to facilitate research into neural response generation and the development of more intelligent open-domain dialogue systems.


Introduction
We introduce DIALOGPT, a tunable gigaword-scale neural network model for generation of conversational responses, trained on Reddit data.
Recent advances in large-scale pre-training using transformer-based architectures (Radford et al., 2018; Devlin et al., 2019; Raffel et al., 2019) have achieved great empirical success. OpenAI's GPT-2 (Radford et al., 2018), for example, has demonstrated that transformer models trained on very large datasets can capture long-term dependencies in textual data and generate text that is fluent, lexically diverse, and rich in content. Such models have the capacity to capture textual data with fine granularity and produce high-resolution output that closely emulates real-world text written by humans.
DIALOGPT extends GPT-2 to address the challenges of conversational neural response generation. Neural response generation is a subcategory of text generation that shares the objective of generating natural-looking text (distinct from any training instance) that is relevant to the prompt. Modelling conversations, however, presents distinct challenges: human dialogue, which encapsulates the possibly competing goals of two participants, is intrinsically more diverse in the range of potential responses (Li et al., 2016a; Zhang et al., 2018; Gao et al., 2019a,b). It thus poses a greater one-to-many problem than is typical in other text generation tasks such as neural machine translation, text summarization and paraphrasing. Human conversations are also generally more informal and noisy, and, when in the form of textual chat, often contain informal abbreviations or syntactic/lexical errors.
Most open-domain neural response generation systems suffer from content or style inconsistency (Li et al., 2016b; Zhang et al., 2019; Gao et al., 2019c), lack of long-term contextual information (Serban et al., 2017), and blandness (Li et al., 2016a; Zhang et al., 2018; Qin et al., 2019). While these issues can be alleviated by modelling strategies specifically designed to boost information content, a transformer-based architecture like GPT-2 (Radford et al., 2018), which uses a multi-layer self-attentive mechanism to allow fully-connected cross-attention to the full context in a computationally efficient manner, seems like a natural choice for exploring a more general solution. Transformer models, for example, allow long-term dependency information to be better preserved across time (Radford et al., 2018), thereby improving content consistency. They also have higher model capacity due to their deep structure (up to 48 layers in GPT-2) and are more effective in leveraging large-scale datasets (more than 100 million training instances) than RNN-based approaches (Vaswani et al., 2017).
Like GPT-2, DIALOGPT is formulated as an autoregressive (AR) language model, and uses the multi-layer transformer as its model architecture. Unlike GPT-2, however, DIALOGPT is trained on large-scale dialogue pairs/sessions extracted from Reddit discussion chains. Our assumption is that this should enable DIALOGPT to capture the joint distribution of P(Target, Source) in conversational flow with finer granularity. In practice, this is what we observe: sentences generated by DIALOGPT are diverse and contain information specific to the source prompt, analogous to what GPT-2 generates for continuous text. We have evaluated the pre-trained model on a public benchmark dataset (DSTC-7), and on a new 6k multi-reference test dataset extracted from Reddit postings. DIALOGPT achieves state-of-the-art results in both automatic and human evaluation, lifting performance to near-human response quality.
We have released the source code and a pre-trained model to facilitate future research.

Dataset
The dataset is extracted from comment chains scraped from Reddit spanning 2005 through 2017. Reddit discussions can be naturally expanded as tree-structured reply chains, since a thread replying to one thread forms the root node of subsequent threads. We extract each path from the root node to a leaf node as a training instance containing multiple turns of dialogue.
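The root-to-leaf extraction described above can be sketched as follows. This is a minimal illustration assuming the reply tree is stored as an adjacency map from a comment id to the ids of its replies; the function name and representation are ours, not taken from the released pipeline.

```python
def root_to_leaf_paths(tree, node):
    """Enumerate every root-to-leaf path in a reply tree; each path becomes
    one multi-turn training instance. `tree` maps a comment id to the list
    of ids that replied to it; a node with no entry is a leaf."""
    children = tree.get(node, [])
    if not children:
        return [[node]]
    paths = []
    for child in children:
        for path in root_to_leaf_paths(tree, child):
            paths.append([node] + path)
    return paths
```

For example, a root comment with two replies, one of which has a further reply, yields two training instances (one per leaf).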
We filter the data by removing instances where (1) there is a URL in the source or target, (2) the target contains word repetitions of at least three words, (3) the response does not contain at least one of the top-50 most frequent English words (e.g., "the", "of", "a"), since this probably indicates it is not an English sentence, (4) the response contains special markers such as "[" or "]", as this could be markup language, (5) source and target sequences together are longer than 200 words, or (6) the target contains offensive language, identified by phrase matching against a large blocklist. We also excluded a large number of subreddits that had been identified as likely to contain offensive content. In addition, we aggressively filtered out blandness, e.g., removing instances where 90% of the tri-grams in the response have been seen more than 1000 times. Often uninformative, such responses account for about 1% of the data. After filtering, the dataset comprises 147,116,725 dialogue instances, in total 1.8 billion words. (Code and models: https://github.com/microsoft/DialoGPT; blog: https://aka.ms/dialogpt. Our model is also available via Hugging Face Transformers: https://huggingface.co/microsoft/DialoGPT-medium.)
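A minimal sketch of filtering rules (1)-(6) as a single predicate. This is illustrative only: the word lists below are tiny placeholders, the real top-50 list and offensive-phrase blocklist are much larger, and rule (2) is interpreted here as three identical words in a row.

```python
import re

# Placeholder subsets (assumptions, not the paper's actual lists).
TOP_50 = {"the", "of", "a", "and", "to", "in", "is", "it", "you", "that"}
BLOCKLIST = {"offensiveword1", "offensiveword2"}

def keep_instance(source: str, target: str) -> bool:
    """Return True if a (source, target) pair survives the filtering heuristics."""
    text = source + " " + target
    # (1) drop instances containing a URL
    if re.search(r"https?://|www\.", text):
        return False
    tgt_words = target.lower().split()
    # (2) drop targets with the same word repeated three or more times in a row
    for i in range(len(tgt_words) - 2):
        if tgt_words[i] == tgt_words[i + 1] == tgt_words[i + 2]:
            return False
    # (3) require at least one top-50 frequent English word in the response
    if not TOP_50.intersection(tgt_words):
        return False
    # (4) drop responses containing markup-like special markers
    if "[" in target or "]" in target:
        return False
    # (5) drop overly long source+target sequences
    if len(text.split()) > 200:
        return False
    # (6) drop targets matching the offensive-phrase blocklist
    if BLOCKLIST.intersection(tgt_words):
        return False
    return True
```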

Model Architecture
We trained our DIALOGPT model on the basis of the GPT-2 (Radford et al., 2018) architecture. The GPT-2 transformer model adopts the generic transformer language model (Vaswani et al., 2017) and leverages a stack of masked multi-head self-attention layers to train on massive web-text data. The text generated, either from scratch or based on a user-specific prompt, is realistic-looking. The success of GPT-2 demonstrates that a transformer language model is able to characterize human language data distributions at a fine-grained level, presumably due to its large model capacity and superior efficiency.
Our model inherits from GPT-2 (Radford et al., 2018): a 12-to-48 layer transformer with layer normalization, a modified initialization scheme that accounts for model depth, and byte pair encodings (Sennrich et al., 2016) for the tokenizer. We follow OpenAI GPT-2 in modeling a multi-turn dialogue session as a long text and framing the generation task as language modeling. We first concatenate all dialogue turns within a dialogue session into a long text x_1, · · · , x_N (N is the sequence length), ended by the end-of-text token. Denoting the source sentence (dialogue history) as S = x_1, · · · , x_m and the target sentence (ground truth response) as T = x_{m+1}, · · · , x_N, the conditional probability P(T|S) can be written as the product of a series of conditional probabilities:

p(T|S) = ∏_{n=m+1}^{N} p(x_n | x_1, · · · , x_{n-1})

Our implementation is based on the open-source PyTorch transformer repository.
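The session-to-sequence framing and the autoregressive factorization can be illustrated with a toy stand-in for the model. This is a sketch under stated assumptions: whitespace tokenization instead of BPE, and `uniform_lm` is a placeholder scoring function, not DialoGPT.

```python
import math

EOS = "<|endoftext|>"  # end-of-text token appended to each session

def session_to_sequence(turns):
    """Concatenate all dialogue turns in a session into one token sequence
    ending with the end-of-text token (whitespace tokenization for brevity)."""
    tokens = []
    for turn in turns:
        tokens.extend(turn.split())
    tokens.append(EOS)
    return tokens

def log_p_target_given_source(lm, tokens, m):
    """log p(T|S) = sum over n = m+1..N of log p(x_n | x_1, ..., x_{n-1}),
    where the first m tokens form the source S (dialogue history)."""
    return sum(math.log(lm(tokens[:n], tokens[n]))
               for n in range(m, len(tokens)))

# Toy stand-in 'language model': uniform over a tiny vocabulary.
VOCAB = ("hello", "there", "hi", EOS)
def uniform_lm(context, next_token):
    return 1.0 / len(VOCAB)
```

With a 4-token vocabulary and a 2-token target, the toy log-probability is exactly 2·log(1/4), matching the product factorization above.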

Mutual Information Maximization
Open-domain text generation models are notorious for generating bland, uninformative samples. To address this problem, we implement a maximum mutual information (MMI) scoring function (Li et al., 2016a; Zhang et al., 2018). MMI employs a pre-trained backward model to predict source sentences from given responses, i.e., P(Source|Target). We first generate a set of hypotheses using top-K sampling. Then we use the probability P(Source|Hypothesis) to re-rank all hypotheses. Intuitively, maximizing backward model likelihood penalizes bland hypotheses, as frequent and repetitive hypotheses can be associated with many possible queries, thus yielding a lower probability for any specific query.
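The sample-then-re-rank loop can be sketched as follows. The forward and backward models here are hypothetical toy stand-ins (names and behaviors are ours): the "backward model" simply scores the bland canned response lower, mimicking the intuition that a bland response is compatible with many different sources.

```python
import random

def mmi_rerank(sample_fn, backward_logprob, source, n_samples=16):
    """Draw n_samples hypotheses from the forward model, then keep the one the
    backward model scores highest under log p(source | hypothesis)
    (equivalently, the one with the lowest backward model loss)."""
    hypotheses = [sample_fn(source) for _ in range(n_samples)]
    return max(hypotheses, key=lambda h: backward_logprob(source, h))

# Toy stand-ins for illustration only.
CANDIDATES = ["i don't know", "george washington"]
def toy_sample(source):
    return random.choice(CANDIDATES)
def toy_backward_logprob(source, hypothesis):
    # bland responses fit many sources, so p(source | bland) is low
    return -5.0 if hypothesis == "i don't know" else -1.0
```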
We also attempted to optimize the reward R = P(Source|Hypothesis) using a policy gradient (Williams, 1992) with a sample-averaged baseline, following Zhang et al. (2018). The validation reward can be stably improved, but unlike training under an RNN architecture, we observed that reinforcement learning (RL) training easily converges to a degenerate locally-optimal solution, where the hypothesis simply repeats the source sentence (i.e., a parroting model) and mutual information is maximized. We hypothesize that transformers can become trapped in local optima due to their strong representation power. We leave the investigation of regularized RL training to future work.

Experimental Details
We trained 3 different sizes of the model, with total parameters of 117M, 345M and 762M respectively. The model specification follows Radford et al. (2018); models were trained on machines with GPUs connected via NVLink. We used the Noam learning rate scheduler with 16000 warm-up steps. The learning rate is selected based on validation loss. Each model is trained until there is no progress in validation loss. For the small and medium models, we trained for up to 5 epochs. For the large model, we trained for at most 3 epochs.
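The Noam schedule with 16000 warm-up steps can be written as below. This is a sketch of the standard schedule from Vaswani et al. (2017); the `d_model` default is illustrative, not a value taken from this paper.

```python
def noam_lr(step, d_model=1024, warmup=16000, scale=1.0):
    """Noam schedule: the learning rate grows linearly for `warmup` steps,
    then decays proportionally to 1/sqrt(step), peaking at step == warmup."""
    step = max(step, 1)  # guard against step 0
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The peak value is d_model^(-0.5) · warmup^(-0.5); validation loss then guides the choice of the overall `scale`.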

Speeding up training
To accelerate the training process and accommodate GPU memory limitations, we first compress all training data into a lazy-loading database file, so that data is loaded only when needed (pre-fetching large chunks to reduce access frequency). We also leverage separate asynchronous data processes to scale the training. As a result, training time declines approximately linearly w.r.t. the number of GPUs. We further employed a dynamic batching strategy to group conversations of similar lengths into the same batch, thus increasing training throughput.
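The length-grouping idea behind dynamic batching can be sketched as a simple greedy procedure. This is a simplified version under an assumed per-batch token budget (padding every sequence in a batch to the longest member); the released pipeline is more involved.

```python
def dynamic_batches(examples, max_tokens):
    """Group (id, length) examples of similar length so each padded batch
    stays under a token budget: sort by length, then fill batches greedily.

    Batch cost is approximated as batch_size * longest_sequence_in_batch,
    since shorter members are padded to the longest one."""
    batches, current, current_max = [], [], 0
    for ex_id, length in sorted(examples, key=lambda x: x[1]):
        new_max = max(current_max, length)
        if current and (len(current) + 1) * new_max > max_tokens:
            batches.append(current)          # budget exceeded: flush
            current, current_max = [], 0
            new_max = length
        current.append(ex_id)
        current_max = new_max
    if current:
        batches.append(current)
    return batches
```

Sorting first means wildly different lengths rarely share a batch, so little compute is wasted on padding.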

DSTC-7 Dialogue Generation Challenge
The DSTC (Dialog System Technology Challenges) 7 track (Galley et al., 2019) is an end-to-end conversational modeling task, in which the goal is to generate conversation responses that go beyond chitchat by injecting information that is grounded in external knowledge. This task is distinct from what is commonly thought of as goal-oriented, task-oriented, or task-completion dialogs in that there is no specific or predefined goal (e.g., booking a flight, or reserving a table at a restaurant). Instead, it targets human-like interactions where the underlying goal is often ill-defined or unknown in advance, of the kind seen in work and other productive environments (e.g., brainstorming meetings) where people share information.
The DSTC-7 test data contains conversation threads from Reddit data. In order to create a multi-reference test set, we utilized conversation sessions that contain 6 or more responses. Given other filtering criteria such as turn length, this yields a 5-reference test set of size 2208. (For each instance, one of the 6 human responses is set aside to assess human performance on this task.) Note that our training data is collected from a different time span than the test set.
We performed automatic evaluation using standard machine translation metrics, including BLEU (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007), and NIST (Doddington, 2002). NIST is a variant of BLEU that weights n-gram matches by their information gain, i.e., it indirectly penalizes uninformative n-grams. We also use Entropy (Zhang et al., 2018) and Dist-n (Li et al., 2016a) to evaluate lexical diversity. More details are provided in Galley et al. (2019).
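Of these metrics, Dist-n has the simplest definition: the number of distinct n-grams divided by the total number of n-grams across the generated responses. A straightforward sketch (whitespace tokenization assumed):

```python
def distinct_n(responses, n):
    """Dist-n lexical-diversity measure: distinct n-grams / total n-grams
    over a collection of generated responses."""
    total, distinct = 0, set()
    for resp in responses:
        tokens = resp.split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        distinct.update(grams)
    return len(distinct) / total if total else 0.0
```

A model that keeps emitting the same bland phrase scores near 0; fully diverse output scores 1.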
We compared DIALOGPT with our in-house competitive sequence-to-sequence model PERSONALITYCHAT, based on Li et al. (2016a) and trained on Twitter data, which has been used in production as a Cognitive Service for Microsoft Azure (Project PERSONALITYCHAT: https://docs.microsoft.com/en-us/azure/cognitive-services/project-personality-chat/overview). Table 2 summarizes the automatic evaluation results. DIALOGPT with 345M parameters and beam search achieved the highest automatic scores across most metrics. Scores for DIALOGPT with 345M parameters are better across the board than with 117M parameters. Beam search (with beam width 10) dramatically improves BLEU and DIST scores, and marginally improves NIST and METEOR. Note that our model is fine-tuned on source-target pairs, and does not leverage grounding information from the DSTC training set. Presumably, the model learns background information during pre-training and is unhindered by the lack of a grounding document.
The automatic scores of DIALOGPT are higher than those for humans. This does not mean that the generation is more "realistic" than human, but is probably attributable to the one-to-many nature of conversation. As illustrated in Figure 1, multiple human responses (R1-R4) can correspond well to a source utterance. Without loss of generality, suppose R1-R3 are the "ground truth" references that will be tested on, while R4 is the "held-out" human response that serves to compute a "human" score. In semantic space, a generated response Rg from a well-trained model will presumably tend to lie in the vicinity of the geometric center of the references. This may be close to the geometric mean of all training instances, thus "averaging out" these instances. Consequently, a generated response Rg might have a lower "semantic distance" (manifested in higher automatic scores like BLEU) from R1-R3 than the targeted human response R4.
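This "averaging" effect can be illustrated numerically with a toy 2-D "semantic space" (the coordinates are made up purely for illustration): a response sitting near the center of the scored references R1-R3 is, on average, closer to them than the held-out human response R4 is.

```python
import math

def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def centroid(points):
    """Arithmetic mean of a list of 2-D points."""
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))

# Hypothetical embeddings: R1-R3 are the references scored against,
# R4 is the held-out human response, Rg sits near the references' center.
R1, R2, R3, R4 = (0.0, 2.0), (2.0, 0.0), (-2.0, 0.0), (0.0, -2.0)
Rg = centroid([R1, R2, R3])

# Average distance to the references: Rg beats the held-out human R4,
# just as a centroid-like generation beats R4 on reference-based metrics.
avg_rg = sum(dist(Rg, r) for r in (R1, R2, R3)) / 3
avg_r4 = sum(dist(R4, r) for r in (R1, R2, R3)) / 3
```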

A New Reddit Multi-reference Dataset
We further evaluate DIALOGPT on a multi-reference test set with 6K examples. The results are shown in Table 3. We test our method in two settings: training from scratch, and fine-tuning using GPT-2 as the pre-trained model. In both settings, a larger model consistently outperforms a smaller one. Comparing training from scratch to fine-tuning from the pre-trained GPT-2 model, fine-tuning yields larger performance gains for the smaller models. Again, the best system, DIALOGPT (345M, w/ beam search), scores higher on BLEU than humans. The larger models trained from scratch (345M and 762M) perform comparably to those fine-tuned from GPT-2.

Re-ranking The Response Using MMI
We perform mutual information maximization as described in Section 3.2. Specifically, we generate 16 samples for each input source sentence by using top-K sampling (K = 10) with the 345M model fine-tuned from the GPT-2 medium model. This is followed by a re-ranking step using a backward model, which is also a 345M model fine-tuned from the GPT-2 medium model. The response that yields the lowest backward model loss is selected for evaluation. The results are summarized in the second-to-last row of Table 3. Compared with greedy generation, MMI re-ranking produces more diverse responses, with higher NIST, METEOR, Entropy and Dist scores, but with a slight drop in BLEU.

Table 3:
6K Reddit multi-reference evaluation. "Beam" denotes beam search. "Human" represents the held-out ground truth reference.

Generation Examples
We provide sample generated dialogues in Table 4 (addressing commonsense questions), Table 5 (interactive multi-turn chat) and Table 6 (a self-playing bot with a user prompt). The output is based on top-K sampling. (Owing to occasional toxic outputs, we are unable to release the generation code at this time.) Interestingly, our model exhibits the ability to address commonsense questions to some extent, presumably owing to the rich amount of information that can be learned from Reddit data. In some cases, instead of giving the "desired" answer, the system generates an alternative, reasonable answer. Our observations suggest that the system is able to deal with multi-turn generation better than an RNN counterpart and tends to be more consistent with respect to context (Table 5). Example exchanges from Table 4:

Source: who is the first president of the United States?
Response: George Washington

Source: what is the boiling point of water?
Response: I think it's about 212 F.

Source: which one is bigger, sun or moon?
Response: The sun.

Source: which animal has black and white stripes?
Response: A black and white striped cat.


Human Evaluation

We also conducted a human evaluation using crowd-sourced judges on test sources from the Reddit 6K test dataset. System outputs were paired and presented to judges, who ranked them for relevance, informativeness and how human-like the response is, using a 3-point Likert-like scale. Judges were required to pass a qualification test, and a regime of spam detection was imposed. Overall judge preferences for relevance, informativeness and human-likeness, presented as raw numbers and a percentage of the total, are shown in Table 7. A strong preference can be observed for DialoGPT over PersonalityChat. Table 7 also suggests that the "vanilla" DialoGPT medium model may already be close to human response quality. Unexpectedly, we found that judges may prefer the MMI variant over human responses, probably because many of the true human responses are erratic or idiosyncratic, or are tied to internet memes that happened to be unfamiliar to the judges. (See Section 4.2 for the conditions underlying this effect.) Further details, including a test of significance and the human evaluation template used, are provided in the Appendix.

Related work
There are several open-sourced toolkits for large-scale pre-trained transformer models. The Hugging Face Conv-AI transfer learning repository (Wolf et al., 2019) contains code for training conversational AI systems with transfer learning based on the GPT-2 transformer language model, and achieves state-of-the-art performance in the ConvAI-2 dialogue competition. DLGnet (Olabiyi and Mueller, 2019) is a large transformer model trained on dialogue datasets that achieves good performance in multi-turn dialogue generation. AllenNLP (Gardner et al., 2018) is developed as a toolkit for many natural language processing tasks, including the large-scale pre-trained bi-LSTM sentence representation learning framework ELMo (Peters et al., 2018). Texar (Hu et al., 2018) focuses on text generation, including style transfer and controllable generation.

Conclusion
We have released an open-domain pre-trained model, DIALOGPT, trained on a massive real-world Reddit dataset. The package consists of a distributed training pipeline and several pre-trained models that can be fine-tuned to obtain a conversation model on a moderately-sized customized dataset in a few hours. DIALOGPT is fully open-sourced and easy to deploy, allowing users to extend the pre-trained conversational system to bootstrap training using various datasets. It serves as a building block for novel applications and methodologies. Detection and control of toxic output will be a major focus of future investigation. We will investigate leveraging reinforcement learning to further improve the relevance of the generated responses and prevent the model from generating egregious responses.
Source: I would like to report a break-in.
R1: Was anything stolen?
R2: Is anyone hurt or injured?
R3: I will send someone right away.
R4: Is the perpetrator still inside?
Rg: When was this break-in?

Figure 1 :
Figure 1: A generated response can surpass a human response in automatic metrics. Example responses are from Gupta et al. (2019).

Table 2 :
DSTC evaluation. "Team B" is the winning system of the DSTC-7 challenge. "Beam" denotes beam search. "Human" represents the held-out ground truth reference.

Table 4 :
Addressing commonsense questions

Table 5 :
An interactive example of multi-turn dialogue

Table 6 :
An example of multi-turn self-playing dialogue with user prompt

Despite our efforts to minimize the amount of overtly offensive data prior to training, DIALOGPT retains the potential to generate output that may trigger offense. Output may reflect gender and other historical biases implicit in the data. Responses generated using this model may exhibit a propensity to express agreement with propositions that are unethical, biased or offensive (or the reverse, disagreeing with otherwise ethical statements). These are known issues in current state-of-the-art end-to-end conversation models trained on large naturally-occurring datasets. A major motive for releasing DIALOGPT is to enable researchers to investigate these issues and develop mitigation strategies. In no case should inappropriate content generated as a result of using DIALOGPT be construed to reflect the views or values of either the authors or Microsoft Corporation.

Table 7 :
Results of Human Evaluation for relevance, informativeness and human-response possibility, showing preferences (%) for our model (DialoGPT) vis-à-vis its variants and real human responses. Distributions skew towards DialoGPT with MMI, even when compared with human outputs. Numbers in bold indicate the preferred systems. Statistically significant results are indicated: * p ≤ 0.01, ** p ≤ 0.001, *** p ≤ 0.0001, **** p ≤ 0.00001.