StyleDGPT: Stylized Response Generation with Pre-trained Language Models

Generating responses following a desired style has great potential to extend the applications of open-domain dialogue systems, yet is hindered by the lack of parallel data for training. In this work, we explore this challenging task with pre-trained language models, which have brought breakthroughs to various natural language tasks. To this end, we introduce a KL loss and a style classifier into the fine-tuning step in order to steer response generation towards the target style at both the word level and the sentence level. Comprehensive empirical studies on two public datasets indicate that our model can significantly outperform state-of-the-art methods in terms of both style consistency and contextual coherence.


Introduction
With advances in neural machine learning (Sutskever et al., 2014; Gehring et al., 2017; Vaswani et al., 2017) and the availability of huge amounts of human conversations on social media, there has been significant progress in building open-domain dialogue systems with natural language generation techniques. Though neural generative models are notorious for replying with bland responses, some very recent work demonstrates that response generation models learned with pre-training techniques (Radford et al., 2019) can effectively overcome the deficiencies of previous models and are capable of having smooth conversations with humans through reasonable and specific replies (Wolf et al., 2019; Zhang et al., 2019b).
The compelling performance exhibited by pre-trained dialogue models encourages us to explore more difficult yet important problems in conversational AI. In this work, we study stylized response generation: responses provided by a model should not only be coherent with the conversation context, but also consistent with a designated style. Such research could help developers customize their dialogue systems in terms of response style, and thus broaden the applications of these systems, from a social companion (Shum et al., 2018) or a virtual assistant (Ram et al., 2018) to a variety of vertical scenarios such as customer service (requiring a polite style), virtual characters in games (requiring specific personas), assistants in specific domains (requiring domain knowledge), etc. Normally, a target style is specified by a non-conversational corpus (e.g., novels, news, blogs, etc.) apart from the paired dialogue corpus (Luan et al., 2017; Niu and Bansal, 2018; Gao et al., 2019). Thus, the major challenge of the task lies in the scarcity of paired data for learning the correspondence between conversation contexts and proper responses in the desired style, which has been a key factor in the success of the neural dialogue models developed so far. As a result, it is very likely that a response either digresses from the context of the current dialogue (Luan et al., 2017; Gao et al., 2019), or loses fidelity to the target style (Niu and Bansal, 2018).
We consider addressing the challenge by taking advantage of large-scale pre-trained language models. The basic idea is that deep neural language models learned from huge amounts of text, such as GPT-2 (Radford et al., 2019) and DialoGPT (Zhang et al., 2019b), have packed enough style knowledge into their parameters (Dathathri et al., 2020), and thus by simply steering the decoding distribution towards the desired style, we can obtain both contextual coherence and style consistency. Following this idea, we build a response generation model on top of a pre-trained language model and devise both a word-level loss and a sentence-level loss to fine-tune the pre-trained model towards the target style. The word-level loss regularizes the likelihood of response generation with a KL divergence term between the probability of dialogues and the probability of stylized language estimated by fine-tuning a pre-trained language model on the style corpus, while the sentence-level loss maximizes the likelihood that a response given by the pre-trained response generation model is classified as a sentence matching the target style. We employ a Gumbel trick to overcome the obstacle in back-propagation caused by the discrete nature of natural language when optimizing the sentence-level loss. The final response is selected by a sample-and-rank strategy to further enhance relevance with regard to the dialogue context and fidelity with regard to the target style. We name our model STYLEDGPT, standing for "Stylized DialoGPT". Empirical studies are conducted on two tasks: arXiv-style response generation and Holmes-style response generation, with the data shared in Gao et al. (2019), where responses in the style of scientific papers and in the style of the Sherlock Holmes novels are pursued for a given context, respectively. Besides the style intensity used in Gao et al. (2019), we further examine style consistency from both a lexical perspective and a syntactic perspective with two new metrics.
Evaluation results on both automatic metrics and human judgment indicate that our model can significantly outperform state-of-the-art methods. The code is available at https://github.com/TobeyYang/StyleDGPT.
Our contributions are three-fold: (1) proposal of tackling the problem of stylized response generation with pre-trained language models; (2) proposal of a word-level objective and a sentence-level objective in fine-tuning of a pre-trained language model for the task; and (3) empirical verification of the effectiveness of the proposed method on public datasets.

Related Work
Open-domain Dialogue Generation has received more and more attention in the NLP community. Inspired by neural machine translation, early works apply sequence-to-sequence models to this task and achieve promising results (Ritter et al., 2011; Shang et al., 2015; Vinyals and Le, 2015). Since then, various architectures have been proposed to address the key challenges in open-domain dialogue systems, including suppressing generic responses (Zhao et al., 2017; Xing et al., 2017a), context modeling (Serban et al., 2016; Xing et al., 2017b; Zhang et al., 2019a), controlling the attributes of responses (Xu et al., 2019; Zhou et al., 2017; Zhang et al., 2018a; Wang et al., 2018; See et al., 2019), and incorporating different types of knowledge into generation (Li et al., 2016; Zhang et al., 2018b; Zhou et al., 2017; Zhao et al., 2020). In this work, we study the problem of stylized response generation, which aims to incorporate style information from non-parallel data into the generation process.
Stylized Text Generation has attracted broad interest in recent years, especially style transfer, which aims to alter one or more attributes of a text while preserving its content. A prevalent idea in unsupervised style transfer is learning to separate the "content" and "style" of text and manipulating the style to induce transfer at inference time (Fu et al., 2018; John et al., 2019). However, some works show that such disentanglement can neither be achieved nor is necessary, and instead leverage techniques like reconstruction and back-translation introduced in unsupervised machine translation (Lample et al., 2018), as well as Transformer-based models, to achieve unsupervised style transfer. Different from style transfer, stylized response generation requires that the response be coherent with its context, while the content may vary. Akama et al. (2017) first train a basic model on a large-scale dialogue corpus and then fine-tune the model with a small stylized corpus. Niu and Bansal (2018) propose three weakly-supervised methods to generate polite responses using non-parallel data. Gao et al. (2019) build a structured latent space shared between conversation modeling and style transfer. However, limited by the sparsity of the latent space, it is difficult to balance style and contextual coherence while sampling in the neighborhood of the latent code of a context at inference time.
Pre-training Methods have led to remarkable success in various NLP tasks, demonstrating great capabilities in language understanding and text generation (Radford et al., 2018, 2019; Devlin et al., 2019; Yang et al., 2019; Liu et al., 2019; Conneau and Lample, 2019; Clark et al., 2020). Recently, pre-training methods have also been used to tackle key challenges in dialogue systems, such as context representation (Mehri et al., 2019), response selection (Henderson and Su, 2019), knowledge-grounded response generation (Zhao et al., 2020), and personalized response generation (Zheng et al., 2019). In particular, large-scale pre-trained open-domain dialogue systems (Zhang et al., 2019b; Adiwardana et al., 2020) take a large step towards human-like chatbots, in contrast to previous systems that rely on complex frameworks developed over many years. On this basis, we propose to study open-domain stylized response generation with pre-trained models in this work.

Problem Formalization
Suppose that we have a dialogue corpus D_conv = {(X_i, Y_i)}, where X_i is a conversation context and Y_i a response to X_i, and a style corpus D_style, where ∀S_i ∈ D_style, S_i is a piece of text in the target style S. We do not assume that there exist pairs {(X, Y)} with Y expressed in the style S, and D_style can be collected from text in an arbitrary style (e.g., scientific papers, novels, etc.). Our goal is to learn a generation model P(Y|X, S) with both D_conv and D_style, so that given a new context X, one can generate a response Y that properly replies to X in the style S.

Approach
We employ DialoGPT (Zhang et al., 2019b) as the general response generation model P(Y|X), and bias P(Y|X) towards the language distribution P(S) estimated from D_style during fine-tuning. Below, we first briefly review OpenAI GPT-2 (Radford et al., 2019) and DialoGPT, which serve as the backbone of our model. Then, we introduce two learning objectives, from a word perspective and a sentence perspective respectively, to incorporate the style S into response generation.

Backbone Networks
GPT-2 is a large Transformer-based generative model pre-trained with a language modeling objective (Radford et al., 2019). Given a sequence X = (x_0, · · · , x_n), the generative probability p(X) can be factorized as the product of conditional probabilities over the tokens (Jelinek, 1980; Bengio et al., 2003):

p(X) = ∏_{t=1}^{n} p(x_t | x_0, · · · , x_{t−1}),    (1)

p(x_{t+1} | x_0, · · · , x_t) = softmax(W_o o_{x_{t+1}}),    (2)

where E ∈ R^{d_e×|V|} is the word embedding matrix with d_e the embedding dimension and |V| the vocabulary size, x*_t ∈ R^{|V|} is a one-hot vector corresponding to token x_t (so E x*_t is the input representation of x_t), o_{x_{t+1}} ∈ R^{d_c} is the hidden state at step t with d_c the hidden size, and W_o ∈ R^{|V|×d_c} is a parameter matrix that maps the hidden state o_{x_{t+1}} to a logit vector of size |V|. At inference time, x_{t+1} is predicted following p(x_{t+1} | x_0, · · · , x_t). Moreover, GPT-2 can also be used for language understanding; we exploit this capability later to build the style classifier.

DialoGPT is a large conversational response generation model trained on 147M conversation-like exchanges from the Reddit community (Zhang et al., 2019b). It inherits from GPT-2 and frames response generation as language modeling. For a context-response pair (X, Y), a special token |endoftext| is appended at the end of each dialogue turn, and then all turns are concatenated into a long sequence. Let M denote the length of the context sub-sequence and (x_0, · · · , x_{M−1}, · · · , x_N) the dialogue sequence after concatenation; the conditional generation probability of the response Y is then defined as:

p(Y | X) = ∏_{t=M}^{N} p(x_t | x_0, · · · , x_{t−1}).    (3)
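The two factorizations above can be illustrated with a toy stand-in for the Transformer. This is a minimal sketch, not the actual implementation: `next_token_probs` is a placeholder (here a uniform distribution) where GPT-2 would run a forward pass, and the tiny vocabulary is purely illustrative.

```python
import math

# Toy vocabulary; the real models use GPT-2's 50,257-token BPE vocabulary.
VOCAB = ["<eos>", "hello", "there", "world"]

def next_token_probs(prefix):
    """Stand-in for softmax(W_o o_{x_{t+1}}): return a distribution over
    VOCAB given the tokens so far. A real model runs a Transformer here;
    a uniform distribution keeps the sketch self-contained."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

def sequence_log_prob(tokens):
    """GPT-2 factorization: log p(X) = sum_t log p(x_t | x_0, ..., x_{t-1})."""
    return sum(math.log(next_token_probs(tokens[:t])[tokens[t]])
               for t in range(len(tokens)))

def response_log_prob(context, response):
    """DialoGPT conditional log p(Y|X): concatenate the turns with the
    <eos>/|endoftext| separator, then sum token log-probabilities over
    response positions only (t = M, ..., N)."""
    sequence = context + ["<eos>"] + response
    m = len(context) + 1  # length M of the context sub-sequence
    return sum(math.log(next_token_probs(sequence[:t])[sequence[t]])
               for t in range(m, len(sequence)))
```

With a uniform stand-in, each response token contributes log(1/|V|), which makes the factorization easy to verify by hand.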

Response Style Controlling
Word-Level Objective encourages the pre-trained response generation model P(Y|X) (i.e., DialoGPT) to pick words expressing the desired style S in decoding. Specifically, we train a language model P(S) with D_style on the basis of GPT-2 and use it as a regularizer to drive P(Y|X) towards P(S). The intuition is that if a response Y is not consistent with the style S, it will receive high perplexity under P(S) (i.e., Y is far from the language space of S). Furthermore, P(S) not only provides an overall evaluation of the style fidelity of a response Y, but also assigns a probability distribution over the vocabulary at each step, and thus provides word-level information about which words should be promoted in generation.
For each (X, Y ) ∈ D conv , we denote p Y = (p y 1 , · · · , p ym ) (m is the length of Y ) as the nextword distributions of Y given by P (Y |X). Meanwhile, we feed Y into P (S) and obtain the nextword distributionsp Y = (p y 1 , · · · ,p ym ). Then the word-level objective is formulated as: where d(p Y p Y ) could be any metrics measuring the distance between p Y andp Y . Here, we specify d(· ·) as the Kullback-Leibler (KL) divergence.
At each step, L_w modifies the next-word distribution in the direction of P(S), increasing the probabilities of words that express the desired style S and thereby encouraging the selection of these words at inference time.
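The word-level loss can be computed from the two models' per-step next-word distributions. The snippet below is a minimal, framework-free sketch (function names are illustrative); in practice both sets of distributions come from Transformer forward passes over the same gold response Y.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def word_level_loss(dialogue_dists, style_dists):
    """L_w: average KL(p_{y_t} || ptilde_{y_t}) over the m response steps.

    dialogue_dists: per-step next-word distributions of Y under P(Y|X).
    style_dists:    per-step next-word distributions of Y under P(S).
    Both models score the same response Y, so steps align one-to-one."""
    m = len(dialogue_dists)
    return sum(kl_divergence(p, q)
               for p, q in zip(dialogue_dists, style_dists)) / m
```

When the dialogue model already matches the style language model at every step, the loss is zero; any mismatch on style-bearing words contributes a positive penalty.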
Sentence-level Objective modifies P(Y|X) towards the target style S from a syntactic and semantic perspective. In training, we hope that a response matching style S has more impact in guiding the optimization of P(Y|X) towards the desired direction. To this end, we first train a discriminative model P(S|X) to predict whether an input sequence X matches the style S. Formally, given an input sequence X = (x_0, · · · , x_n), the probability is defined as:

P(S|X) = sigmoid(W_s · average_pooling(o_X) + b_s),    (5)

where o_X = (o_{x_1}, · · · , o_{x_{n+1}}) are the representations of X encoded by GPT-2, average_pooling(·) denotes the average pooling layer whose i-th element ô_i averages the i-th dimensions of (o_{x_1}, · · · , o_{x_{n+1}}), and W_s and b_s are the parameters of the classification layer. (The ratio of positive to negative examples is 1:5 in our experiments.) The sentence-level objective then maximizes the probability that a response Ŷ sampled from P(Y|X) is classified as matching the style S:

L_s = −E_{Ŷ∼P(Y|X)} [log P(S|Ŷ)].    (6)
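The average-pooling-plus-sigmoid classifier described above can be sketched as follows. This is a simplified illustration with hypothetical names (`style_probability`, `w`, `b`): the GPT-2 encoder is replaced by precomputed hidden states, and the classification head is a plain linear layer.

```python
import math

def style_probability(hidden_states, w, b):
    """P(S|X): average the encoder hidden states o_X over positions
    (average pooling), then apply a linear + sigmoid classification head.

    hidden_states: list of d-dimensional vectors (o_{x_1}, ..., o_{x_{n+1}}).
    w, b:          weight vector and bias of the classification layer."""
    d = len(hidden_states[0])
    # Average pooling: the i-th element averages the i-th dimensions.
    pooled = [sum(h[i] for h in hidden_states) / len(hidden_states)
              for i in range(d)]
    logit = sum(wi * oi for wi, oi in zip(w, pooled)) + b
    return 1.0 / (1.0 + math.exp(-logit))
```

With a zero weight vector the head is uninformative and outputs 0.5; training fits w and b so that stylized sentences score close to 1.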
L_s regularizes the output of the generation model by ascending the probability given by the discriminative model P(S|X), which is similar to the optimization of the generator in GANs (Goodfellow et al., 2014). The challenge is that since Ŷ is discrete, it is impossible to backpropagate through sampling from P(Ŷ|X). Although this can be circumvented with reinforcement learning (RL) algorithms (Sutton et al., 2000), the performance was not satisfactory in our experiments. In this work, we instead use the Gumbel trick (Jang et al., 2016) to tackle the challenge. At step t, instead of sampling a token from p(x_{t+1} | x_0, · · · , x_t), the input vector of step t+1 is obtained by:

x*_{t+1} = softmax((log p(x_{t+1} | x_0, · · · , x_t) + g) / τ),    (7)

where g is a vector of i.i.d. samples drawn from the Gumbel(0, 1) distribution and τ is the temperature; as τ → 0, x*_{t+1} ∈ R^{|V|} becomes a one-hot vector.

Training Objective. The two objectives presented above are able to drive P(Y|X) to generate responses in the desired style S, but on their own they quickly result in irrelevant responses, as both of them focus only on responses. To overcome this, we preserve the negative log-likelihood (NLL) loss of DialoGPT to maintain the relevance between the context and the response:

L_NLL = −log P(Y|X) = −∑_{t=M}^{N} log p(x_t | x_0, · · · , x_{t−1}).    (8)

The final training loss is the weighted sum of the word-level loss, the sentence-level loss, and the relevance loss:

L = λ_w L_w + λ_s L_s + λ_NLL L_NLL,    (9)

where λ_w, λ_s, and λ_NLL are three weight scalars.
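The Gumbel-softmax relaxation can be sketched in a few lines. This standalone version uses Python's `random` module in place of a tensor library, so it illustrates the forward computation only; the differentiability argument, which is the point of the trick, requires an autograd framework such as PyTorch.

```python
import math
import random

def gumbel_softmax(logits, tau=0.1, rng=random):
    """Differentiable surrogate for sampling a token from softmax(logits).

    Each logit is perturbed with i.i.d. Gumbel(0, 1) noise g = -log(-log u),
    u ~ Uniform(0, 1), and the result is pushed through a tempered softmax.
    As tau -> 0 the output approaches a one-hot vector, so the sentence-level
    loss can backpropagate through the "sampled" token."""
    noisy = [(l - math.log(-math.log(rng.random() + 1e-20) + 1e-20)) / tau
             for l in logits]
    # Numerically stable softmax over the perturbed, tempered logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]
```

The returned vector is a valid distribution over the vocabulary; at low temperature it is sharply peaked, mimicking a discrete sample while remaining continuous in the logits.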
Sampling and Ranking. Because non-stylized responses can still be generated at inference time, we employ the sample-and-rank decoding strategy following Gao et al. (2019). First, we sample N independent candidate responses for each context using top-k sampling with temperature T. Then, we re-rank them in terms of both relevance and style intensity and select the candidate with the highest score as the final response. The score of a candidate Y_i for context X is defined as the weighted sum:

s(Y_i, X) = log p(Y_i | X) + β log p(S | Y_i),    (10)

where p(Y_i|X) measures the relevance of Y_i with regard to X, p(S|Y_i) is the style intensity of Y_i given by the discriminative model P(S|X), and β is a hyper-parameter balancing the two terms.
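A minimal sketch of the ranking step follows. The weighted-sum form of the score and the function name are illustrative assumptions, not the exact released implementation; in practice the relevance term comes from the generation model and the style term from the discriminator.

```python
import math

def rank_candidates(candidates, relevance_log_probs, style_probs, beta=1.0):
    """Sample-and-rank: score each of the N sampled candidates by combining
    relevance log p(Y_i|X) with style intensity p(S|Y_i), weighted by beta,
    and return the highest-scoring candidate.

    candidates:          list of N candidate response strings.
    relevance_log_probs: log p(Y_i|X) for each candidate.
    style_probs:         p(S|Y_i) in (0, 1) for each candidate."""
    def score(i):
        # Small epsilon guards against log(0) for fully off-style candidates.
        return relevance_log_probs[i] + beta * math.log(style_probs[i] + 1e-20)
    best = max(range(len(candidates)), key=score)
    return candidates[best]
```

Setting beta = 0 recovers pure relevance ranking; larger beta trades relevance for style intensity.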

Datasets
In order to verify the effectiveness of our model, we experiment on two tasks: generating arXiv-style and Holmes-style responses. The statistics of the datasets are summarized in Table 1. The datasets are constructed following the pipeline in Gao et al. (2019). The style corpus D_style for the arXiv-style response generation task consists of ~1M sentences extracted from the LaTeX source code of papers posted on arXiv.org from 1998 to 2002. For the Holmes-style response generation task, D_style contains ~38k sentences built from ebooks of the Sherlock Holmes novel series downloaded from Gutenberg.org. Both tasks share the same conversation dataset D_conv, which consists of 10M context-response pairs extracted from user posts and comments on Reddit.com during the year 2011. The validation set D_val and the test set D_test are constructed by filtering the Reddit data of 2013 with the classifier in Gao et al. (2019) (intensity score > 0.4). As Gao et al. (2019) do not release their test data, nor specify the size of the test set, we randomly select 2k/2k samples as the validation/test sets, where each context has at least 4 responses.

Evaluation Methodology
We compare different models with both automatic metrics and human judgment.
Automatic Metrics. For automatic evaluation, we measure the quality of generated responses from three aspects: Style Consistency, Relevance, and Diversity. Relevance is measured with BLEU (Papineni et al., 2002) and Rouge (Lin, 2004).[7] To evaluate diversity, we follow Li et al. (2015) and use Distinct-1 (Dist-1) and Distinct-2 (Dist-2), calculated as the ratios of distinct unigrams and bigrams in responses, respectively. In terms of style consistency, existing work only measures style intensity with classifiers (Gao et al., 2019). However, the style of text is an amalgam, and differences between two styles are reflected in multiple linguistic dimensions (Verma and Srinivasan, 2019). Thus, we propose to evaluate the style of a response from three perspectives: (1) Intensity: we report the scores from the discriminative model P(S|X).[8] (2) Lexical: a word-level metric that measures the distance between two lexical distributions. We first build a lexicon with all the n-grams (n = 1, 2, 3, 4) from D_conv and D_style (i.e., the Reddit, arXiv, and Holmes corpora). To reduce noise, n-grams that occur fewer than 10 times are filtered out, leaving 1,346,175 distinct n-grams. The lexical distributions of a model and of the target style can then be represented as normalized 1,346,175-dimensional vectors, with each element the frequency of the corresponding n-gram in the generated responses (over the test set) and in D_style, respectively. Finally, we calculate the Jensen-Shannon divergence (Fuglede and Topsoe, 2004) between the two vectors.
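The lexical metric can be sketched as follows. `ngram_distribution` and `js_divergence` are illustrative names; the paper's setting corresponds to `max_n=4` with `min_count=10` over the much larger Reddit, arXiv, and Holmes corpora.

```python
import math
from collections import Counter

def ngram_distribution(sentences, max_n=4, min_count=1):
    """Normalized frequency distribution over all 1- to max_n-grams,
    dropping n-grams rarer than min_count to reduce noise."""
    counts = Counter()
    for sent in sentences:
        tokens = sent.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    counts = Counter({g: c for g, c in counts.items() if c >= min_count})
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two sparse distributions:
    JS(p, q) = 0.5 KL(p || m) + 0.5 KL(q || m), with m the mixture."""
    def kl(a, b):
        return sum(v * math.log2(v / b[g]) for g, v in a.items() if v > 0)
    support = set(p) | set(q)
    m = {g: 0.5 * (p.get(g, 0.0) + q.get(g, 0.0)) for g in support}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical distributions score 0 and fully disjoint ones score 1 (in base 2), so lower values mean the model's lexical choices are closer to the style corpus.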
(3) Syntactic: a sentence-level metric. Motivated by Feng et al. (2012), we characterize the style of text by the ratio of the following 5 syntactic types: (a) simple; (b) compound; (c) complex; (d) complex-compound; (e) others. The type of a sentence is determined by the algorithm proposed by Feng et al. (2012), which relies on the PCFG tree parsed by Stanford CoreNLP.[9] We compute the syntactic-type distributions of the style corpus and of the responses generated by the models, and report the Jensen-Shannon divergence.
Human Evaluation. We recruit 3 well-educated native speakers as annotators to compare our model with each of the baselines. Each annotator checks one context with two responses at a time, with one response from our model and the other from a baseline model.

[7] Both metrics are computed by scripts of a public NLG evaluation project available at https://github.com/Maluuba/nlg-eval. [8] The evaluation is more accurate than that from the classifiers available at https://github.com/golsun/StyleFusion/tree/master/classifier because of the capability of GPT-2. [9] https://stanfordnlp.github.io/CoreNLP

Baselines
We compare our model with the following baselines: (1) MTask: a vanilla multi-task learning model proposed by Luan et al. (2017), trained with both D_conv and D_style. We use the implementation by Gao et al. (2019) included in the project https://github.com/golsun/StyleFusion. (2) S2S+LM: the fusion model proposed by Niu and Bansal (2018) that merges the decoder of a seq2seq model trained on D_conv with a language model trained on D_style by weighted-averaging the word distributions at inference time. We use the code published at https://github.com/WolfNiu/polite-dialogue-generation.
(3) StyleFusion: the regularized multi-task learning model proposed by Gao et al. (2019), which builds a structured latent space to bridge conversation modeling and style transfer. The model is jointly learned with D_conv and D_style. We run the code released at https://github.com/golsun/StyleFusion with default settings. (4) DialoGPT: an open-domain pre-trained response generation model built upon GPT-2 that attains performance close to that of humans (Zhang et al., 2019b). We use the 345M fine-tuned model, which can be downloaded from https://github.com/microsoft/DialoGPT.

Implementation Details
Our models are implemented with the Huggingface Transformers library. To balance cost and effect, the language model P(S) and the discriminative model P(S|X) are built upon GPT-2 (117M) with 12 layers and 768 hidden units. The embedding layer and the Transformer module are shared between the two models, and we only optimize the parameters of the projection layer and the classification layer, respectively. We choose DialoGPT (345M), which has 24 layers and 1024 hidden units, as the basis of STYLEDGPT. In both tasks, we use the vocabulary published along with GPT-2 by OpenAI, which contains 50,257 tokens. The temperature τ of the Gumbel softmax is set to 0.1. Hyper-parameters are selected via grid search, and λ_w/λ_s/λ_NLL are finally set to 0.0005/0.05/1 for the arXiv-style response generation task and 0.005/0.05/1 for the Holmes-style response generation task, respectively. All models are trained with the Adam optimizer (Kingma and Ba, 2015) (β_1 = 0.9, β_2 = 0.999) with a learning rate of 5 × 10^−7. We choose k = 40 and T = 1.0 in top-k decoding following (Radford et al., 2019; Adiwardana et al., 2020). At inference time, all approaches, including our model and the baselines, generate 50 candidates for each context (i.e., N = 50), and the top candidate is selected for evaluation according to Equation (10).

Evaluation Results
Automatic Evaluation. Table 2 reports the evaluation results on automatic metrics. Without any complicated manipulation of latent spaces, STYLEDGPT outperforms the non-pre-trained baselines by large margins on all metrics in both tasks, demonstrating the advantage of pre-training over the state-of-the-art methods in stylized response generation. The significant improvement over vanilla DialoGPT on style consistency indicates that STYLEDGPT can effectively leverage the extra objectives and bias response decoding towards the desired style. Moreover, forcing responses towards a particular style (i.e., the arXiv style or the Holmes style) appears to also help relevance, though at a sacrifice in diversity. This is because the search space in decoding becomes more concentrated on words that can express the target styles.[11]
Human Evaluation. Table 3 reports the results of human evaluation. The values of kappa are all above 0.6, indicating substantial agreement among the three annotators. We can see that STYLEDGPT outperforms all non-pre-trained baselines on the three aspects, which echoes the results of automatic evaluation. Specifically, S2S+LM performs poorly on fluency because the weighted average of the token distributions predicted by the language model and the seq2seq decoder harms their language modeling abilities, which also leads to low relevance. Compared to DialoGPT, STYLEDGPT significantly improves style consistency while achieving comparable performance on relevance and informativeness, which demonstrates the effectiveness of the proposed objectives in fine-tuning.

[11] Note that the human responses used for calculating the relevance metrics are biased towards the target styles according to a style classifier.

Discussions
Ablation Study. To understand the roles of L_w, L_s, and L_NLL in learning to generate stylized responses, we remove them one at a time from the full objective in Equation (9) and check the performance of the resulting variants of STYLEDGPT on the test sets. Table 4 reports the evaluation results. We can see that (1) all three objectives are useful, as removing any of them causes a performance drop on some metrics; (2) L_w is more important to lexical consistency while L_s is more important to syntactic consistency, which echoes our motivation in designing the two objectives; and (3) without L_NLL, the model is misled by the style corpus and loses the connection with conversation contexts.

Figure 1: Trajectories of ablated STYLEDGPT on the validation set of arXiv-style response generation.
Since L_w, L_s, and L_NLL are coordinated in the learning of STYLEDGPT, more insight into the effect of the objectives can be obtained by checking the trajectories of the variants on the validation set, as illustrated by Figure 1. Without L_s, there is a steady and significant improvement on style intensity but dramatic drops on BLEU1, RougeL, and Dist-2 (compared with the model without both L_s and L_w), which indicates that L_w provides stronger guidance on style expression than L_s. On the other hand, comparing STYLEDGPT w/o L_w and STYLEDGPT w/o L_w & L_s, we find that L_s gradually and moderately improves style intensity and relevance with only a little hurt to diversity. Finally, when L_NLL is removed, the model quickly forgets conversation contexts and converges to the style language model. The full model balances the effects of the three losses and attains both style consistency and contextual coherence, though it suffers a drop in diversity due to L_w.
Impact of the Sampling Number N . To understand how the sample-and-rank strategy affects model performance, we evaluate STYLEDGPT and StyleFusion by varying the sampling number N in {1, 10, 30, 50} on both tasks. Figure 2 shows the results. We observe that (1) style intensity is more sensitive to the value of N than other metrics; (2) though the two models are comparable in terms of style intensity when N = 1, STYLEDGPT can exhibit the desired styles with fewer samples; (3) STYLEDGPT is always better than StyleFusion on Dist-2, thanks to DialoGPT; and (4) while STYLEDGPT is able to attain both style consistency and contextual coherence with enough samples, it is difficult for StyleFusion to balance the two aspects, as when N increases, both BLEU1 and RougeL drop. This is because when sampling in the neighborhood of the representation of a context in the structured latent space, reaching a stylized but less relevant point becomes easier when the number of samples increases.
Case Study. Finally, we conduct a qualitative analysis with the examples given in Table 5 and Table 6. First, we find that the non-pre-trained models can occasionally generate interesting responses (e.g., "the ring of fire" and "the first harry potter movie." in Table 6), but they are not always coherent with the contexts, nor specific. In contrast, responses from the pre-trained models (i.e., DialoGPT and STYLEDGPT) are highly coherent with the contexts and rich in content. In terms of style consistency, responses given by STYLEDGPT are more stylized, containing more words specific to the target style. For example, "constant state" and "quantum fluctuations" are concepts in physical science, "the behavior" of a "species" is the research object of ethology, and "accident", "investigate", "discover" and "trap" are topical words in the Sherlock Holmes novel series.

Table 5: The example outputs on the arXiv-style response generation task.
— i think that's a cute cat.
StyleFusion: i'm guessing he's a male or female.
DialoGPT: He was thinking of lubing his ears.
STYLE-DGPT: This is a very good point and i also would like to discuss some other differences in the behavior of this species.
Human: some sort of inner monologue.

Table 6: The example outputs on the Holmes-style response generation task.

Context: What is your favorite short story?
MTask: the first time i saw this i thought it was a good idea to put a ring on it.
S2S+LM: the ending is a good thing and the most awesome thing i've done
StyleFusion: the ring of fire
DialoGPT: The Last Question by David Foster Wallace
STYLE-DGPT: The one about the day before the accident, about an old lady who used to live in a shack.
Human: The multitude of short stories that my wife writes

Context: If your username was a movie, what would be the plot?
MTask: the first harry potter movie.
S2S+LM: there's a lot of things about the movie.
StyleFusion: it's a trap!
DialoGPT: Probably The Prestige
STYLE-DGPT: a story of a mad scientist who goes in to investigate something and discovers he's trapped in a cave
Human: two lovers escape to the great northwest

Conclusions
We exploit pre-trained language models for the stylized response generation task. To incorporate style information from non-parallel data into the generation model, we propose two learning objectives, at the word level and the sentence level, that steer the output distribution towards the desired style. Evaluation results on arXiv-style and Holmes-style response generation tasks indicate the effectiveness of the proposed approach.