Plug-and-Play Conversational Models

There has been considerable progress towards conversational models that generate coherent and fluent responses; however, this often involves training large language models on large dialogue datasets, such as Reddit. These large conversational models provide little control over the generated responses, and this control is further limited in the absence of annotated conversational datasets for attribute-specific generation that could be used for fine-tuning the model. In this paper, we first propose and evaluate plug-and-play methods for controllable response generation, which do not require dialogue-specific datasets and do not rely on fine-tuning a large model. While effective, the decoding procedure induces considerable computational overhead, rendering the conversational model unsuitable for interactive usage. To overcome this, we introduce an approach that requires no further computation at decoding time and no fine-tuning of a large language model. Through extensive automatic and human evaluation, we demonstrate a high degree of control over the generated conversational responses with respect to multiple desired attributes, while remaining fluent.


Introduction
Large pre-trained language models (Peters et al., 2018; Raffel et al., 2019) have greatly improved the state-of-the-art in many downstream tasks. These language models are trained using the simple log-likelihood objective over large amounts of unlabeled data (e.g., Wikipedia articles). This approach results in large, powerful language models that produce coherent text and can be used to perform unconditional language generation. However, they provide little control over the text generated. In the context of conversational models, See et al. (2019) show that being able to control the response generation can have a significant impact on the quality of conversations. However, controlled generation from these large conversational models remains a challenge, and is particularly difficult in the absence of annotated conversational datasets.

* Equal Contribution. † Work done primarily at Caltech. Code available at https://github.com/andreamad8/PPCM

Table 1: Example of controllable response generation. DGPT is the DialoGPT (Zhang et al., 2019) response, while the others are generated using different attribute models.

HUMAN 1: Hmmmm. I really want to try bull riding. Do you have any interest in that?
HUMAN 2: I'd love to try! Can we schedule something for next week?
DGPT: Yes! Let's get some stuff going.
SPORTS: Ok how about Wednesday at 7pm central? I'll be there with a buddy of mine who is a bull rider.
HUMAN 1: Sure! What does your Saturday look like?
For large language models, controlled generation has recently received increased attention. In CTRL (Keskar et al., 2019), the language model is trained to generate based on a control code presented to the model at the start of the context. In Ziegler et al. (2019), GPT-2 (Radford et al., 2019) is fine-tuned using reinforcement learning with human annotators in the loop to generate continuing text with positive sentiment. Both of these approaches require learning or fine-tuning all of the model's parameters, and new desired attributes cannot be easily incorporated into the generation once the models have been trained. Other approaches that do not alter the language model, but instead modify the decoding procedure for controlled generation, include 1) re-weighting the output distribution using discriminators (Holtzman et al., 2018) or bags of words (Ghazvininejad et al., 2017; See et al., 2019; Baheti et al., 2018), and 2) perturbing the model's activations with an attribute model (PPLM) (Dathathri et al., 2019). These approaches are plug-and-play methods in that they can be used on top of any existing pre-trained language model. They do not modify or train the parameters of the original model, and they can achieve performance comparable to fine-tuning methods (Dathathri et al., 2019). Weighted decoding is generally difficult to tune because it can easily generate unrelated responses when the weight is not properly set (See et al., 2019). On the other hand, PPLM (Dathathri et al., 2019) incurs a high computational cost during the decoding stage, which is problematic for online systems such as dialogue systems.
Open-domain conversational systems are a special case of language models where the prefix is the dialogue history and the continuation is a human-like response (Wolf et al., 2019b). Recently, large pre-trained language models trained on unlabeled human-to-human conversations (i.e., Reddit) (Zhang et al., 2019; Adiwardana et al., 2020; Roller et al., 2020) have shown excellent performance in modelling human responses. Similarly, the output of large pre-trained conversational models cannot be directly controlled without re-training or fine-tuning the model from scratch, which is practically inconvenient and sometimes impossible, since few or no conversational datasets exist for certain attributes or styles.
On the other hand, plug-and-play methods are a viable solution, since they do not require dialogue-specific datasets and can be applied online on top of existing pre-trained models. A major drawback, however, is their high computational cost (Dathathri et al., 2019) at decoding time. This is acceptable for language models, where generating paragraphs or stories can be done offline, but it is problematic for online systems such as conversational models. In this paper, we explore the approach from Dathathri et al. (2019) (PPLM) in large pre-trained dialogue models for controlling the style and topic of the responses without fine-tuning on any dialogue-specific dataset. Moreover, to cope with the computational cost at decoding time, we propose to first generate style/topic-consistent responses with PPLM (Dathathri et al., 2019) and then use them to optimize residual adapters (Houlsby et al., 2019) that directly learn how to steer the original distribution towards the selected attribute.
With our extensive automatic and human evaluation, we empirically demonstrate that plug-and-play methods are effective in controlling the response while being computationally efficient. To summarize, our key contributions are:
• we show the effectiveness of plug-and-play methods in large pre-trained conversational models using a variety of styles and topics, such as Positive, Negative, Question, Sport, and Business/Finance, without using any dialogue-specific dataset;
• we propose to use residual adapters (Houlsby et al., 2019), which add less than 1.5% task-specific parameters per style/topic, to make controllable response generation viable for online systems;
• we run a comprehensive automatic and human evaluation to show that plug-and-play methods can control the generated responses in terms of style and topic, without losing fluency;
• we carry out a thorough qualitative analysis on the difficulty of steering conversational models, highlighting current limitations and possible solutions.
Controlled Text Generation Recent methods for controlled generation include fine-tuning models using supervised learning (Peng et al., 2020; Subramani et al., 2019), reinforcement learning (Ziegler et al., 2019), adversarial training (Yu et al., 2017), pre-training models with control codes (Keskar et al., 2019; Ficler and Goldberg, 2017; Chan et al., 2020), and various other approaches (Zhang et al., 2020b; Sheng et al., 2020; Carbone and Sarti, 2020). Alternatively, weighted decoding, using either bags of words (Holtzman et al., 2018; Ghazvininejad et al., 2017; Baheti et al., 2018; See et al., 2019) or discriminators (Holtzman et al., 2018; Krause et al., 2020), does not require any fine-tuning. Similarly, Dathathri et al. (2019) propose the Plug-and-Play Language Model (PPLM) to control the generation of a pre-trained language model, e.g., GPT2, in terms of both the style and topic of the generated text. Finally, residual adapters (Houlsby et al., 2019) have been used to learn multiple language generation tasks (Lin et al., 2020) without fine-tuning the original model's parameters. Concurrent work compares the performance and tradeoffs of three existing controllable language generation methods on 200 possible styles.

Methodology
A dialogue consists of one or more alternating turns between two speakers. We define the dialogue history at turn t as D_t = {U_1, S_1, . . . , U_t}, where U_t is the user utterance and S_t is the system response. For simplicity, we overload D_t to denote the concatenation of sequences across turns, with a special token separating the turns. In this paper, we model the dialogue responses using a Transformer-based (Vaswani et al., 2017) language model (LM), by using the dialogue history D_t as a prefix and then generating the continuation S_t in an auto-regressive manner (Wolf et al., 2019c).
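As a minimal sketch of this overloading (the function name and separator handling are ours, not the paper's code; DialoGPT reuses GPT-2's end-of-text token between turns):

```python
# Illustrative sketch (not the paper's code): flatten a dialogue history
# D_t = {U_1, S_1, ..., U_t} into one sequence, separating turns with a
# special token. DialoGPT uses GPT-2's end-of-text token for this purpose.
SEP = "<|endoftext|>"

def flatten_dialogue(turns):
    """Concatenate alternating user/system utterances into a single LM prefix."""
    return SEP.join(turns) + SEP

history = ["Hi, how are you?", "Great, thanks! And you?", "Doing well."]
prefix = flatten_dialogue(history)
```

The model then conditions on `prefix` and generates the next system turn token by token.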
Causal Language Modeling Let us denote the concatenation of D_t and S_t as the sequence of tokens X = {x_0, . . . , x_n}. Then we can compute the language model distribution using the chain rule of probability (Bengio et al., 2003) as:

p(X) = ∏_{i=1}^{n} p(x_i | x_0, . . . , x_{i−1})    (1)

Following the notation of Dathathri et al. (2019), we define the transformer decoding process in a recursive manner. Let us define the matrix H_t as the key-value pairs from the dialogue history past, i.e., H_t = [(K_t^{(1)}, V_t^{(1)}), . . . , (K_t^{(l)}, V_t^{(l)})], where (K_t^{(i)}, V_t^{(i)}) are the key-value pairs of the i-th layer computed at all time steps from 0 to t. Then

o_{t+1}, H_{t+1} = LM(x_t, H_t),    (2)

and x_{t+1} is sampled from the distribution x_{t+1} ∼ p_{t+1} = Softmax(W o_{t+1}), where W is a linear transformation that maps the hidden state of the last layer, o_{t+1}, to a vector of vocabulary size. This efficient transformer implementation (Wolf et al., 2019a) leverages the cached memories to generate x_{t+1} without recomputing H_t.
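The recursion can be mimicked with a toy stand-in for the transformer; only the caching pattern and the Softmax(W o_{t+1}) sampling step are the point here, and everything else (shapes, the dummy LM) is invented for illustration:

```python
import numpy as np

# Toy sketch of o_{t+1}, H_{t+1} = LM(x_t, H_t): a dummy "LM" stands in for
# the transformer; the key-value cache H grows each step and is never recomputed.
rng = np.random.default_rng(0)
VOCAB, HID = 8, 4
W = rng.normal(size=(HID, VOCAB))      # maps the last hidden state to vocab logits

def lm_step(x_t, H_t):
    """One decoding step: return a hidden state and the extended cache."""
    o = rng.normal(size=HID)           # stand-in for the real forward pass
    return o, H_t + [x_t]              # cached past; nothing is recomputed

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

H, x, tokens = [], 0, []
for _ in range(5):
    o, H = lm_step(x, H)
    probs = softmax(o @ W)             # x_{t+1} ~ Softmax(W o_{t+1})
    x = int(rng.choice(VOCAB, p=probs))
    tokens.append(x)
```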

Plug-and-Play Language Models
PPLM (Dathathri et al., 2019) uses an attribute model (i.e., a classifier) for controlling the generated text. We denote the attribute model as p(a|X), where a is the specific desired attribute to optimize for (e.g., positivity) and X is the generated response so far. At every generation step t, PPLM perturbs the history matrix H_t in the direction of the sum of two gradients: i) to maximize the log-likelihood of the attribute a under the conditional attribute model p(a|X), and ii) to ensure high log-likelihood of the generated text under the unmodified conversational language model p(X). The gradient updates are restricted to H_t so as to preserve the original model parameters.
Let ∆H_t be the update to H_t that shifts the generated text towards possessing the desired attribute a, i.e., o_{t+1}, H_{t+1} = LM(x_t, H_t + ∆H_t). At the beginning of the generation, ∆H_t is initialized to zero, and it is updated using the gradients from the attribute model. Following Dathathri et al. (2019), we rewrite the attribute model p(a|X) as p(a|H_t + ∆H_t) and define the gradient update for ∆H_t as

∆H_t ← ∆H_t + α (∇_{∆H_t} log p(a | H_t + ∆H_t)) / ‖∇_{∆H_t} log p(a | H_t + ∆H_t)‖^γ    (3)

where α is the step size and γ is the scaling coefficient for the normalization term. Equation 3 is repeated p times, depending on how strongly we want the response to be conditioned on the attribute. We study the effect of the step size α and the number of iterations p on the generated text in detail in Section 6. Subsequently, the new H_t = H_t + ∆H_t is computed and a new token is generated using o_{t+1}, H_{t+1} = LM(x_t, H_t). The described optimization process is repeated for every token in the generated sequence. As aforementioned, to ensure fluency we also take a step towards minimizing the Kullback-Leibler (KL) divergence between the perturbed and the original distribution. In addition, we use Post-norm Geometric Fusion (Stahlberg et al., 2018; Dathathri et al., 2019) to avoid adversarial generation (Szegedy et al., 2013).
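To make the update of Equation 3 concrete, here is a toy sketch under the simplifying assumption of a logistic attribute model over a flat history vector; the weights, sizes, and hyper-parameter values are invented for illustration:

```python
import numpy as np

# Toy sketch of the PPLM update in Eq. (3): a logistic attribute model
# p(a|h) = sigmoid(w.h) over a flat "history" vector, so the gradient of
# log p(a|h) has a closed form. All values here are illustrative.
w = np.array([1.0, -2.0, 0.5])         # hypothetical attribute-model weights
H = np.array([0.1, 0.3, -0.2])         # stands in for the key-value history H_t

def log_p(h):
    """log p(a | h) under the toy logistic attribute model."""
    return float(np.log(1.0 / (1.0 + np.exp(-w @ h))))

def log_p_grad(h):
    """Gradient of log sigmoid(w.h) w.r.t. h."""
    p = 1.0 / (1.0 + np.exp(-w @ h))
    return (1.0 - p) * w

alpha, gamma, p_iters = 0.02, 1.0, 10   # step size, scaling coefficient, iterations
dH = np.zeros_like(H)                   # Delta H_t starts at zero
for _ in range(p_iters):
    g = log_p_grad(H + dH)
    dH = dH + alpha * g / (np.linalg.norm(g) ** gamma)  # normalized ascent step
```

After the loop, H + dH assigns the attribute a higher log-likelihood than the unperturbed H, which is exactly the direction the first PPLM gradient pushes in.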
Attribute Models In PPLM the authors propose two attribute models: bag-of-words and discriminators. In this paper, we focus on the latter, since discriminator-based attribute models do not require human-selected keywords. The discriminator is a linear classifier f trained on an annotated dataset of sentence-label pairs (x, y); note that these sentences do not necessarily need to be conversational responses, as in our case. For each sentence x of length t, we compute the set of hidden states o^x_{:t} from the LM, then we compute the mean (ō^t) across time, and finally we train f using the cross-entropy between the label distribution y and f(ō^t).
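The discriminator head can be sketched as follows, with random arrays standing in for the frozen LM's hidden states (all shapes and values are illustrative):

```python
import numpy as np

# Sketch of the discriminator attribute model: mean-pool the frozen LM's
# hidden states over time, then apply a linear classifier f.
rng = np.random.default_rng(1)
T, D, NUM_CLASSES = 6, 16, 3
hidden = rng.normal(size=(T, D))        # o^x_{:t}: one hidden state per token
Wf = rng.normal(size=(D, NUM_CLASSES))  # the only trainable parameters
bf = np.zeros(NUM_CLASSES)

o_bar = hidden.mean(axis=0)             # mean across time
logits = o_bar @ Wf + bf
probs = np.exp(logits - logits.max())   # softmax for the class distribution
probs /= probs.sum()
```

In training, `probs` would be compared to the gold label with cross-entropy, updating only `Wf` and `bf` while the LM stays frozen.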

Residual Adapters
Residual Adapters (Houlsby et al., 2019; Bapna and Firat, 2019) are trainable modules added on top of each transformer layer, which steer the output distribution of a pre-trained model without modifying the original weights. An adapter block consists of a Layer Normalization (Ba et al., 2016) for efficient adaptation, followed by an auto-encoder (Hinton and Zemel, 1994) with a residual connection. Formally, given the hidden representation at layer i, denoted as o^i_{:t} ∈ R^{t×d}, where d is the hidden size and t is the current generation step, the residual adapter computes:

Adapter_i(o^i_{:t}) = ReLU(LN(o^i_{:t}) W^i_E) W^i_D + o^i_{:t}

where W^i_E and W^i_D are trainable parameters of dimensions d × m and m × d, respectively, and LN(·) denotes layer normalization. The bottleneck dimension m is a tunable hyper-parameter that allows adjusting the capacity of the adapter according to the complexity of the target task. We denote θ_i = {W^i_E, W^i_D} as the set of parameters for each layer, and Θ = {θ_0, · · · , θ_l} as the full set of parameters added to the model.
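A minimal numpy sketch of the adapter block, assuming a ReLU non-linearity in the bottleneck (shapes and initialization are illustrative):

```python
import numpy as np

# Minimal residual-adapter block: layer norm, a bottleneck auto-encoder,
# and a residual connection. Only W_E and W_D would be trained.
rng = np.random.default_rng(2)
d, m = 8, 2                             # hidden size d, bottleneck m (m << d)
W_E = rng.normal(size=(d, m)) * 0.1     # down-projection, trainable
W_D = rng.normal(size=(m, d)) * 0.1     # up-projection, trainable

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adapter(o):
    """Adapter(o) = o + ReLU(LN(o) W_E) W_D: steers o without touching the LM."""
    return o + np.maximum(layer_norm(o) @ W_E, 0.0) @ W_D

o = rng.normal(size=(5, d))             # hidden states for 5 generation steps
out = adapter(o)
```

The residual connection means the adapter only has to learn a correction to the frozen model's representation, which is why such a small bottleneck m suffices.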

Plug-and-Play Adapters
At decoding time, PPLM requires a fixed number of iterations p to generate each token, making the model impractical for interactive tasks such as conversational systems. To cope with this issue, we propose to first use PPLM to generate datasets of dialogues with certain attributes a, denoted as D^a = {D^1, . . . , D^n}, and then to optimize the residual adapter parameters to steer the output of the original LM distribution. Hence, for each attribute a, we optimize the parameters Θ_a to minimize the negative log-likelihood over the dataset of dialogues D^a. Formally,

L(D^a) = − Σ_k Σ_i log p(s^k_i | s^k_{<i}, D^k_t)

where each response S^k_t = {s^k_0, · · · , s^k_n} is of maximum length n (Roller et al., 2020). Since Plug-and-Play Adapters use the generated responses from PPLM, we randomly split the prefixes, with 80% for learning the adapter perturbation and the remaining 20% for the final automatic and human evaluation. This is done to ensure a fair comparison between the other baselines and the adapters (see Appendix A for more details).
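The adapter objective above reduces to a plain negative log-likelihood over per-token probabilities; the numbers below are dummy stand-ins for the adapted model's predictions on a PPLM-generated response:

```python
import numpy as np

# Sketch of the adapter objective: negative log-likelihood of a
# PPLM-generated response under the adapted LM. The per-token
# probabilities are made-up placeholders, not model outputs.
def nll(token_probs):
    """-sum_i log p(s_i | s_<i, D_t): lower when the model fits the response."""
    return -float(np.sum(np.log(token_probs)))

# two hypothetical responses from the synthetic dataset D^a
well_fit = [0.9, 0.8, 0.7]      # adapter assigns high probability to each token
poorly_fit = [0.2, 0.1, 0.3]
```

Minimizing this loss over D^a teaches the adapters to reproduce PPLM's steering without PPLM's per-token gradient iterations.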

Attribute Models
We train three discriminators covering six attributes: Positive, Negative, Question, Sci/Tech, Business, and Sport. For controlling positive and negative responses, we use SST-5 (Socher et al., 2013) with the classes Very-Positive and Very-Negative as the attributes. For controlling for Question, we use the speech-act annotation from DailyDialog (Li et al., 2017) with the Question class as the attribute. To avoid any dialogue-related data, we use only the sentences, without the corresponding context. Finally, for generating responses about Sci/Tech, Business, and Sport, we use the AG-NEWS (Zhang et al., 2015) topic-classification dataset, using the respective classes as attributes. As mentioned in Section 3.1, we freeze the DialoGPT parameters and train a linear classifier on top of the representations from the final layer of its Transformer blocks. Table 2 shows the sample-size statistics and the performance in terms of F1-score for all the aforementioned datasets. We also report the current state-of-the-art, to show that a linear classifier trained on top of the DialoGPT activations can reach competitive performance.

Baselines
We compare multiple plug-and-play settings: the unmodified DialoGPT (DG), weighted decoding (WD), PPLM (PP), and the plug-and-play adapters (AD). In all the baselines, we sample 10 different hypotheses using multinomial sampling after top-k filtering (with k = 10) to ensure response diversity (Zhang et al., 2020a), and we select the hypothesis with the lowest attribute-model loss as the response. This re-ranking technique has been shown to be very effective for generating good responses (Adiwardana et al., 2020; Dathathri et al., 2019).
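The sample-and-rerank step can be sketched as follows; the candidate hypotheses, the dummy attribute loss, and all names are illustrative, not the paper's code:

```python
# Sketch of sample-and-rerank: keep only the k highest-scoring tokens at each
# step (top-k filtering), then pick the sampled hypothesis with the lowest
# attribute-model loss. The scorer below is a toy stand-in for the classifier.
def top_k_filter(logit_pairs, k=10):
    """Keep only the k highest-scoring (token, logit) pairs."""
    return sorted(logit_pairs, key=lambda p: p[1], reverse=True)[:k]

def rerank(hypotheses, attribute_loss):
    """Select the hypothesis with the lowest attribute-model loss."""
    return min(hypotheses, key=attribute_loss)

hyps = ["that is great news", "terrible, awful day", "okay I guess"]
positive = {"great", "good", "news"}                  # toy positive vocabulary
loss = lambda h: -sum(w in positive for w in h.split())
best = rerank(hyps, loss)
```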

Evaluation Metrics
We evaluate the generated responses using both automatic and human evaluations. Automatic Eval. in open-domain chat is challenging (Liu et al., 2016), especially when using n-gram methods over a single reference (e.g., BLEU (Papineni et al., 2002)). In this paper, no gold-reference response is provided (e.g., a stylistic human-generated response), so we rely on unsupervised measures for fluency, diversity, and style/topic. For fluency, we compute the perplexity score of the dialogue prefix plus the generated response using GPT2 (Radford et al., 2019). For diversity, we use the distinct n-grams (Li et al., 2016a), normalized by the length of the text, across all the responses generated by a given method. For evaluating attribute consistency, we train external classifiers on data that does not overlap with the attribute models' training data. For sentiments, we use AMAZON-5 (McAuley and Leskovec, 2013) product reviews. For topics, we use the test-set data of AG-NEWS (Zhang et al., 2015), because we could not find another topic-classification dataset with the same classes. For each dataset, we train a separate BERT (Devlin et al., 2019) (base) classifier with a simple classification head. Table 2 in Appendix B summarizes the dataset statistics and the performance of the trained scorers.
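One common formulation of the dist-n metric is sketched below (this is our reading of the normalization; the paper's exact implementation may differ, e.g., in how response boundaries are handled):

```python
# Sketch of the dist-n diversity metric (Li et al., 2016a): the number of
# distinct n-grams divided by the total number of generated tokens, pooled
# across all responses produced by one method.
def dist_n(responses, n):
    tokens = [tok for r in responses for tok in r.split()]
    ngrams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return len(ngrams) / max(len(tokens), 1)

resps = ["so so bad", "so bad"]   # repetitive responses score low on dist-1
```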
Human Eval. is the most effective way of evaluating open-domain chat-bots. In this paper, we evaluate two aspects of the generated responses: Humanness and Attribute Consistency. The first evaluates the fluency and coherence of the generated responses. The second evaluates whether the generated responses respect the style or topic enforced by the attribute model. We use ACUTE-EVAL-style A/B testing, in which we compare all possible model pairs (e.g., PP vs. DG, etc.). For each comparison, we show the same dialogue context and two possible options, one generated by model A and one by model B, and we ask the annotators to select among four options: model A, model B, both, or neither. We collect annotations for both Humanness and Attribute Consistency on 30 dialogues per model comparison and attribute, which amounts to a total of 4200 human annotations. Further details are provided in Appendix C.

Results
In this section, we evaluate the proposed methodology to answer three research questions: 1) is it possible to use plug-and-play methods for controlling the output of a large pre-trained conversational model? If so, 2) what are the most effective plug-and-play methods? And 3) how difficult is it to control the response generation for various attributes? To answer the first two questions, we rely on both automatic and human evaluation. Table 3 and Figure 1 report the aggregated results for all the styles and topics in both evaluations. The breakdown per attribute is reported in Appendix D.

Quantitative Evaluation
Figure 2: Contour plot of the normalized sum of the log perplexity score, computed by GPT2, and the external classifier loss on the responses generated by PPLM for the negative and positive styles. The x-axis shows the number of iterations p and the y-axis the step size α. Darker areas correspond to a higher loss sum, meaning higher perplexity and higher classification loss. Each label is a sample response for a given number of iterations and step size.

Automatic Eval. The main evaluation criterion is to have responses that are as fluent as those of the original DialoGPT, or as human responses, while following the style or topic enforced by the attribute model. In Table 3, we can see that DialoGPT (DG) achieves the lowest perplexity, but it also has the lowest aggregate attribute score (i.e., Score in Table 3). Analysing the breakdown by style, we can see that, by default, the original model scores higher on both the positive style and the Sci/Tech topic. We hypothesize that this is due to two factors: 1) The discussions on Reddit are more often related to Sci/Tech topics. Given general questions as input, e.g., "What do you do for a living?", the model often generates tech-related responses, e.g., "I am a computer science student".
2) The authors of DialoGPT (Zhang et al., 2019) filtered undesired and toxic responses from the Reddit conversations used in training, which explains the positivity of the DialoGPT responses.
Using weighted decoding (WD) on top of DialoGPT leads to an improvement in both the diversity score and the external classifier score. However, WD tends to increase the perplexity score, showing that generation fluency with respect to the context is lost. In preliminary experiments, we noticed that weighted decoding generates responses that are not related to the dialogue context but are highly similar to the distribution of the discriminator datasets. This is consistent with the observations in See et al. (2019) that weighted decoding is difficult to tune and often provides control at the cost of fluency, leading to non-sensical generation. On the other hand, PPLM (PP) achieves a lower perplexity than WD while attaining both a higher attribute-consistency score and a high response diversity (dist). We hypothesize that this improvement is due to the ability of PPLM to dynamically perturb the latent activations of the model without breaking the original distribution, thanks to the KL regularization and the Post-norm Geometric Fusion (Stahlberg et al., 2018).
The adapter plug-and-play setting has the highest overall attribute score and a lower perplexity than both PP and WD. However, its response diversity, especially dist-1, is lower than that of the other baselines, meaning that the responses may contain repeated tokens (e.g., "so so bad"). In general, the adapters, although optimized with the imperfect PPLM-generated responses, properly learn to steer the output distribution without breaking the original DialoGPT output. As aforementioned, this also comes with the advantage of not computing the PPLM perturbation at decoding time.
Human Eval. In Figure 1, we report the winning rates of the A/B testing for both humanness and attribute consistency. From these results, we highlight: 1) There is no statistically significant difference in the humanness score among the multiple methods, even with 210 annotations per cell. In general, all the methods lose against the human response (HM), but not by a large margin; this is because annotators choose the "both" option quite often. 2) In terms of attribute consistency, we observe that the methods form a clean, well-ordered ranking, AD > PP > WD > DG > HM, which confirms the automatic evaluation results. Unlike for humanness, all the results except WD vs. DG are statistically significant (p < 0.05), showing that the adapters clearly outperform the other methods.
To answer the first two research questions: both automatic and human evaluation show that plug-and-play methods are suitable for controlling response generation. Moreover, the most effective method is the adapter plug-and-play, which produces fluent and attribute-consistent responses while being three orders of magnitude faster than PPLM at inference time (148.5 s/token vs. 0.123 s/token on a single Nvidia 1080Ti).

Analysis
In this section, we evaluate the difficulty of controlling the response generation for a given attribute.
To do so, we analyse the behaviour of PPLM over two opposite styles (i.e., positive and negative) and then we conduct a qualitative evaluation over the generated responses.

Iteration & Step Size
We analyse the loss of the automatic scorers for fluency and attribute consistency to understand the effects of the number of iterations p and the step size α in Equation 3. Figure 2 depicts the normalized sum of the log perplexity score, computed by GPT2, and the external classifier loss on the generated responses for the negative and positive styles. In general, the aggregate loss for the negative attribute (Figure 2a) is higher than for the positive attribute (Figure 2b), as also shown in the sampled responses, where small step sizes and few iterations lead to positive responses. However, when both the step size and the number of iterations surpass a certain threshold, the conditioning becomes very strong and the text generated by PPLM loses its fluency. Overall, this visualization suggests that it is more laborious to control for the negative sentiment with PPLM, and that there is a smaller region of the hyper-parameter space where the responses are both fluent and attribute consistent.
Qualitative Analysis We sample and read 200 dialogue responses from the adapter plug-and-play model (AD), and we study the overall quality of the responses, especially to understand when and why DialoGPT is hard to steer. We identify three relevant factors: 1) the dialogue context influences the difficulty of steering the response, 2) the vocabulary available for the attribute style/topic, and 3) the mutual exclusivity of the attribute-specific vocabulary. 1) Unlike language models that use short prefixes (e.g., "The issues ...") to trigger the generation (Dathathri et al., 2019), conversational models are constrained by the given dialogue history, which significantly influences controllability. Given an open-ended dialogue context (e.g., Table 11 in the Appendix), AD generates an impressively natural and on-topic response, but when provided with a more constrained dialogue context (e.g., Table 17 in the Appendix), AD generates a response that may sound abrupt and out of context.
2) Looking at the overall responses, also shown in Table 4, we observe that models use a restricted vocabulary for generating attribute consistent responses. For example, AD frequently generates sentences containing "horrible", "terrible" or "worst" for negative, while "beautiful", "happy" or "wonderful" are more common for positive.
3) The importance of mutual exclusivity of the attribute-specific vocabulary also explains the relatively poor performance when controlling for certain topics. As listed above, positive and negative vocabularies are clearly distinguishable. However, the attribute-specific words for topics such as Business are more generic (e.g., "car", "store") than other topics such as Sport (e.g., "football", "hockey") or Sci/Tech (e.g., "android", "software"). If the attribute-specific words are common and shared across multiple domains, the generated responses may not sound attribute specific even though the correct vocabulary is used.
Note that this reliance on a restricted vocabulary also harms fluency, because the vocabulary cannot always fit a given context. Additional generated examples and statistics of the attribute-specific vocabulary for each style/topic are provided in Appendix D. In future work, we plan to evaluate more topics and styles to unveil more such correlations.

Conclusion
We explore plug-and-play methods for controlling the response generation of large pre-trained conversational models in a lightweight yet effective manner. With extensive automatic and human evaluations, we show that PPLM is able to generate fluent and attribute-consistent responses. Further, to overcome the significant computational overhead introduced by PPLM at decoding time, we optimize a tiny residual adapter for each attribute, based on a few synthetic responses generated using PPLM. The resulting model requires no further computation at decoding time, and outperforms PPLM in terms of both fluency and attribute consistency.