Regularizing Dialogue Generation by Imitating Implicit Scenarios

Human dialogues are scenario-based and appropriate responses generally relate to the latent context knowledge entailed by the specific scenario. To enable responses that are more meaningful and context-specific, we propose to improve generative dialogue systems from the scenario perspective, where both dialogue history and future conversation are taken into account to implicitly reconstruct the scenario knowledge. More importantly, the conversation scenarios are further internalized using imitation learning framework, where the conventional dialogue model that has no access to future conversations is effectively regularized by transferring the scenario knowledge contained in hierarchical supervising signals from the scenario-based dialogue model, so that the future conversation is not required in actual inference. Extensive evaluations show that our approach significantly outperforms state-of-the-art baselines on diversity and relevance, and expresses scenario-specific knowledge.


Introduction
Neural dialogue generation has drawn increasing attention due to its vast commercial values and practical demands.Typically, given the dialogue history, neural dialogue models, such as plain Seq2Seq model (Sutskever et al., 2014) and Transformer (Vaswani et al., 2017), learn to predict responses via maximum likelihood estimation (Vinyals and Le, 2015;Shang et al., 2015).
Different from other sequence generation tasks, such as machine translation and paraphrase generation, the dialogue generation task can be regarded as a loose-coupling task, which has much freedom in the semantic and the linguistic aspects of the generated responses.However, it is often hard for the existing models to handle such freedom, compared to the fact that humans have no problem in But the ad said they will be sold for a week.
Could you cut the price a little, please?
Figure 1: Examples of one dialogue history and its different responses followed by related future conversations.Different responses imply various conversation scenarios, which can be inferred by different relevant future conversations.
giving specific yet varied responses even for openended dialogue history (Shen et al., 2018;Csaky et al., 2019).One important reason is that we can extend the given dialogue with many possible scenarios of enriched, imaginative background information from our experience and world knowledge, to which existing systems have no access.
It is beneficial for the dialogue systems to build upon such scenarios to facilitate dialogue generation.However, manually annotating the scenario contexts is intractable in terms of both difficulty and quantity.In turn, we find that such scenarios are naturally contained in existing multi-turn dialogue corpora, where the entire dialogue of both dialogue history and future conversation with respect to the current utterance implicitly represents a specific dialogue scenario.An example is given in Figure 1.For Scenario 1, "for a week" in the future conversation suggests the response is related to time.For Scenario 2, "cut the price" indicates that the response contains price information.
Therefore, we reconstruct the dialogue task that only relies on dialogue history into a scenariobased response generation task.In order to enrich the conversation scenario, we employ future conversations together with dialogue histories to learn implicit conversation scenarios, which provide more semantic constraints to guide the response generation.We further propose a novel model to handle this new type of training data consisting of {implicit scenario, response} pairs.
It should be noted that the scenario-based dialogue model relies on future conversations that are inaccessible in inference.Rather than simply searching the training corpora for possible scenarios, we propose an imitation learning framework to drive the conventional dialogue model to absorb the corresponding scenario knowledge from the scenario-based dialogue model.Specifically, the scenario-based dialogue model serves as a teacher, and the conventional dialogue model that relies solely on dialogue history serves as a student that mimics the outputs of the teacher.Under the regularization of scenario knowledge, the student is effectively guided towards a wider local minimum that represents better generalization performance (Chaudhari et al., 2017;Keskar et al., 2017).To facilitate knowledge transfer, the student mimics the teacher on every layer instead of just the top layer, which alleviates the delayed supervised signal problem using hierarchical semantic information in the teacher (Li et al., 2019a).Besides containing the information of future conversations, the distilled knowledge (Hinton et al., 2015) is also a less noisy and more "deterministic" supervised signal in comparison to real-world responses (Lee et al., 2018;Guo et al., 2019), which provides the student with smoother sequence trajectories that are easier to fit.
We highlight our contributions as follows: • We introduce future conversations together with dialogue histories to learn implicit conversation scenarios, which provide more semantic constraints to drive the responses to be meaningful and relevant to the real-world scenario-specific knowledge.
• We propose an imitation learning framework that bridges the gap between training and inference in the accessibility of future conversations.We also demonstrate why imitation learning works and further how to enhance the imitation learning.
• Our model achieves better results than stateof-the-art baselines on four datasets.Extensive analysis demonstrates the effectiveness and the scalability of the implicit conversation scenarios and the proposed imitation learning framework.

Proposed Approach
In this section, we first introduce the scenariobased dialogue model, then describe the imitation learning framework shown in Figure 2, and finally present the training objective.

Scenario-Based Dialogue Model
The conventional dialogue model takes a sequence of dialogue history X = {x 1 , . . ., x T } as input, and generates a response Y = {y 1 , . . ., y T } word by word, where T and T represent the length of source side and target side respectively.The maximum likelihood estimation is usually used to train the model, which can also be expressed as minimizing the negative log-likelihood: where |V| is the size of vocabulary, θ 1 is a set of parameters, and X h represents the input sequence that is from dialogue history.However, dialogue task allows responses to continue the dialogue topic from many aspects or even introduce a new topic depending on various conversation scenarios or semantic constraints, which dramatically increases the difficulty of prediction without any specific scenario information besides the hints from the given dialogue history.Moreover, labeling the scarce scenario information is laborconsuming and impractical.Instead, we resort to easy-to-access but underutilized future conversations that exist in all multi-turn dialogue corpora.By combining the dialogue history and its corresponding future conversation, we introduce the implicit conversation scenario into existing dialogue models to provide more semantic constraints and reduce the difficulty of prediction.
Concretely, we enforce the model to use implicit conversation scenarios to generate responses from two aspects.Different from previous dialogue models only based on the dialogue history X h to predict the response Y , the future conversation X f is also considered as part of input so that the model can look ahead and predict more purposefully.Intuitively, our training pair {(X h , X f ), Y } induces the model to imitate humans to produce the scenariospecific response.
We also redesign the sequence generation architecture to handle the proposed training pair.The attention module in each layer calculates the weight of the contextualized token representations from the encoder based on the information that has been generated in the decoder, and then returns the context c h .In order to consider the future conversation X f , we apply another encoder to produce the contextualized token representations of X f , which will be further extracted as the context c f by the attention module.The new encoder shares the parameters with the original encoder.Meanwhile, the output of the attention module is the concatenation of the past context c h and the future context c f .Finally, the training criterion is formulated as the following negative log-likelihood: where θ 2 is a set of parameters to minimize the NLL loss for the scenario-based dialogue model.

Imitation Learning
In inference, future conversations are inaccessible, which means implicit conversation scenarios cannot be constructed.Thus, the performance improvement from the scenario-based dialogue model cannot facilitate the generation of high-quality responses in practice.In order to bridge this gap between training and inference, we propose an imitation learning framework, in which we regard the scenario-based dialogue model as a teacher and the conventional dialogue model as a student.Through step-by-step imitation, including fine-grained prediction imitation and intermediate representation imitation, scenario knowledge distilled from the teacher regularizes the student to reach a robust local minimum and obtain significant generalization performance in inference.

Fine-Grained Prediction Imitation
Compared with the ground-truth labels, the soft predictions (i.e., the probability distribution from the output layer) contain more fine-grained and valuable information, such as the similarity of labels and potential future conversations.Moreover, the soft predictions provide less noisy and more "deterministic" targets that are easy to mimic.To transfer knowledge from the teacher, instead of taking the one-hot representation of Y as the target, we minimize the cross-entropy of the predicted probability distribution between the teacher and the student:

Intermediate Representation Imitation
Only transferring knowledge from the output layer has a limited effect on the student to use implicit conversation scenarios.When the student network is very deep, the supervised signals from the output layer hardly conduct an effective update and regularization on the parameters of intermediate layers, which will make the imitation learning framework quickly reach saturation (Romero et al., 2015;Sun et al., 2019).This problem prevents the student from scaling to deeper models to further improve the model performance.
To tackle this problem, we extend the range of imitation learning from the soft predictions in the output layer to the output h of intermediate layers to guide the imitation process.Specifically, we penalize the discrepancy of hidden states in intermediate layers between the teacher and the student: where |O| is the number of intermediate layers, h t il and h s il are the outputs of intermediate layers in the teacher and the student respectively, and f (•) is the measurement function.
Because we observe that directly applying the MSE loss as an additional loss hurts the stability of the imitation learning process, we set a scalar threshold α to loose this constraint.

Training
Combining the NLL loss in Equation ( 1) with the IL losses in Equation (3) and Equation ( 4), the final objective function of the student is formulated as: where λ 1 is a hyper-parameter that balances the importance of the NLL loss and the IL losses.
Because the scenario knowledge is only transferred from the teacher by hierarchical supervised signals, our imitation framework has the following three advantages: (1) Compared with the finetuning style of knowledge transfer (Dai and Le, 2015;Howard and Ruder, 2018), the proposed imitation framework does not affect the teacher, i.e., the knowledge learned from the teacher will not be forgotten.( 2) The proposed method is model agnostic.Thus, the imitation object can be extended from one teacher to multiple teachers, such as incorporating a language model besides the scenario-based dialogue model.(3) The imitation process does not change the current objective function, which means the previous work of modifying objective function can serve as a complementary to improve the model performance further.

Datasets
DailyDialog It is provided by Li et al. (2017b), which contains various dialogue topics about daily life.We randomly select 27K, 2.5K, and 1.5K pairs for training, validation, and testing.
PersonaChat It is gathered by assigning two Amazon Turkers with their personas to chat with each other (Zhang et al., 2018a).We only use the conversation section and split it to 67K, 8.5K, and 8K pairs for training, validation, and testing.
OpenSubtitles It is collected from movie subtitles and consists of more than 60M scripted lines (Lison and Tiedemann, 2016).We randomly extract 1500K, 50K, and 25K pairs for training, validation, and testing.
For all datasets, every seven consecutive dialogue turns form a training example, in which the first three turns, the middle turn, and the last three turns are taken as dialogue history, response, and future conversation, respectively.
We also conducted the experiment on a multidomain goal-oriented dataset called MultiWOZ, which is simplified by us as a general dialogue generation task.The detailed description of MultiWOZ and data pre-processing is provided in Appendix A.

Baselines
We re-implemented two classes of six baselines for comparison.The detailed settings of baselines are provided in Appendix B.

LSTM-Based
One class is based on LSTM, including Seq2Seq+Att, which contains a vanilla Seq2Seq model (Sutskever et al., 2014) with attention mechanism (Bahdanau et al., 2015), VHRED+BOW (Serban et al., 2017), which introduces a continuous latent variable attached to the response information into HRED (Serban et al., 2016) and applies BOW loss (Zhao et al., 2017) as a complementary with KL annealing, and NEXUS (Shen et al., 2018), which further uses the future conversation to incorporate more scenario information into the latent variable.

Transformer-Based
The other class is based on Transformer (Vaswani et al., 2017), including itself, ReCoSa (Zhang et al., 2019a), and CHMAM (Tao et al., 2018), which consists of Multi-Head Attention Mechanism (MHAM) and an attention weight regularizer.Both ReCoSa and CHMAM aim to extract more relevant and diverse scenario information from dialogue history.

Experiment Settings
Based on the performance including the loss and metrics on the validation dataset, we trained baselines and our models with the following hyperparameters.According to the scale of the dataset, the vocabulary sizes for OpenSubtitles, DailyDialog, PersonaChat, and MultiWOZ are set to 50k, 20k, 20k, and 18k, respectively.We use separate word embeddings for the encoder and the decoder, and the word embedding dimension is 256.All the parameters are initialized randomly from a normal distribution N (0, 0.0001).All models are trained using Adam (Kingma and Ba, 2015) with a learning rate of 0.001 and gradient clipping at 2.0.The batch size is 128.The hyper-parameters in our proposed appraoch are set as α = 0.01 and λ 1 = 2.0.Our models, i.e., RegDG, the imitating student conventional model, and Transformer-IF, the imitated teacher scenario-based model, are based on Transformer.

Evaluation Metrics
We conducted both automatic and human evaluation to compare the performance of the models.

Human Evaluation
We conducted human evaluation to assess the quality of response.We randomly selected 200 test examples from each dataset and asked three annotators to judge which generated response in each pair (RegDG and baseline) is better (i.e., win, lose or tie) in terms of Diversity (how much the generated response contains meaningful information), Relevance (how likely the generated response is coherent to both dialogue history and future conversation), and Fluency (how likely the generated response is from human).

Automatic Evaluation
The results obtained at the lowest point of the validation loss are shown in Table 1.Our proposed model significantly outperforms all baselines on all datasets.The LSTMbased baselines obtain better performance than Transformer-based baselines in terms of diversity, distribution distance, and relevance, while they lose in grammaticality.It suggests that the LSTM-based model still has a certain advantage in the loose coupling dialogue task.Compared with CHMAM, Re-CoSa achieves higher scores on BLEU and embedding metrics but weaker results on Dist-{1,2,3} and KL divergence, which means that only extracting scenario information from dialogue history cannot provide sufficient semantic constraints to improve model performance across all metrics.Although NEXUS and VHRED+BOW enrich the latent variable and bring more diversity and relevance, they show a distinct decline in PPL.It verifies that our method not only effectively uses the implicit conversation scenario to boost the performance but also indeed transfers this advantage to the inference phase.The improvements of our model on all datasets are significant with p ≤ 0.01 (t-test).The results of MultiWOZ, reported in Appendix D, show similar improvements.Evaluation The results are shown in Table 2.We only report the results of VHRED+BOW, NEXUS, ReCoSa, and CHMAM, which are more related to our work.From the results, we can observe that our model performs better than baselines in all datasets.In particular, our model obtains the most significant win-lose difference on diversity, which demonstrates that the implicit conversation scenario induces the response containing more tokens that are meaningful.We calculate the Fleiss's kappa (Fleiss, 1971) to measure the inter-annotator agreement, and the results are mainly distributed in [0.4,0.6] (i.e., moderate agreement range) with the significance p ≤ 0.01.

Experimental Analysis
In this section, we further quantitatively analyze the effectiveness of future conversations and explore why imitation learning works and how to enhance it.For limited space, we select a set of complementary metrics, Dist-{1,2,3}, PPL, and BLEU, to report the results.The rest of the results is in Appendix F.

Case Study
Table 3 presents some generated responses.The responses generated by baselines are usually dull and meaningless, while the responses generated by our model show diverse and coherent semantic information that indicates distinct relations with those topics in future conversation.The improvements of our model demonstrate the effectiveness of implicit conversation scenarios and our imitation learning framework.Due to limited space, we provide more examples in Appendix E.

Ablation Study
We evaluate the performance of our method without fine-grained prediction imitation (FPI) or intermediate representation imitation (IRI).The ablation study results, reported in Table 4, show that Dialogue history: Well, I am.How much will that cost?// The pass is free.// I don't have to pay for anything?Seq2Seq+Att: I don't like it.VHRED+BOW: Yes, I'm sorry.NEXUS: Yes, we have to pay the cash, please.Transformer: I don't know what I want to do.ReCoSa: I'm going to check out this magazine.CHMAM: I want to get a deposit.RegDG: You need to pay for the monthly sticker.Future Conversation: How much is the monthly sticker?// It's $ 24 for each month.// I'll take the student bus pass.Dialogue history: We're considering of ordering 200 computers, but I'm wondering about the price you'll possibly offer.// Our price will be not less than $ 5000.// Your price is higher than I expected.Could you give us a little discount?Seq2Seq+Att: Yes, I'm afraid I'm going to get it.VHRED+BOW: Yes, we have a credit card.NEXUS: Yes, I need to order the price, but I need to pay the goods in the price.Transformer: I see.I'll take it.ReCoSa: Well, we have to pay a discount for our products.CHMAM: Yes, we'll take a 20%.RegDG: I'm afraid I can't.We don't have any reduction of quality.Future Conversation: But the price is always negotiable and you should consider our quantity of order.// Well, what would you suggest?// Could you make it $ 4500.
Table 3: Examples of the generated responses.The responses generated by our model imply the implicit conversation scenario and contain meaningful information.
both types of imitation are beneficial for knowledge transfer.Without IRI, the model converges to a weaker performance than RegDG.

Effect of Future Conversation
Effect of the Informativeness of Future Conversation We first investigate the situation under which future conversations benefit the generation of high-quality responses.Intuitively, if the true responses or the future conversations are general safe responses, the future conversions contribute little useful information to current dialogue, thereby playing a limited role in the current response prediction.Thus, we classify examples into two sets, i.e., Uninformative and Other.Specifically, Un-  informative includes the examples in which the {dialogue history, response} or the {response, future conversation} is a many-to-one pair.Generally, the second sequence in many-to-one pairs is dull and meaningless (Csaky et al., 2019).To determine the many-to-one pairs, we need to judge whether sentences are of the same meaning and we adopt three measures, that is, whether if the strings match (Exact Match), the words overlap more than 80% (Word Overlap), or the sentences are in the same embedding cluster (Sent.Cluster).
For the detailed settings of the above strategies, please refer to Appendix F. Table 5 shows the results of Transformer-IF on DailyDialog.We can see that the average of all metric improvements of Transformer-IF on the Uninformative set is lower than the Other set, which verifies the assumption that the informativeness of the future conversation supplementing the conversation scenario is crucial to the proposed approach.

Effect of the Capacity of Future Conversation
In order to demonstrate the impact of the information content of the implicit conversation scenario on model performance, we conducted the training and testing of both Transformer and Transformer-IF on DailyDailog (1-1-1) and DailyDailog (3-1-3), respectively."3-1-3" represents that both dialogue history and future conversation consist of three turns, and response only contains one turn."1-1-1" represents that all sequences in the training examples consist of one turn.
The absolute improvements in multi-turn conversation are higher than those in single-turn conversation, which means that Transformer-IF performs better when the implicit conversation scenario contains rich semantic information.Because the automatic metrics may still improve after the lowest point of validation loss (Csaky et al., 2019), the results of both models after 50 epochs of training are reported in Appendix F. It can be observed that Transformer-IF still substantially outperforms Transformer across all metrics under this setting.

Effect of Imitation Learning
Why does the imitation learning work?According to observations in previous work (Chaudhari et al., 2017;Keskar et al., 2017), the model generalization is related to the width of the local minimum achieved by the model.Wider local minima suggests that the model can effectively resist perturbations and obtain better performance on unseen datasets.Therefore, we inject perturbations into the student to judge whether it is guided to a wider local minimum based on the regularization of knowledge transfer.Specifically, we add Gaussian noise with varying magnitude to the parameters of the trained model and observe the perplexity drop on the test set.The results in Figure 3 show that the perplexity of all baselines rapidly increases while the perplexity of our student model grows slowly, indicating that the student model reaches a wider local minimum to gain better generalization.
We also analyze the word distributions of the generated responses to intuitively reflect the effect of regularization from imitation learning.Concretely, we use a vector to represent all generated responses, and each element in the vector represents the frequency of a word.Only 2350 most frequent words are considered as Feng et al. (2020).Then, we calculate the distance between the word distributions from each model and the real-world data.From Table 7, it can be seen that our model significantly outperforms plain Transformer and other baselines, which indicates that knowledge transfer effectively regularizes the model so that the model avoids sticking in a relatively centralized word distribution.
Can imitation learning be accelerated?Before the student mimics the teacher, the teacher is usually well pre-trained.According to our observation, this is a redundant workflow that almost doubles the training time.It is worse if we should train a larger model on a huge dataset.In order to accelerate the training process, instead of transferring knowledge via supervised signals to train a student from scratch, we initialize the specified module of the student directly using the parameters of the teacher, and the transferred parameters are kept from updates during the training process.We call this the hard transfer.We first apply the hard transfer operation on word embedding (Word-Emb), and further extend it to the encoder.The results of Dai-lyDialog in  has been further improved on the diversity with a slight drop on the relevance.Figure 4 shows the variation curve of PPL on the validation set with the training step.The full results are provided in Appendix F. With more hard-transferred modules, the model reaches the lowest point of validation loss faster.It demonstrates that the hard transfer distinctly accelerates the convergence.

Do multiple teachers work?
To take advantage of more diverse and richer prior knowledge, we consider extending the teacher from one to many.We pre-train a transformer-based language model as another teacher.The results are shown in Table 9 with full results in Appendix F. It is clear that with the help of the language model, the student further improves on all metrics, except for a weak decline in relevance, because the language model conducts unconditional sequence generation and does not consider the mapping between the dialogue history and the response.We defer the exploration of balancing multiple teachers in future work.

Related Work
Diversified Dialogue Generation Recently, various researches have focused on neural dialogue models to generate diverse, informative, and relevant responses.One line of research attempts to extract relevant contexts from redundant dialogue history accurately (Xing et al., 2018;Tao et al., 2018;Zhang et al., 2019a).Another line of research tries to explicitly incorporate a latent variable to inject the variability of response in the decoding process (Serban et al., 2017;Zhao et al., 2017).Shen et al. (2018); Gu et al. (2019); Gao et al. (2019) further enriched the latent variable approach.Also, some works redesigned the objective function or automatically learned it by adversarial learning (Li et al., 2016(Li et al., , 2017a;;Xu et al., 2018a;Feng et al., 2020), which improves diversity but brings a fragile training process.Finally, some researchers have adapted external knowledge, such as topic information (Xing et al., 2017), persona (Zhang et al., 2018a), knowledge base (Ghazvininejad et al., 2018).Unlike the above models to pre-dict responses given a dialogue history, our method combines the future conversation with the dialogue history as the implicit conversation scenario, which contains comprehensive background information to guide the response generation.
Imitation Learning Imitation learning, acquiring skills from observing demonstrations, has proven to be promising in structured prediction, such as alleviating the exposure bias problem (Bengio et al., 2015;Zhang et al., 2019b), transferring knowledge to guide non-autoregressive translation model (Gu et al., 2018;Wei et al., 2019), and automatically learning the reward of the dialogue system (Li et al., 2019b).In our work, the conventional dialogue model as a student mimics the scenariobased dialogue model on both the output layer and intermediate layers.

Conclusion
In this work, we introduce the future conversation with the corresponding dialogue history to learn the implicit conversation scenario, which entails latent context knowledge and specifies how people interact in the real world.To incorporate such scenario knowledge without requiring future conversation in inference, we propose an imitation learning framework.The scenario-based teacher model first learns to generate responses with access to both the future conversation and the dialogue history and then a conventional student model is trained to imitate the teacher by hierarchical supervisory signals.
As a result, the student is effectively regularized to reach a robust local minimum that represents better generalization performance.Evaluation on four datasets demonstrates the effectiveness and the scalability of our approach, compared to the state-of-the-art baselines.The proposed framework enables the generation of responses that pertain more closely to the scenario indicated by the given dialogue history.Moreover, detailed analyses illustrate how imitating implicit scenarios regularizes the student model.For future work, we will incorporate pre-trained models into our framework (e.g., BERT as a teacher and GPT as a student) to further unlock the performance improvement and explore how to balance diverse prior knowledge from multiple teachers.

Figure 2 :
Figure 2: Illustration of the imitation learning framework transferring scenario knowledge from the teacher to the student.Top (Teacher): the scenario-based dialogue model.Bottom (Student): the conventional dialogue model.

Table 1 :
The automatic evaluation results at the lowest point of the validation loss.The proposed approach achieves substantial improvements across all the dialogue datasets."↑" means higher is better."↓" means lower is better.

Table 2 :
The human evaluation results.Our model has higher percentages of Win than the baselines.

Table 4 :
Results of the ablation study.

Table 5 :
The average of improvements across all metrics on the Uninformative set and the Other set." * " and " * * " indicate p ≤ 0.05 and p ≤ 0.01, respectively.

Table 6 .
Compared with

Table 7 :
Effect of regularization reflected by cosine similarity between the generated and real-world word distributions.

Table 8 :
Results using the hard transfer.With more hard-transferred modules, the diversity gradually improves, while the relevance gradually weakens.

Table 9 :
Table 8 indicate that the performance Results of multiple teachers on DailyDialog.With the help of a pre-trained LM, the performance is improved consistently.