Counterfactual Off-Policy Training for Neural Dialogue Generation

Open-domain dialogue generation suffers from the data insufficiency problem due to the vast size of potential responses. In this paper, we propose to explore potential responses by counterfactual reasoning. Given an observed response, the counterfactual reasoning model automatically infers the outcome of an alternative policy that could have been taken. The resulting counterfactual response synthesized in hindsight is of higher quality than the response synthesized from scratch. Training on the counterfactual responses under the adversarial learning framework helps to explore the high-reward area of the potential response space. An empirical study on the DailyDialog dataset shows that our approach significantly outperforms the HRED model as well as the conventional adversarial learning approaches.


Introduction
Open-domain dialogue generation (Shang et al., 2015a; Vinyals and Le, 2015; Sordoni et al., 2015a) aims to produce coherent responses given a dialogue history. Nevertheless, it suffers from a data insufficiency problem, as many potential responses may exist for a given dialogue history (Li et al., 2016). An ideal way of exploring the potential responses is to train the model by chatting with real users, which is usually time-consuming and labor-intensive in practice. Although replacing a real user with a user simulator could address the issue, the simulator only roughly approximates real user statistics, and its development process is costly (Su et al., 2016).
In contrast, humans can independently reason about potential responses based on past experiences from the true environment. Having observed a response, one might naturally ask: "What would happen if I responded differently, while everything else in the environment remained the same?" Answering the question yields a potential response (see the example in Figure 1), which is beneficial for improving future decision making (Roese, 1997). The potential response inferred in hindsight is called a counterfactual response, where the concept "counterfactual" describes the posterior process of reasoning about the outcome of alternative actions (i.e., a different responding policy) that could have been taken while keeping everything else unchanged (Buesing et al., 2019).
Motivated by this, we propose a counterfactual off-policy training (COPT) approach to explore potential responses. Building upon the adversarial learning framework, COPT casts a dialogue generator as a structural causal model (SCM), which describes a generation process with two ingredients: scenarios and causal mechanisms (Wright, 1920; Buesing et al., 2019). The scenario is a random noise variable that captures all unobserved yet relevant aspects of the environment, e.g., user profiles. The causal mechanism is a deterministic function that takes a scenario and a dialogue history as input and outputs a response. In this way, reasoning a counterfactual response in an observed response's environment can be achieved by feeding the scenario of the observed response into the causal mechanism. After generating the counterfactual response, the generator receives a reward from a discriminator and optimizes itself accordingly.
Intuitively, a counterfactual response is synthesized by grounding the model in the scenario where an observed response occurs, rather than in a scenario sampled from scratch, as in standard adversarial learning-based approaches. This improves the quality of the synthesized responses and subsequently benefits the model that learns from the synthesis. To verify the effectiveness of our approach, we conduct experiments on the publicly available DailyDialog dataset (Li et al., 2017b). Experimental results show that our approach significantly outperforms previous adversarial learning-based approaches in both automatic and human evaluations. The contributions of this paper are summarized as follows:
• We connect the concept of counterfactual reasoning with dialogue generation by casting the dialogue generation model as a structural causal model.
• Our counterfactual response is of higher quality than the response synthesized from scratch in standard adversarial learning-based dialogue generation models.
• Our approach is model-agnostic and can be applied to any adversarial learning-based dialogue generation model.
Responses of retrieval-based methods come from a fixed candidate response set and thus cannot be customized. Generation-based methods can create new responses, but the vanilla sequence-to-sequence model tends to produce generic responses (Li et al., 2016). One way to address the generic response problem is to introduce external knowledge, such as keywords (Mou et al., 2016; Zhu et al., 2019b), topics (Xing et al., 2017), persona information (Zhang et al., 2019; Song et al., 2019), and retrieved candidate responses (Song et al., 2018; Wu et al., 2019; Zhu et al., 2019a). Another way is to optimize the architecture of networks. Two architectures are widely employed in this research line: the variational auto-encoder (Bowman et al., 2016; Zhao et al., 2017) and the generative adversarial network (Goodfellow et al., 2014; Li et al., 2017a; Zhang et al., 2018; Xu et al., 2018; Tuan and Lee, 2019). Our approach falls into the latter category. The differences between our approach and other adversarial learning-based approaches are as follows. First, we cast the dialogue generation model as an SCM to explore potential responses in the environment where observed responses occur. Second, we learn on counterfactual responses that are inferred from the SCM. Third, a pre-trained behavior policy is involved during the generation process, which makes our approach an off-policy algorithm and benefits the exploration of potential responses.

Counterfactual Reasoning
Counterfactual reasoning is a concept derived from psychology. It describes the human capacity to learn from experience by reasoning about the outcome of an alternative action that could have been taken (Pearl and Mackenzie, 2018). Combined with the SCM, counterfactual reasoning improves the performance of policy evaluation in reinforcement learning (Buesing et al., 2019; Oberst and Sontag, 2019). In the area of NLP, counterfactual reasoning in previous work is mainly used for data augmentation (Qin et al., 2019; Fu et al., 2020; Kaushik et al., 2020), which rewrites the original data given a counterfactual label or condition. In this paper, we connect the concept of counterfactual reasoning with dialogue generation and are the first to cast a generation model as an SCM under the adversarial learning framework.

Method
We cast a dialogue generation model as an SCM to explore potential responses by counterfactual reasoning during the training process. We will first review the concept of the SCM (Sec. 3.2), and then introduce our COPT approach (Sec. 3.3).

Notation
We use capital letters for random variables (e.g., $V$), lowercase letters for instances of random variables (e.g., $v$), and bold letters for vectors (e.g., $\mathbf{V} = \{V_1, \ldots, V_N\}$). During the training process, we refer to the response generated by COPT as the counterfactual response. In contrast, the response of standard adversarial learning-based dialogue generation (e.g., REGS; Li et al., 2017a) is referred to as the standard response.

Background: Structural Causal Model
A structural causal model over random variables $\mathbf{V} = \{V_1, \ldots, V_N\}$ consists of a set of noise random variables $\mathbf{U} = \{U_1, \ldots, U_N\}$ and a set of functions $\mathbf{F} = \{f_1, \ldots, f_N\}$ such that $V_i = f_i(PA_i, U_i)$, where $PA_i \subset \mathbf{V}$ are the parents of $V_i$ in a given DAG (Buesing et al., 2019). $\mathbf{U}$ is called scenarios, and $\mathbf{F}$ is called causal mechanisms. Figure 2 shows an example of an SCM. Each random variable $V_i$ is determined by its parents in $\mathbf{V}$, $U_i$, and $f_i$, e.g., $V_2 = f_2(V_1, U_2)$.
During the training process, we cast a dialogue generation model as an SCM over two random variables: dialogue history $X$ and response $Y$. This is achieved by converting the conditional distribution $P(Y \mid X)$ into a deterministic function $Y = f_\pi(X, U)$ (for more details see Sec. 3.3). The scenario $U$ is a random noise variable that captures all unobserved yet relevant properties, like user profiles. The causal mechanism is denoted as $f_\pi$ to highlight the role of the policy (parameters) $\pi$ of the model. The dialogue generation SCM makes it possible to sample counterfactual responses in the scenario where observed responses occur. This improves the quality of synthesized responses and subsequently helps the model to explore the high-reward area of the potential response space in the training process.
Intervention in SCM. Given an SCM, an intervention $T$ is defined as the replacement of some causal mechanisms. Figure 2 shows an example of an intervention. The original causal mechanism $f_2(V_1, U_2)$ is replaced by $f^T_2(V_1, U_2)$, resulting in the new SCM on the right. Accordingly, an intervention in our dialogue generation SCM denotes the update of the policy. For instance, the update from the behavior policy $\mu$ that generates observed responses to the target policy $\pi$ that we aim to learn is the intervention of replacing $f_\mu(X, U)$ with $f_\pi(X, U)$.
Counterfactual Reasoning in SCM. Given an SCM and an observed variable $V_i = v_i$, counterfactual reasoning answers the question: "What would the variable $V_i$ have been if I had taken an intervention $T$ while keeping everything else unchanged?" In this way, generating a counterfactual response can be seen as querying: "Having observed a response $Y = y$, what would the response $Y$ have been if I had taken an intervention by following the target policy $\pi$, rather than the behavior policy $\mu$ that generates the observed responses?"
Typically, counterfactual reasoning answers the question in the following steps (see Figure 3):
• Infer the scenario $u$ in which the observed response $y$ occurs.
• Take an intervention by replacing the causal mechanism $f_\mu(X, U)$ with $f_\pi(X, U)$.
• Reason a counterfactual response $\hat{y} = f_\pi(x, u)$ by the resulting new SCM.
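The steps above can be sketched on a toy SCM. This is a minimal illustration under assumed mechanisms, not the dialogue model itself: `f_mu`, `f_pi`, and `infer_scenario` are hypothetical names, and the additive-noise mechanisms are chosen only because they make the scenario exactly recoverable from a single observation.

```python
def f_mu(x, u):
    """Behavior mechanism (toy assumption): additive noise on the input."""
    return x + u


def f_pi(x, u):
    """Target mechanism used after the intervention (toy assumption)."""
    return 2 * x + u


def infer_scenario(x, y):
    """Recover the scenario u that explains the observation y = f_mu(x, u)."""
    return y - x


def counterfactual(x, y):
    """Infer the scenario, swap f_mu for f_pi, and re-evaluate in that scenario."""
    u = infer_scenario(x, y)  # scenario in which the observation occurred
    return f_pi(x, u)         # same x, same u, different mechanism


x_obs, y_obs = 3, 5
y_cf = counterfactual(x_obs, y_obs)
print(f"observed y={y_obs}, counterfactual y={y_cf}")
```

In the dialogue setting the mechanism is not invertible in this simple way, which is why the scenario has to be inferred by posterior sampling rather than by direct subtraction.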
In the following sections, we denote an observed response from the training set as $Y$ and a model-generated response as $\hat{Y}$.

Counterfactual Off-Policy Training
Our COPT approach is model-agnostic and can be applied to any adversarial learning-based dialogue generation model. Without loss of generality, we take the combination of COPT and the reward for every generation step (REGS) model (Li et al., 2017a) as an example in this section. It consists of two main components: a generator $G$ and a discriminator $D$.
Generator. The generator $G$ is a sequence-to-sequence (Seq2Seq) model (Sutskever et al., 2014) equipped with the attention mechanism (Bahdanau et al., 2015; Luong et al., 2015). During the encoding process, $G$ reads the dialogue history into hidden states using an encoder LSTM (Hochreiter and Schmidhuber, 1997):
$$H_i = \mathrm{LSTM}_{\mathrm{enc}}(X_i, H_{i-1}), \qquad (1)$$
where $X_i$ is the $i$-th word of the dialogue history, and $H_i$ denotes the corresponding hidden state.

At the $j$-th decoding time step, the hidden states are summarized into a context vector $C_j$ by the attention mechanism. Subsequently, $G$ predicts the distribution of the next word over the vocabulary by a decoder LSTM:
$$S_j = \mathrm{LSTM}_{\mathrm{dec}}(S_{j-1}, [e(\hat{Y}_{j-1}), C_j]), \qquad (2)$$
$$P^{\pi}_j(\hat{Y}_j \mid X, \hat{Y}_{1:j-1}) = \mathrm{softmax}(O S_j), \qquad (3)$$
where the bracket $[\cdot, \cdot]$ denotes concatenation, and $e(\cdot)$ denotes the embedding of a word. $S_j$ is the $j$-th hidden state of the decoder LSTM. $\hat{Y}_{j-1}$ is the word generated in the previous time step. $O$ is the output matrix. We use the superscript in $P^{\pi}_j$ to highlight the role of the policy (parameters of $G$).
An adversarial learning-based dialogue generation model is optimized according to the reward of responses sampled from $P^{\pi}_j(\hat{Y}_j \mid X, \hat{Y}_{1:j-1}) \in \mathbb{R}^{|V|}$ (abbreviated as $P^{\pi}_j$ in the following), where $|V|$ is the vocabulary size. Using the Gumbel-Max Trick (Luce, 2012), the sampling process can be achieved by:
$$\hat{Y}_j = \arg\max\,(\log P^{\pi}_j + U_j), \qquad (4)$$
where each element of $U_j$ follows the standard Gumbel distribution. In this way, the generator turns into a Gumbel-Max SCM (Oberst and Sontag, 2019), whose scenarios and causal mechanisms are represented by $U_j$ and Equation 4, respectively.
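The Gumbel-Max sampling step can be sketched in a few lines. The snippet below is an illustrative toy over a three-word vocabulary, not the authors' implementation; `standard_gumbel` and `gumbel_max_sample` are hypothetical helper names, and all probabilities are assumed to be strictly positive so that the logarithm is defined.

```python
import math
import random


def standard_gumbel(rng):
    """Draw one standard Gumbel variate via the inverse CDF, -log(-log(u))."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max_sample(probs, rng):
    """Sample a word index as argmax(log p + u) and return the noise as well."""
    noise = [standard_gumbel(rng) for _ in probs]
    scores = [math.log(p) + u for p, u in zip(probs, noise)]
    word = max(range(len(scores)), key=scores.__getitem__)
    return word, noise


rng = random.Random(0)
next_word_probs = [0.7, 0.2, 0.1]  # toy next-word distribution
word, noise = gumbel_max_sample(next_word_probs, rng)
print(word, noise)
```

Returning the noise alongside the word makes the split explicit: the noise vector plays the role of the scenario, and the argmax is the deterministic causal mechanism.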
From the perspective of the SCM, each response is generated in a scenario. For instance, a standard response is produced in a scenario sampled from scratch. In contrast, the scenario for a counterfactual response is inferred from an observed response $y = \{y_j \mid y_j = \arg\max\,(\log p^{*}_j + u_j)\}$, where $*$ is the user's policy that generates the observed response in the true environment. However, the user's policy is not available in practice, which hinders the posterior inference of the scenario. To this end, we introduce a behavior policy $\mu$ instead and learn it by minimizing the MLE loss on observed responses. In this way, an observed response can be seen as being generated in a scenario $u_j$ while following the policy $\mu$: $y_j = \arg\max\,(\log p^{\mu}_j + u_j)$.
According to Oberst and Sontag (2019), there are two ways to infer the scenario $u_j$ in hindsight from $y_j = \arg\max\,(\log p^{\mu}_j + u_j)$ given $y_j$ and $\mu$. One way is rejection sampling, which samples $u_j$ from the standard Gumbel distribution and rejects those samples where $y_j \neq \arg\max\,(\log p^{\mu}_j + u_j)$. The other way makes use of the properties of the shifted Gumbel $g = \log p^{\mu}_j + u_j$: the maximum of $g$ follows the standard Gumbel distribution and is independent of the argmax of $g$ (Maddison et al., 2014). Therefore, $g$ can be obtained by first sampling a maximum, assigning it to the observed word $y_j$, and then sampling the remaining elements truncated at the maximum. Then $u_j$ is computed by subtracting $\log p^{\mu}_j$ from $g$. We employ the second method to infer the scenario in COPT because it is more time-efficient than rejection sampling.
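The second inference method can be illustrated as follows. This is a hedged sketch for a single decoding step over a toy vocabulary, assuming normalized, strictly positive probabilities. The names `infer_scenario` and `gumbel_argmax` and the example distributions are hypothetical. The truncation uses the standard identity that, for an unconstrained Gumbel sample $G$ and a bound $b$, $-\log(e^{-b} + e^{-G})$ follows the Gumbel distribution truncated at $b$.

```python
import math
import random


def standard_gumbel(rng):
    """Draw one standard Gumbel variate via the inverse CDF."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return -math.log(-math.log(u))


def infer_scenario(probs_mu, observed, rng):
    """Sample Gumbel noise u such that argmax(log p_mu + u) equals `observed`."""
    # The maximum of the shifted Gumbels is standard Gumbel because probs sum to 1.
    g_max = standard_gumbel(rng)
    noise = []
    for i, p in enumerate(probs_mu):
        if i == observed:
            g = g_max  # the observed word holds the maximum
        else:
            unconstrained = math.log(p) + standard_gumbel(rng)
            # Truncate the shifted Gumbel so that it stays below g_max.
            g = -math.log(math.exp(-g_max) + math.exp(-unconstrained))
        noise.append(g - math.log(p))  # u = g - log p_mu
    return noise


def gumbel_argmax(probs, noise):
    """Causal mechanism: pick the word maximizing log p + u."""
    scores = [math.log(p) + u for p, u in zip(probs, noise)]
    return max(range(len(scores)), key=scores.__getitem__)


rng = random.Random(0)
p_mu = [0.6, 0.3, 0.1]  # behavior policy (toy values)
p_pi = [0.2, 0.5, 0.3]  # target policy (toy values)
observed_word = 0
u = infer_scenario(p_mu, observed_word, rng)
counterfactual_word = gumbel_argmax(p_pi, u)
print(observed_word, counterfactual_word)
```

By construction, feeding the inferred noise back through the behavior policy reproduces the observed word, while feeding it through the target policy gives the counterfactual word for that step.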
Given the scenario inferred from the observed response, COPT reasons the counterfactual response by feeding the dialogue history and the scenario into the SCM (Equation 4). Then the discriminator evaluates the counterfactual response and returns a reward to the generator. Note that the counterfactual response and the SCM are utilized for the training process. During the inference process, responses are generated in the same way as in standard adversarial learning-based dialogue generation (beam search or sampling from $P^{\pi}_j$; we use the former in our approach) because the observed response is not available.

Discriminator
The discriminator $D$ provides a reward for each generation step. It takes as input the dialogue history $X$, the word $\tilde{Y}_j$ produced in the current generation step, and the prefix $\tilde{Y}_{1:j-1}$ from previous steps, where $\tilde{Y} \in \{Y, \hat{Y}\}$ can be either an observed response or a model-generated response. The output reward $D(\tilde{Y}_j \mid X, \tilde{Y}_{1:j-1})$ is the probability that $\tilde{Y}_j$ is human-generated. Concretely, $D$ first reads $X$ and $\tilde{Y}_{1:j}$ with an encoder-decoder model. Then, it computes the reward by a Multi-Layer Perceptron (MLP), which takes as input the last hidden state of the decoder.

Adversarial Learning
We train G and D under the adversarial learning framework, where G tries to fool D by generating human-like responses while D aims to distinguish between model-generated and human-generated (observed) responses. Since a response is a sequence of discrete tokens, the gradient of D cannot be backpropagated to G directly, so we pass the signal from D to G using the policy gradient algorithm. In this way, G becomes an agent whose partially generated response and parameters define a state and a policy, respectively. At each generation step, the agent takes an action by producing a word and observes a reward from D to update its policy.
Note that there are two policies in COPT: the target policy that we aim to learn and the behavior policy used for the reasoning of scenarios. The behavior policy is pre-trained and then frozen during adversarial learning because it aims to maximize the likelihood of a fixed set of observed responses. Introducing the behavior policy makes COPT an off-policy approach because the counterfactual response, from which the target policy learns, is not entirely based on the target policy itself.
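The goal of the generator is to minimize the negative expected reward: $J_G(\theta) = -\mathbb{E}_{\hat{Y}_{1:j} \sim G}\, D(\hat{Y}_j \mid X, \hat{Y}_{1:j-1})$, where $\theta$ denotes the parameters of $\pi$. The gradient with respect to $\theta$ can be derived by the likelihood ratio trick (Williams, 1992):
$$\nabla J_G(\theta) = -\mathbb{E}_{\hat{Y}_{1:j} \sim G}\, D(\hat{Y}_j \mid X, \hat{Y}_{1:j-1})\, \nabla_\theta \log G^{\pi}(\hat{Y}_j \mid X, \hat{Y}_{1:j-1}), \qquad (5)$$
where $G^{\pi}(\hat{Y}_j \mid X, \hat{Y}_{1:j-1})$ is the probability of generating $\hat{Y}_j$ with the policy $\pi$ given $X$ and $\hat{Y}_{1:j-1}$.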
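For readers unfamiliar with the likelihood ratio trick, the identity behind this gradient can be written out for a single generation step. This is the standard REINFORCE derivation in the notation of this section, with the reward treated as independent of $\theta$ and the conditioning on $X$ and $\hat{Y}_{1:j-1}$ abbreviated as $c$; it is included as a reference sketch and is not taken from the original derivation.

```latex
\begin{aligned}
\nabla_\theta\, \mathbb{E}_{\hat{Y}_j \sim G^{\pi}}\big[ D(\hat{Y}_j \mid c) \big]
  &= \sum_{\hat{Y}_j} D(\hat{Y}_j \mid c)\, \nabla_\theta G^{\pi}(\hat{Y}_j \mid c) \\
  &= \sum_{\hat{Y}_j} G^{\pi}(\hat{Y}_j \mid c)\, D(\hat{Y}_j \mid c)\,
     \nabla_\theta \log G^{\pi}(\hat{Y}_j \mid c) \\
  &= \mathbb{E}_{\hat{Y}_j \sim G^{\pi}}\big[ D(\hat{Y}_j \mid c)\,
     \nabla_\theta \log G^{\pi}(\hat{Y}_j \mid c) \big]
\end{aligned}
```

The second line uses $\nabla_\theta G^{\pi} = G^{\pi}\, \nabla_\theta \log G^{\pi}$. Negating both sides gives the gradient of the loss $J_G(\theta)$, which can then be estimated from sampled words.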
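The discriminator distinguishes between observed responses and model-generated responses. This is achieved by minimizing the following loss:
$$J_D(\phi) = -\mathbb{E}_{Y_{1:j} \sim S} \log D(Y_j \mid X, Y_{1:j-1}) - \mathbb{E}_{\hat{Y}_{1:j} \sim G} \log\big(1 - D(\hat{Y}_j \mid X, \hat{Y}_{1:j-1})\big), \qquad (6)$$
where $\phi$ denotes the parameters of $D$. As a positive instance, $Y_{1:j}$ is a prefix randomly sampled from the observed response set $S$. A negative instance $\hat{Y}_{1:j}$ for training $D$ is a prefix of a standard response, rather than of a counterfactual response. This is because counterfactual responses are of higher quality than standard responses (as shown in Sec. 4.7).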
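As a rough numerical illustration, the discriminator objective can be computed from batches of scores. The sketch below assumes the standard binary cross-entropy form of a GAN discriminator loss; `discriminator_loss`, its arguments, and the clipping constant `eps` are hypothetical and not taken from the paper.

```python
import math


def discriminator_loss(pos_scores, neg_scores, eps=1e-12):
    """Binary cross-entropy over D's outputs (assumed loss form).

    pos_scores: D(.) on prefixes of observed responses, each in (0, 1).
    neg_scores: D(.) on prefixes of standard (model-generated) responses.
    """
    pos_term = -sum(math.log(max(d, eps)) for d in pos_scores) / len(pos_scores)
    neg_term = -sum(math.log(max(1.0 - d, eps)) for d in neg_scores) / len(neg_scores)
    return pos_term + neg_term


# A discriminator that separates the two sets well has a lower loss
# than one that outputs 0.5 everywhere.
print(discriminator_loss([0.9, 0.8], [0.1, 0.2]))
print(discriminator_loss([0.5, 0.5], [0.5, 0.5]))
```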

Data
The experiments are conducted on the DailyDialog dataset (Li et al., 2017b). It is a multi-turn dialogue dataset and covers various topics of daily life. The dataset has already been divided into training, validation, and test sets, as shown in Table 1. Given a dialogue that consists of K utterances, we divide it into K-1 instances. Each instance has at most three continuous utterances. The last utterance is the response, and the previous utterances are concatenated as the dialogue history.
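The instance construction described above can be sketched as follows. This is a minimal illustration, assuming utterances are joined with a single space; the function name `build_instances` and the separator are hypothetical choices, since the text does not specify how utterances are concatenated.

```python
def build_instances(utterances, max_utterances=3):
    """Split a dialogue of K utterances into K-1 (history, response) pairs.

    Each instance spans at most `max_utterances` consecutive utterances:
    the last one is the response, the preceding ones form the history.
    """
    instances = []
    for k in range(1, len(utterances)):
        start = max(0, k - (max_utterances - 1))
        history = " ".join(utterances[start:k])  # separator is an assumption
        instances.append((history, utterances[k]))
    return instances


dialogue = ["Hi there.", "Hello, how can I help?", "I need a room.", "For how many nights?"]
for history, response in build_instances(dialogue):
    print(repr(history), "->", repr(response))
```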

Baselines
We compare COPT with the following dialogue generation models:
• HRED (Serban et al., 2016): The hierarchical recurrent encoder-decoder. An implementation by Park et al. (2018) is available.
• REGS (Li et al., 2017a): Reward for every generation step. Its discriminator is trained on partially generated responses to provide a reward for each generation step.
• DPGAN (Xu et al., 2018): The diversity-promoting GAN introduces a language model based discriminator to encourage the generation of informative responses.

Training Details
We implement REGS, StepGAN, and their variants with COPT using OpenNMT (Klein et al., 2017), an open-source framework for building sequence-to-sequence models. We manually tune the parameters according to the perplexity on the validation set. The vocabulary consists of the 10,000 most frequent words. Including more words (up to 17,438, the total vocabulary size of DailyDialog) yields no improvement but takes more time for training. We use 300-dimensional GloVe (Pennington et al., 2014) vectors to initialize word embeddings. Both the encoder and the decoder are two-layer LSTMs in G and single-layer LSTMs in D. The number of hidden units is 500.
During the adversarial learning process, we use the ADAM algorithm to alternately optimize G and D for one batch and five batches, respectively. The batch size is 64. We have tested learning rates from 1e-6 to 1e-3. REGS+COPT and StepGAN+COPT achieve the best performance at 1e-5. The number of parameters for all the baselines is in the range of 21M to 26M. Equipping an adversarial learning baseline with COPT introduces extra parameters, equal in number to the generator's parameters. These parameters belong to the behavior policy and are learned by pre-training, so COPT does not increase the number of trainable parameters in adversarial learning. Table 2 shows the average training time. COPT may increase the training time due to the posterior inference of scenarios, but it facilitates the exploration of the high-reward area of the potential response space and subsequently improves the quality of responses.

Evaluation Metrics
Automatic Evaluation. We evaluate the diversity and the relevance of generated responses using distinct (Li et al., 2016) and BLEU (Papineni et al., 2002), respectively. Distinct-k is the number of distinct k-grams normalized by the number of words in the responses. Since BLEU might correlate weakly with human judgments of quality in the single-reference setting (Liu et al., 2016), we use the multi-reference DailyDialog test set (Gupta et al., 2019), where each instance is augmented with four diverse human-written responses.
Human Evaluation. The human evaluation is conducted on 200 instances randomly sampled from the test set. We create a project on Amazon Mechanical Turk (AMT) (Buhrmester et al., 2016) and employ five AMT workers to give a preference between two responses generated by our approach and a baseline. To maintain the quality of the evaluation, the task is visible only to workers whose approval rate is greater than 95% and whose number of approved tasks is greater than 500.
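The distinct-k metric used in the automatic evaluation can be sketched as follows. This is a minimal version assuming whitespace tokenization and corpus-level aggregation over all responses; the function name `distinct_k` and both assumptions are illustrative, and the exact tokenization used in the experiments is not stated.

```python
def distinct_k(responses, k):
    """Number of distinct k-grams divided by the total number of words."""
    ngrams = set()
    total_words = 0
    for response in responses:
        tokens = response.split()  # whitespace tokenization (assumption)
        total_words += len(tokens)
        for i in range(len(tokens) - k + 1):
            ngrams.add(tuple(tokens[i:i + k]))
    return len(ngrams) / total_words if total_words else 0.0


responses = ["i do not know", "i do not care"]
print(distinct_k(responses, 1))  # 5 distinct unigrams over 8 words
print(distinct_k(responses, 2))  # 4 distinct bigrams over 8 words
```

Generic responses repeat the same k-grams across instances, so a model that produces them scores low on this metric.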

Results
Table 3 shows the results of automatic evaluation. Both REGS and StepGAN outperform HRED in distinct-1 and distinct-2, indicating that adversarial learning is beneficial for improving the diversity of responses. There is no increase in DPGAN compared with HRED in our experiments. We believe this is because the scale of the DailyDialog dataset is not large enough to sufficiently train the language model based discriminator. For the same reason, COPT is not added to DPGAN. After introducing COPT, both distinct-1 and distinct-2 in REGS and StepGAN further increase, and the improvement is significant (t-test, p < 0.01). This suggests that COPT is model-agnostic with respect to adversarial learning-based approaches and helps to promote diversity. In terms of BLEU in Table 3, both REGS and StepGAN achieve higher BLEU scores with COPT, and the improvements of BLEU-1 and BLEU-2 are significant (p < 0.05). This demonstrates the effectiveness of COPT in improving the relevance of responses. The less significant result on BLEU-3 and BLEU-4 is mainly due to the sparsity of tri-grams and four-grams, which are less likely to be covered by the references than uni-grams and bi-grams.
The human evaluation results are shown in Table 5. Our approach is clearly preferred, as it has more winning instances than losing instances (p < 0.01). The results indicate that COPT helps improve the quality of responses. Following Zhou et al. (2018) and Ke et al. (2018), we measure the agreement of annotators using inter-rater consistency. The percentage of instances on which at least three annotators share the same preference (3/5 agreement) is 84.18%. The percentage for 4/5 agreement is 46.89%.
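The agreement percentages can be computed as sketched below. The snippet assumes that "n/5 agreement" means at least n of the five annotators chose the same label, and that labels are one of `win`, `lose`, or `tie`; the function name, the label strings, and the toy votes are illustrative and are not the study's data.

```python
from collections import Counter


def agreement_rate(votes_per_instance, min_votes):
    """Percentage of instances where the most common label has >= min_votes."""
    agreed = 0
    for votes in votes_per_instance:
        top_count = Counter(votes).most_common(1)[0][1]
        if top_count >= min_votes:
            agreed += 1
    return 100.0 * agreed / len(votes_per_instance)


toy_votes = [
    ["win", "win", "win", "win", "win"],
    ["win", "win", "win", "lose", "tie"],
    ["win", "win", "lose", "lose", "tie"],
    ["win", "win", "win", "win", "lose"],
]
print(agreement_rate(toy_votes, 3))  # 3/5 agreement
print(agreement_rate(toy_votes, 4))  # 4/5 agreement
```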

Case Study
Table 4 shows an example of responses generated by baselines and our approach. The response of DPGAN sometimes is not fluent and can be very long. We believe this is also because the scale of the DailyDialog dataset is not enough for the language model discriminator. The response of HRED is not as informative as that of our approach. Its first part is generic, and what the pronoun "that" refers to is not clear. The response of StepGAN is not informative enough either. In contrast, the response of REGS is quite informative, but its content is not entirely relevant to the dialogue history. After introducing COPT, the responses of REGS+COPT and StepGAN+COPT propose offering a discount to address Person B's concern about the price, which is both informative and relevant.

Analysis
To further analyze COPT's effectiveness in exploring the high-reward area of the potential response space during the training process, we compare the reward of a counterfactual response and a standard response on the same 10,000 randomly sampled training instances. However, the comparison between the two types of responses could be biased if their rewards are computed by different discriminators. Besides, the quality of responses is determined not only by the way they are generated (with or without COPT) but also by the generator. To focus on the analysis of COPT and eliminate the bias between generators and discriminators, we generate and evaluate the two types of responses using an identical generator and its corresponding discriminator. Here, we use REGS+COPT and StepGAN+COPT as testbeds because they can generate both types of responses.
Figure 4 shows the distribution of rewards and the average reward. The percentage of counterfactual responses in the high reward interval (0.66, 1.00] is higher than that of standard responses. Meanwhile, counterfactual responses generated with COPT achieve a higher average reward than standard responses. The results demonstrate the effectiveness of the counterfactual response in exploring the high-reward area of the potential response space during the training process. Note that the distribution and the average are not comparable between different epochs because the discriminator is updated as training proceeds.
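The reward statistics in this analysis can be reproduced from a list of per-response rewards along the following lines. The interval boundaries follow the Figure 4 caption; the function name `reward_summary` and the toy rewards are hypothetical and only illustrate the bookkeeping.

```python
def reward_summary(rewards):
    """Return the percentage of rewards per interval and the average reward.

    Intervals: low [0.00, 0.33], middle (0.33, 0.66], high (0.66, 1.00].
    """
    counts = {"low": 0, "middle": 0, "high": 0}
    for r in rewards:
        if r <= 0.33:
            counts["low"] += 1
        elif r <= 0.66:
            counts["middle"] += 1
        else:
            counts["high"] += 1
    n = len(rewards)
    percentages = {name: 100.0 * c / n for name, c in counts.items()}
    average = sum(rewards) / n
    return percentages, average


toy_rewards = [0.1, 0.33, 0.5, 0.9]  # illustrative values, not experimental data
percentages, average = reward_summary(toy_rewards)
print(percentages, average)
```

Running the same summary on counterfactual and standard responses scored by the same discriminator gives the two distributions being compared.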

Conclusion
We propose a model-agnostic approach, COPT, that can be applied to any adversarial learning-based dialogue generation model. In contrast to existing approaches, it learns on counterfactual responses inferred from the structural causal model, taking advantage of observed responses. This helps the model to explore the high-reward area of the potential response space. Experiments show that COPT significantly improves the quality of the generated responses, which demonstrates the effectiveness of this approach.
Figure 1: An example of a counterfactual response, which is a potential response inferred in hindsight from a given observed response.

Figure 2: An example of an SCM and an intervention. Left: An SCM with random variables $\mathbf{V}$, scenarios $\mathbf{U}$, and causal mechanisms $\mathbf{F}$ represented by colored squares. Right: A new SCM after taking an intervention on the left SCM. The original causal mechanism $f_2(V_1, U_2)$ (denoted by the orange square) is replaced by $f^T_2(V_1, U_2)$ (denoted by the purple square).
Figure 3: The architecture of our COPT approach. $\pi$ is the target policy that we aim to learn. $\mu$ is the behavior policy that generates observed responses. First, we infer the scenario $u$ where the observed response occurs. Then we update the policy from $\mu$ to $\pi$, which can be seen as an intervention on the left SCM and results in the right SCM. Then, the counterfactual response is reasoned in the inferred scenario $u$ by the causal mechanism $Y = f_\pi(X, U)$.

Algorithm 1: Counterfactual Off-Policy Training
1: Pre-train π and µ with MLE loss;
2: Pre-train D on positive instances sampled from observed responses, and negative instances generated by pre-trained π;
3: for epoch in number of epochs do
...
12: Update φ according to Equation 6;
13: end for
14: end for

Table 1: Statistics of the DailyDialog dataset.

Initialized with different parameters, π and µ are pre-trained on the training set with MLE loss. The pre-training of D depends on the specific model that COPT is applied to. For example, REGS pre-trains D on the prefix of a response. In contrast, the discriminator of StepGAN (Tuan and Lee, 2019) is randomly initialized during the adversarial learning process. The overall algorithm of COPT is summarized as Algorithm 1.

Table 2: The average training time (in seconds per epoch) on a single GPU.

Table 3: Automatic evaluation results of distinct-1 (Dist-1), distinct-2 (Dist-2), and BLEU scores.

Table 4: An example of generated responses given dialogue history between person A and B.
Think that the will be, but he said that they had a think we have a good job in a. Me, I am sure that I can be able to have a look at the first course.
REGS | A: If you are willing to withdraw money, we can offer you a private vehicle.
REGS+COPT | A: If you are willing to pay a commission fee, we can offer you $5 % discount.
StepGAN | A: What do you mean by that?
StepGAN+COPT | A: In that case, we can offer you a discount.

Figure 4: Reward distribution and the average reward of the counterfactual response (w/ COPT) and the standard response (w/o COPT). The y-axis in (a) and (b) is the percentage, and the x-axis corresponds to three reward intervals in different epochs: low reward interval [0.00, 0.33], middle reward interval (0.33, 0.66], and high reward interval (0.66, 1.00]. The y-axis in (c) and (d) is the reward, and the x-axis corresponds to epochs.

Table 5: Wins, losses, and ties (in %) of our approach against baselines based on the human evaluation.