Self-Supervised Dialogue Learning

The sequential order of utterances is often meaningful in coherent dialogues, and changing the order of utterances can lead to low-quality and incoherent conversations. We consider the order information a crucial supervised signal for dialogue learning, which, however, has been neglected by many previous dialogue systems. Therefore, in this paper, we introduce a self-supervised learning task, inconsistent order detection, to explicitly capture the flow of conversation in dialogues. Given a sampled utterance pair triple, the task is to predict whether it is ordered or misordered. We then propose a sampling-based self-supervised network, SSN, to perform the prediction with sampled triple references from the previous dialogue history. Furthermore, we design a joint learning framework where SSN can guide dialogue systems towards more coherent and relevant dialogue learning through adversarial training. We demonstrate that the proposed methods can be applied to both open-domain and task-oriented dialogue scenarios, and achieve new state-of-the-art performance on the OpenSubtitles and Movie-Ticket Booking datasets.

However, utterance generation in dialogue systems still faces critical challenges, including utterance blandness and incoherence (Gao et al., 2018). These are mainly caused by objective functions that prefer utterances with unconditionally high probability (Li et al., 2016a). We argue that in a meaningful and coherent dialogue, changing the utterance order leads to a low-quality dialogue. However, most existing neural-based dialogue systems either encode the full dialogue history (Li et al., 2017; Xu et al., 2017) or only the current utterance (Liu and Lane, 2018). None of them explicitly models the sequential order or studies how critical it is to the dialogue learning problem.
In this paper, we explore the sequential order within the dialogue as the self-supervised signal to guide meaningful and coherent dialogue learning. We introduce a self-supervised learning task, inconsistent order detection, to explicitly capture the order signal of the dialogue. The task is defined as follows: given a target utterance pair triple, the model is required to predict whether the triple is correctly ordered or not. For instance, the utterance pair triple $(Q_1, A_1), (Q_4, A_4), (Q_2, A_2)$ is misordered. The key to solving this task is to model the utterance order effectively based on the dialogue context. But when directly encoding the full dialogue history along the temporal order, the model actually focuses only on the ending utterances, and earlier information is largely discarded (Li et al., 2017). Thus, we propose a sampling-based self-supervised network (SSN) to account for the forgetfulness problem and solve the inconsistent order detection task. To accurately predict whether a target utterance triple is ordered or not, we randomly sample utterance triples from the dialogue history as references to incorporate the dialogue context. Because the sampled triple references for the same target utterance triple differ across training iterations, this sampling essentially approximates the full dialogue history without suffering from the forgetfulness issue.
arXiv:1907.00448v1 [cs.CL] 30 Jun 2019
To further utilize SSN in real dialogue learning, we propose to jointly learn SSN and the dialogue model via alternating training, where the output probability of SSN is treated as the order signal to evaluate the generated utterance. Moreover, the proposed approach can be applied to both open-domain and task-oriented dialogue learning, which indicates that SSN is a general and scalable approach for dialogue learning. Empirical results on two widely-used benchmark datasets, OpenSubtitles and Movie-Ticket Booking, show that our self-supervised network consistently improves the state-of-the-art (SOTA) neural-based dialogue training methods. In summary, our main contributions are three-fold:
• We introduce the task of inconsistent order detection, and propose a self-supervised learning network SSN to solve this task and explicitly model the crucial order information in dialogue.
• We propose a general framework to jointly learn SSN and the dialogue models, where the sequential order in dialogues can be explicitly used to guide the utterance generation.
• Our method advances the existing state-of-the-art dialogue systems in both open-domain and task-oriented scenarios.

Related Work
Dialogue Learning. Dialogue systems can be roughly classified into open-domain and task-oriented scenarios. In recent years, neural-based conversation models have shown great power in building dialogue systems (Ritter et al., 2011; Sordoni et al., 2015b; Vinyals and Le, 2015; Serban et al., 2016; Luan et al., 2016). However, the utterances generated by neural-based dialogue systems still suffer from blandness and incoherence (Gao et al., 2018). To address these problems, Li et al. (2016a) propose a mutual information objective to infer the utterance generation. Serban et al. (2017) and Zhang et al. (2018a) further apply latent variable models to generate more specific responses. As in some language generation tasks (Lamb et al., 2016; Yu et al., 2017), generative adversarial networks (GAN) (Goodfellow et al., 2014) have also been adapted to learn a better objective function for dialogue (Li et al., 2017; Xu et al., 2017; Liu and Lane, 2018; Su et al., 2018). The discriminator in a GAN is often used to evaluate the generated utterances and guide dialogue learning. However, these methods mainly rely on the surface information of generated utterances to guide dialogue learning, and fail to consider the connection between utterances within the dialogue history. In this paper, we focus on the sequential information of the dialogue and show that the unique sequential order in a meaningful and coherent dialogue contains more useful semantic information for dialogue learning.
Self-Supervised Learning. Self-supervised learning, which aims to train a network on an auxiliary task where the ground truth is obtained automatically, has been successfully applied in computer vision. Many self-supervised tasks have been introduced to use non-visual but intrinsically correlated features to guide visual feature learning (Doersch et al., 2015; Wang and Gupta, 2015; Pathak et al., 2016). In natural language processing, predicting nearby words (Mikolov et al., 2013b,a) is a self-supervised task for learning word embeddings. Language modeling is another line of self-supervision, where a language model learns to predict the next word given the previous sequence (Bengio et al., 2003; Dai and Le, 2015; Peters et al., 2018). Recently, Devlin et al. (2019) propose two self-supervised tasks, the masked language model and next sentence prediction, to learn sentence embeddings. Lample and Conneau (2019) and Liu et al. (2019) further extend these two tasks to multi-lingual and multi-task paradigms. Wang et al. (2019) consider them at the sentence level for extractive summarization. Our work is the first to consider the sequential order as the self-supervised signal in dialogue, and we propose the self-supervised task of inconsistent order detection towards more coherent and relevant dialogue learning.

Methods
In this section, we systematically describe how to utilize the internal sequential order of utterances as self-supervision for dialogue learning. In Section 3.1, we first introduce the task of inconsistent order detection, where the model needs to predict whether one sampled triple of the dialogue is correctly ordered or not. We then present an effective sampling-based approach, the self-supervised network (SSN), to learn to capture the important order signal and solve this task (see Section 3.2).
Finally, in Section 3.3, we show how SSN can contribute to both open-domain and task-oriented dialogue learning by modeling inconsistent order detection.

Inconsistent Order Detection
Dialogue systems aim to converse with humans in a meaningful and coherent way (Gao et al., 2018). Thus, the sequential order in dialogue data is an important signal for building a good dialogue system. Existing neural-based dialogue systems only consider this signal in a weak and implicit way, using hierarchical encoders to model the dialogue history (Sordoni et al., 2015a; Serban et al., 2016; Li et al., 2017; Serban et al., 2017; Xing et al., 2018). However, we argue that these methods are mainly designed to model the overall semantic context of the dialogue history and are not good at modeling the intermediate sequential order. In particular, the order signal becomes weaker as the number of dialogue turns increases. Thus, we propose the task of inconsistent order detection to force models to capture this signal explicitly as self-supervision. Given a dialogue up to turn $t$, we can formulate it as $\{(Q_1, A_1), (Q_2, A_2), \dots, (Q_t, A_t)\}$, where $(Q_t, A_t)$ is a pair of human-machine utterances.
Then we can sample multiple utterance pair triples from this dialogue using the following strategies:
• Ordered triple sampling: We sample a triple following the sequential order of the dialogue as $\{(Q_i, A_i), (Q_j, A_j), (Q_k, A_k)\}$, where $i < j < k \le t$.
• Misordered triple sampling: The three utterance pairs are sampled in a triple as $\{(Q_i, A_i), (Q_k, A_k), (Q_j, A_j)\}$, where $i < j < k \le t$.
Note that when the current dialogue length $t \le 2$, there are not enough utterance pairs to sample a proper triple. Thus, we add three extra shared padding utterance pairs $(Q_{-2}, A_{-2})$, $(Q_{-1}, A_{-1})$ and $(Q_0, A_0)$ ahead of all the dialogue data before sampling.¹
Based on the above triple sampling strategies, we define the task of inconsistent order detection as follows: given a dialogue history for evaluation, the model needs to predict whether the sampled triple is ordered or misordered.
¹ Specifically, e.g., the added padding utterance $Q_{-2}$ is represented as a sequence of $N$ copies of the same padding word, where $N$ is the rounded-up average length of utterances in the dataset.
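The two sampling strategies and the padding trick can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `sample_triple`, the padding placeholders in `PAD`, and the uniform choice of positions are assumptions made here, not details given in the text. The label convention (0 for ordered, 1 for misordered) follows the output convention of SSN described in Section 3.2.

```python
import random

# Three shared padding pairs standing in for (Q_-2, A_-2), (Q_-1, A_-1), (Q_0, A_0).
# The placeholder strings are hypothetical.
PAD = [(f"<pad_q{i}>", f"<pad_a{i}>") for i in (-2, -1, 0)]


def sample_triple(dialogue, ordered, rng):
    """Sample one utterance pair triple from the padded dialogue.

    dialogue: list of (Q, A) pairs up to the current turn t.
    ordered:  True  -> positions (i, j, k) with i < j < k
              False -> positions (i, k, j), i.e. the last two pairs swapped
    Returns (triple, label) with label 0 = ordered, 1 = misordered.
    """
    padded = PAD + list(dialogue)
    # Pick three distinct positions and sort them so that i < j < k.
    i, j, k = sorted(rng.sample(range(len(padded)), 3))
    idx = (i, j, k) if ordered else (i, k, j)
    return [padded[n] for n in idx], 0 if ordered else 1


if __name__ == "__main__":
    rng = random.Random(0)
    dialogue = [("q1", "a1"), ("q2", "a2")]  # t = 2, too short without padding
    print(sample_triple(dialogue, ordered=True, rng=rng))
    print(sample_triple(dialogue, ordered=False, rng=rng))
```

Because the padding pairs are prepended before sampling, the sketch also works for dialogues with fewer than three turns.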

Self-Supervised Network SSN
We build a model to solve the inconsistent order detection task and explicitly capture the sequential order in dialogue. An overview of our approach is shown in Figure 1. At each dialogue turn $t$, given a target triple containing the current utterance pair, we first sample triple references from the previous dialogue history to capture more semantic context in the dialogue. The target triple and triple references are then transformed into embeddings using an utterance pair encoder and an order reasoning layer. Finally, the concatenation of the embeddings is used for the final prediction. We describe the SSN in detail as follows.

Triple Reference Sampling
Given the task definition in Section 3.1, the model needs to predict whether there is inconsistent order in the target triple containing the current utterance pair $(Q_t, A_t)$. Intuitively, the more of the previous dialogue history we can access, the better we can predict inconsistent order. One trivial way is to encode the full previous dialogue history using a hierarchical network and make the prediction. However, Li et al. (2017) suggest that this structure actually focuses more on the final two preceding utterances than on the whole history. The sequential order signal is very weak under this condition. We also report similar results in Section 4.1.
Therefore, we propose a sampling-based approach to effectively model the utterance order based on the dialogue context. For each sampling operation, we sample two triple references $T'$ and $T''$ from the previous dialogue history $\{(Q_1, A_1), \dots, (Q_{t-1}, A_{t-1})\}$ following the sampling strategies in Section 3.1. In general, we explore the following three combinations of reference sampling strategies for $T'$ and $T''$:
• $T'$ and $T''$ are both sampled ordered references.
• $T'$ and $T''$ are both sampled misordered references.
• $T'$ is ordered while $T''$ is misordered.
Note that in our experiments, we choose one combination and use it to sample the triple references for all the target triples.
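The three reference combinations can be illustrated with a short Python sketch. The names `COMBOS`, `sample_positions`, and `sample_references` are hypothetical, and the sketch assumes the history already contains at least three utterance pairs (for example after prepending the padding pairs).

```python
import random

# Each combination says whether T' and T'' are ordered (True) or misordered (False).
COMBOS = {
    "both_ordered": (True, True),
    "both_misordered": (False, False),
    "mixed": (True, False),
}


def sample_positions(n, ordered, rng):
    """Pick positions i < j < k out of n and optionally swap the last two."""
    i, j, k = sorted(rng.sample(range(n), 3))
    return (i, j, k) if ordered else (i, k, j)


def sample_references(history, combo, rng):
    """Draw the two reference triples [T', T''] from the previous history."""
    refs = []
    for ordered in COMBOS[combo]:
        idx = sample_positions(len(history), ordered, rng)
        refs.append([history[n] for n in idx])
    return refs


if __name__ == "__main__":
    rng = random.Random(1)
    history = [(f"q{n}", f"a{n}") for n in range(1, 7)]
    # A fresh pair of references is drawn on every call, so repeated calls
    # for the same target triple see different parts of the history.
    for _ in range(2):
        print(sample_references(history, "mixed", rng))
```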

Objective Function
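Given the target triple embedding $T$ and the triple reference embeddings $T'$ and $T''$, we use SSN to calculate the probability $p(T \mid T', T'') = \mathrm{SSN}(T, T', T'')$. We use the binary cross-entropy loss to train the model:
$$\mathcal{L} = -y \log p(T \mid T', T'') - (1 - y) \log\big(1 - p(T \mid T', T'')\big) \tag{1}$$
where $y$ is the ground-truth label.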
For the same target triple $T$, the triple references are sampled $m$ times to approximate the full dialogue history. We can then rewrite the loss function as
$$\mathcal{L} = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y \log p(T \mid T'^{(i)}, T''^{(i)}) + (1 - y) \log\big(1 - p(T \mid T'^{(i)}, T''^{(i)})\big) \Big] \tag{2}$$
where $T'^{(i)}$ and $T''^{(i)}$ are the triple references of the $i$-th sampling. This is essentially a Monte Carlo estimation, so the model can effectively incorporate the dialogue context and capture the order information while avoiding both direct encoding of the full dialogue history and the forgetfulness issue.
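As a rough numerical illustration of the sampled objective, the sketch below averages the binary cross-entropy over $m$ sampled reference pairs. The averaging is one plausible reading of the Monte Carlo estimate, assumed here for illustration; the function names and the clipping constant `eps` are likewise assumptions.

```python
import math


def bce(p, y, eps=1e-12):
    """Binary cross-entropy for one prediction p and label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def monte_carlo_loss(probs, y):
    """Average the BCE over the m predictions p(T | T'^(i), T''^(i))."""
    return sum(bce(p, y) for p in probs) / len(probs)


if __name__ == "__main__":
    # Four samplings of references for one misordered target triple (y = 1).
    print(monte_carlo_loss([0.9, 0.7, 0.8, 0.6], y=1))
```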

Network Structure
In this section, we describe how SSN embeds both the target triple $T$ and the triple references $T'$ and $T''$ to generate $p(T \mid T', T'')$ in each sampling.
Utterance Pair Encoder. First, given an utterance pair $(Q_t, A_t)$, we concatenate $Q_t$ and $A_t$ into one sequence. The sequence is then fed into a bidirectional long short-term memory network (bi-LSTM) (Hochreiter and Schmidhuber, 1997), and the utterance pair embedding $U_t$ is the concatenation of the final two states of the bi-LSTM:
$$U_t = \big[\overrightarrow{h}_{N_t}; \overleftarrow{h}_{N_t}\big] \tag{3}$$
where $\overrightarrow{h}_{N_t}$ and $\overleftarrow{h}_{N_t}$ are the final states of the forward and backward directions, and $N_t$ is the length of the concatenated utterance sequence.
Order Reasoning Layer. After obtaining the utterance pair embeddings $(U_i, U_j, U_k)$ of a sampled triple, we need to reason and predict whether there is inconsistent order or not. To simplify our model, we use a 3-step reasoning bi-LSTM with a max-pooling layer to perform the order reasoning:
$$h_1, h_2, h_3 = \text{bi-LSTM}(U_i, U_j, U_k), \qquad T = \text{max-pooling}(h_1, h_2, h_3) \tag{4}$$
where the input at each time step of the bi-LSTM is one utterance pair embedding, and $T$ is the final embedding of the given triple.
Given the target triple embedding $T$ and the triple reference embeddings $T'$ and $T''$, the concatenation of these three embeddings is fed into a multi-layer perceptron (MLP), which returns the probability $p(T \mid T', T'')$ that the triple is misordered: values approaching 0 indicate an ordered triple, and values approaching 1 indicate a misordered one.
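The data flow after the bi-LSTMs (max-pooling over the three reasoning steps, concatenation of the three triple embeddings, and a final probability) can be sketched without any deep learning library. This toy sketch deliberately omits the bi-LSTMs and replaces the MLP with a single logistic unit; both simplifications, as well as all names and the toy weights, are assumptions made only to keep the example short.

```python
import math


def max_pool(step_outputs):
    """Element-wise max over the reasoning steps (one vector per step)."""
    return [max(values) for values in zip(*step_outputs)]


def predict_misordered(target, ref1, ref2, weights, bias):
    """Concatenate three triple embeddings and apply one logistic unit.

    Returns a value in (0, 1): near 0 means ordered, near 1 means misordered.
    """
    features = target + ref1 + ref2  # list concatenation of the embeddings
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    # Pretend these are the three step outputs of the reasoning layer.
    target = max_pool([[0.1, 0.5], [0.3, 0.2], [0.0, 0.4]])
    ref1 = max_pool([[0.2, 0.2], [0.1, 0.6], [0.3, 0.1]])
    ref2 = max_pool([[0.9, 0.0], [0.4, 0.4], [0.2, 0.8]])
    weights = [0.5, -0.25, 0.1, 0.3, -0.2, 0.4]  # toy parameters
    print(predict_misordered(target, ref1, ref2, weights, bias=0.0))
```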

Self-Supervised Network for Dialogue
In this section, we explain how the SSN can be applied to the current dialogue system in both open-domain and task-oriented scenarios.
Suppose we have a dialogue system with the history $\{(Q_1, A_1), \dots, (Q_{t-1}, A_{t-1})\}$. At turn $t$, the system generates the utterance $A_t$ based on $Q_t$. We can sample a misordered target triple $T$ containing $(Q_t, A_t)$. Following the assumption that the sequential order in a meaningful and coherent dialogue should be unique, the SSN can easily detect the inconsistent order in $T$ if the generated $A_t$ is good. Otherwise, $A_t$ may be of low quality. Therefore, we take a two-step sampling approach to evaluate the generated utterance $A_t$ using SSN. First, a misordered target triple $T$ containing $(Q_t, A_t)$ is sampled. Then we further sample triple references $T'$ and $T''$ as in Section 3.2.1, and how easily the misorder in the sampled $T$ can be detected is measured as $\mathbb{E}_{T', T''}[p(T \mid T', T'')]$. Based on the generated utterance $A_t$, we can sample multiple misordered $T$, and we set the following expectation to measure the probability that $A_t$ is a good generated utterance:
$$p^*_{SSN} = \mathbb{E}_{T}\, \mathbb{E}_{T', T''}\big[p(T \mid T', T'')\big] \tag{5}$$
In this way, we can view human-generated utterances as good ones and machine-generated utterances as bad ones. Then we can use adversarial training methods (Goodfellow et al., 2014; Li et al., 2017; Xu et al., 2017; Su et al., 2018) to train the dialogue system, where SSN gives a clear order-based signal to guide the generator $G$ in the system. The framework of using SSN with two-step sampling in real dialogue systems is shown in Figure 2. The objective function can then be formulated as:
$$\min_{\theta_G} \max_{\theta_{SSN}} \; \mathbb{E}_{\text{real}}\big[\log p^*_{SSN}(x)\big] + \mathbb{E}_{\text{gen}}\big[\log\big(1 - p^*_{SSN}(G(\cdot))\big)\big] \tag{6}$$
where $\theta_G$ and $\theta_{SSN}$ are the parameters of the generator $G$ and SSN in the dialogue system, respectively.
Here $x$ stands for real human-generated utterances, while $G(\cdot)$ represents machine-generated ones. $G$ and SSN are alternately updated during training. We further describe the details in open-domain and task-oriented scenarios separately.
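The two-step sampling used to score a generated utterance can be sketched as a nested Monte Carlo average. In this sketch `ssn_prob` is a hypothetical callable standing in for a trained SSN, the sample counts `n_targets` and `m_refs` are arbitrary, and the choice of one ordered plus one misordered reference is just one of the combinations from Section 3.2.1. The history is assumed to hold at least three pairs.

```python
import random


def estimate_p_star(history, current_pair, ssn_prob, rng, n_targets=4, m_refs=4):
    """Monte Carlo estimate of how easily misorder around (Q_t, A_t) is detected.

    history:      list of previous (Q, A) pairs, at least three long
    current_pair: the pair (Q_t, A_t) whose generated A_t is being scored
    ssn_prob:     hypothetical callable (T, T', T'') -> probability of misorder
    """
    def positions(n, ordered):
        i, j, k = sorted(rng.sample(range(n), 3))
        return (i, j, k) if ordered else (i, k, j)

    total = 0.0
    for _ in range(n_targets):
        # Step 1: a misordered target triple (i, k, j) with k = t, so the
        # current pair sits in the middle instead of at the end.
        a, b = sorted(rng.sample(range(len(history)), 2))
        target = [history[a], current_pair, history[b]]
        for _ in range(m_refs):
            # Step 2: one ordered and one misordered reference from the history.
            ref1 = [history[n] for n in positions(len(history), True)]
            ref2 = [history[n] for n in positions(len(history), False)]
            total += ssn_prob(target, ref1, ref2)
    return total / (n_targets * m_refs)


if __name__ == "__main__":
    rng = random.Random(0)
    history = [(f"q{n}", f"a{n}") for n in range(1, 6)]
    dummy_ssn = lambda target, ref1, ref2: 0.7  # placeholder for a trained SSN
    print(estimate_p_star(history, ("q_t", "a_t"), dummy_ssn, rng))
```

In the open-domain setting this estimate would play the role of the reward for the generator, while in adversarial training it enters the objective in place of a conventional discriminator output.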

Open-Domain Dialogue Learning
The open-domain dialogue task is as follows: given a dialogue history consisting of a sequence of dialogue utterances $\{(Q_1, A_1), \dots, (Q_{t-1}, A_{t-1})\}$ and the current $Q_t$, the model needs to generate a response utterance $A_t$. We consider adversarial training (Li et al., 2017; Xu et al., 2017) for dialogue generation systems. Following previous approaches (Vinyals and Le, 2015; Serban et al., 2016; Luan et al., 2016; Li et al., 2017), we use the SEQ2SEQ model for response generation as the generator $G$. The SEQ2SEQ model first transforms the dialogue history into an embedding using an encoder recurrent network. Conditioned on the history embedding, a decoder recurrent network then computes the probability of tokens at each generation step of the response using a softmax function.
As for the discriminator $D$, in previous methods it directly takes the response utterance $A_t$, with or without the full dialogue history, and predicts whether it is human-generated (output: 1) or machine-generated (output: 0). The probability of being human-generated is used as the reward to update $G$ with the REINFORCE algorithm (Williams, 1992). For our SSN, the reward $R$ is set as $R = p^*_{SSN}$.
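As a reminder of how such a reward enters the generator update, the standard REINFORCE estimator (Williams, 1992) with the SSN-based reward can be written as below. This is the textbook form without a baseline, not necessarily the exact variant used in the experiments; $J$ and $H_t$ are shorthand introduced only for this sketch, with $J$ the expected reward and $H_t$ the dialogue history together with the current $Q_t$.

```latex
\nabla_{\theta_G} J(\theta_G)
  = \mathbb{E}_{A_t \sim G(\cdot \mid H_t)}
    \left[ R \, \nabla_{\theta_G} \log G(A_t \mid H_t) \right],
\qquad R = p^*_{SSN}
```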

Task-Oriented Dialogue Learning
Task-oriented dialogue, usually formulated as a reinforcement learning problem, aims to build a dialogue agent that interacts with real users and learns a policy to complete the slot-filling task (Jurafsky and Martin, 2014). Because real-user interaction is expensive and time-consuming, dialogue systems in this scenario are often trained with user simulators (Schatzmann et al., 2006; Li et al., 2016c). However, due to the complexity of real conversations and biases in the design of user simulators, the quality of simulated utterances is unstable. Su et al. (2018) propose an adversarial learning approach to differentiate simulated experience from real experience. Following the similar assumption that real-user interactions should be meaningful and coherent, we use our SSN instead of the conventional discriminator $D$ to select high-quality simulated utterances in task-oriented dialogue systems.
In this scenario, the generator $G$ is the world model, which produces simulated user experience, and the SSN focuses on scoring the simulated user experience $Q_t$ during the training process. Thus, instead of sampling and encoding utterance pairs $(Q_t, A_t)$, here we only use the user utterance $Q_t$ in SSN. The other parts of the SSN remain the same as in Section 3.2. Because the world model $G$ is updated using multi-task learning without the reward from the SSN, the objective function of the SSN in Equation 6 can be rewritten as follows for mini-batch training:
$$\max_{\theta_{SSN}} \; \frac{1}{b} \sum_{i=1}^{b} \Big[\log p^*_{SSN}(x^{(i)}) + \log\big(1 - p^*_{SSN}(G(\cdot)^{(i)})\big)\Big] \tag{7}$$
where $b$ represents the batch size.

Intrinsic Evaluation
Before we deploy the self-supervised network in real dialogue systems, we first test the reliability of the model architectures. We randomly choose 40K balanced ordered and misordered utterance pair triples from the OpenSubtitles (Tiedemann, 2009) dataset, and train the SSN to solve this 2-class classification. We sample another 1K balanced triples for testing. We also consider a baseline model, where the target triple is encoded by SSN and the previous dialogue history is encoded by a hierarchical LSTM. The concatenation of the two embeddings is used for the final prediction. Because our SSN is a sampling-based approach, we report the average prediction accuracy of 5 runs on the 2-class classification, as shown in Table 1.
From the results, we can observe that: (1) The conventional hierarchical LSTM is not suitable for this task, and this baseline shows only a marginal improvement over the strategy that considers the target triple alone without any history. The results also match previous findings (Li et al., 2017), which suggest that only the last two preceding utterances in the hierarchical network are semantically significant. (2) As for our SSN, it is safe to say that reference triples substantially help inconsistent order detection. This is not surprising because, by adding reference triples, the SSN obtains more information about the semantic context within the dialogue. In particular, with both ordered and misordered references, the SSN has the highest classification accuracy. This also shows that the sampling strategy 1*Ordered + 1*Misordered references is the most reliable structure for real dialogue systems. Thus, for the rest of the experiments, we directly use the SSN with the one ordered and one misordered reference strategy to achieve the best performance.

Open-Domain Dialogue Learning
Dataset. Following previous studies (Vinyals and Le, 2015; Li et al., 2017; Xu et al., 2017), we choose the widely-used OpenSubtitles (Tiedemann, 2009) dataset to evaluate different methods. The OpenSubtitles dataset contains movie scripts organized by characters, where we follow Li et al. (2016b) to retain subtitles containing 5-50 words.
Baselines. We consider the following two popular adversarial methods for dialogue learning as the baselines:
• REGS (Li et al., 2017): The discriminator D takes the full dialogue history by a hierarchical LSTM, and the Monte Carlo search is implemented to obtain rewards for every generation step to update the generator G.
• AEL (Xu et al., 2017): The discriminator D only encodes the currently generated utterance by a CNN model and the generator G is optimized using an approximate embedding layer.
Implementation Details. We follow most of the parameter settings in Li et al. (2017) and Xu et al. (2017) to make a fair comparison. For the generator model G, we adopt the same SEQ2SEQ model (Sutskever et al., 2014) with an attention mechanism (Bahdanau et al., 2015; Luong et al., 2015) for our approach and the baselines. We approximate the dialogue history for G using the concatenation of the two preceding utterances, following Li et al. (2017).
To train the generator G, we use the REINFORCE algorithm (Williams, 1992) to maximize the expected reward of generated utterances. We also implement the Monte Carlo search to give rewards for each generation step. To accelerate the sampling process, we use multiple GPUs to parallelize and distribute the jobs. The SSN is first pre-trained using sampled data from OpenSubtitles, and then iteratively updated during the min-max adversarial training process. The dimension of the utterance embeddings is 128. The hidden size is 256 for the utterance encoding bi-LSTM and 1024 for the triple reasoning bi-LSTM. The MLP has a single hidden layer of size 512.
Adversarial Evaluation. Here we use the adversarial success rate (AdverSuc), which is the fraction of instances where a generator G is capable of fooling the discriminator D, to evaluate different methods. Higher values of AdverSuc for a dialogue system usually indicate a better response generator. After training three (G, D) pairs using REGS, AEL and SSN, we sample 4K dialogue histories and use the three trained generators to generate response utterances. These machine-generated utterances are then fed into the three trained discriminators to see whether they are indistinguishable from human-generated ones. The cross evaluation of AdverSuc is shown in Table 2.
From the results, we can observe that: (1) Our trained generator achieves higher AdverSuc on all three discriminators, which shows that the generator in our approach can generate more human-like utterance responses. (2) The generators of the other two methods show a noticeable drop in AdverSuc when evaluated on our SSN-based discriminator. This demonstrates that our self-supervised policy for discriminating utterances is successful. (3) The REGS method, which encodes the full dialogue history, performs worse than AEL, which only considers the current utterances. We think this indicates that without an explicitly stated guiding signal, both the generator and the discriminator can struggle to find a good objective during training, even when the full history is encoded.
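The adversarial success rate described above is a simple fraction and can be sketched as follows. The decision threshold of 0.5 and the function name are assumptions for illustration; the text does not state how a discriminator output is turned into a "fooled" decision.

```python
def adversarial_success_rate(human_probs, threshold=0.5):
    """Fraction of generated responses that the discriminator takes for human.

    human_probs: discriminator outputs for machine-generated responses,
                 read as the probability of being human-generated.
    threshold:   assumed decision boundary (not stated in the text).
    """
    if not human_probs:
        raise ValueError("need at least one scored response")
    fooled = sum(1 for p in human_probs if p > threshold)
    return fooled / len(human_probs)


if __name__ == "__main__":
    # Toy scores for four generated responses.
    print(adversarial_success_rate([0.9, 0.2, 0.6, 0.4]))
```

For a cross evaluation, the same function would be applied to every (generator, discriminator) combination.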
Automatic Evaluation. For automatic evaluation, we use the two commonly accepted metrics distinct-1 and distinct-2. Distinct-1 and distinct-2, proposed by Li et al. (2016a), measure the degree of diversity by calculating the number of distinct unigrams and bigrams in the generated response utterances. The evaluation results are reported in Table 3. The results show that, based on the distinct-1 and distinct-2 metrics, the generator trained with our approach generates relatively more diverse responses. The results are attractive considering that we do not explicitly use a diversity-guided objective function during the training process. We think the reason is that diverse utterances make it easier to preserve the order information. In previous methods, the discriminator D only gives good or bad signals to the response generator G, and G has to figure out what an acceptable response is by itself. In contrast, our SSN explicitly forces G to generate responses that will have unique orders in dialogue, which leads to more diverse utterances.
Human Evaluation. For human evaluation, we follow the protocols in Li et al. (2016a) and employ crowd-sourced judges from Amazon Mechanical Turk to evaluate a random sample of 1000 unique generated utterances from the three generators on the OpenSubtitles test dataset. We present both the input dialogue history and the generated responses to 5 judges and ask them to decide which one of the three results is the best. Ties are not permitted. We consider both single-turn and multi-turn settings for the evaluation. The results are shown in Table 4. Evidently, the generator trained with our method shows a significant improvement in the quality of generated sentences. The gain is even higher in the multi-turn setting than in the single-turn setting. This is because, when only single-turn dialogue is considered, the information encoded by the three methods is similar.
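The distinct-1 and distinct-2 metrics from the automatic evaluation can be sketched as below. The sketch assumes whitespace tokenization and normalizes the number of distinct n-grams by the total number of n-grams, which is one common convention; the exact normalization and tokenization behind the reported numbers are not specified in this section.

```python
def distinct_n(responses, n):
    """Number of distinct n-grams divided by the total number of n-grams."""
    ngrams = []
    for response in responses:
        tokens = response.split()  # assumed whitespace tokenization
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


if __name__ == "__main__":
    responses = ["i do not know", "i do not care"]
    print(distinct_n(responses, 1))  # distinct-1
    print(distinct_n(responses, 2))  # distinct-2
```

Bland generators that repeat the same phrases score low on both metrics, which is why they are used as a proxy for diversity.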

Task-Oriented Dialogue Learning
Dataset. Following previous work (Peng et al., 2018; Su et al., 2018), we use the same Movie-Ticket Booking dataset collected from Amazon Mechanical Turk for evaluation. The dataset is manually labeled based on a schema defined by domain experts, consisting of 11 intents and 16 slots in the full domain setting. In total, the dataset has 280 annotated dialogues with an average length of approximately 11 turns. In this scenario, the goal of dialogue systems is to help the user complete the tasks through the conversation.
Baselines. We compare our SSN-based discriminator within the state-of-the-art task-oriented dialogue policy learning approach, Discriminative Deep Dyna-Q (D3Q) (Su et al., 2018). At each turn, the D3Q agent takes S planning steps interacting with the simulator and stores simulated user experiences based on the scoring of the discriminator. The simulated user experiences are generated by the world model, which can be viewed as the generator G in our case. We replace the conventional discriminator D of D3Q with our SSN.
Implementation Details. For a fair comparison, we keep most of the parameters in the D3Q algorithm the same as in Su et al. (2018). In the self-supervised network, the dimension of the utterance embeddings is 80. The hidden size is 128 for the utterance encoding bi-LSTM and 512 for the triple reasoning bi-LSTM. The MLP has a single hidden layer of size 128. We use the simulator² as in Li et al. (2016c) to generate user utterances, and the threshold interval is set to a range between 0.45 and 0.55.
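How a threshold interval might be used to select simulated experiences can be sketched as a simple filter. This is only one possible reading, assumed here for illustration: an experience is kept when its score falls inside the interval, i.e. when the scorer cannot clearly tell it apart from real experience. The function name and the data layout are hypothetical, and the actual selection rule in D3Q may differ.

```python
def keep_simulated(scored_experiences, low=0.45, high=0.55):
    """Keep simulated experiences whose score lies inside [low, high].

    scored_experiences: iterable of (experience, score) pairs, where score is
                        the output of the discriminator or SSN (assumed layout).
    """
    return [exp for exp, score in scored_experiences if low <= score <= high]


if __name__ == "__main__":
    batch = [("exp_a", 0.50), ("exp_b", 0.90), ("exp_c", 0.46), ("exp_d", 0.10)]
    print(keep_simulated(batch))
```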

Results
The experimental results of different agents at different training epochs are shown in Table 5.
From the results, we can observe that: (1) D3Q-SSN outperforms D3Q in most cases, which shows that our SSN-based discriminator improves the ability to recognize high-quality simulated user experiences.
(2) When the number of planning steps increases in D3Q, the performance shows an apparent drop. This is because the discriminator D in the original D3Q agent keeps many low-quality simulated user experiences, which significantly degrade the performance of the D3Q agent. With our SSN, we still see some performance improvement even when using 10-step planning. This indicates that our SSN is better at selecting good simulated user experiences, especially in multi-turn dialogue cases.

Conclusion
In this paper, we introduce a self-supervised task, inconsistent order detection, to explicitly capture the order signal of the dialogue. While previous methods suffer from the forgetfulness problem when modeling dialogue history, we further propose a sampling-based self-supervised network, SSN, to approximately encode the dialogue history and highlight the order signal. We also show how our SSN can contribute to real dialogue learning. Empirically, our method advances the previous state-of-the-art dialogue systems in both open-domain and task-oriented scenarios. Theoretically, we believe this self-supervision can be generalized to other types of temporal order in different NLP tasks.
Figure 1: The overview of our self-supervised network (SSN) for inconsistent order detection. Given a target triple containing the current utterance pair $(Q_t, A_t)$ to be predicted, (a) we first sample triple references from the previous dialogue history $\{(Q_1, A_1), \dots, (Q_{t-1}, A_{t-1})\}$ in each iteration. The references can be ordered or misordered. (b) Each triple is transformed into a triple embedding. The concatenation of the triple embeddings is fed into an MLP, which gives the probability based on the current sampling.

Figure 2: The general framework for dialogue learning with self-supervised network.

Table 1: The intrinsic evaluation results. The numbers in brackets stand for deviation. Refers: Reference Triples.

Table 2: The cross evaluation of adversarial success rate on different generators and discriminators. Please refer to Section 4.2 Adversarial Evaluation for explanations.

Table 3: The automatic evaluation of generated utterances on distinct-1 and distinct-2 metrics. Please refer to Section 4.2 Automatic Evaluation for explanations.

Table 4: The human evaluation of generated utterances in three methods. The result here is statistically significant with p < 0.01 according to sign test. Please refer to Section 4.2 Human Evaluation for explanations.

Table 5: The experimental results of different dialogue agents at training epoch = {100, 200, 300}. Each number is averaged over 3 runs, and each run is tested on 50 dialogues. D3Q-SSN denotes the D3Q agent where our proposed SSN replaces the discriminator. "Fixed θ_D/θ_SSN" indicates that the discriminator/SSN is pre-trained and fixed during the training process. Succ: Success Rate. Reward: Average Reward. Turns: Average Turns.