Incremental Learning from Scratch for Task-Oriented Dialogue Systems

Clarifying user needs is essential for existing task-oriented dialogue systems. However, in real-world applications, developers can never guarantee that all possible user demands are taken into account in the design phase. Consequently, existing systems will break down when encountering unconsidered user needs. To address this problem, we propose a novel incremental learning framework to design task-oriented dialogue systems, or for short Incremental Dialogue System (IDS), without pre-defining the exhaustive list of user needs. Specifically, we introduce an uncertainty estimation module to evaluate the confidence of giving correct responses. If there is high confidence, IDS will provide responses to users. Otherwise, humans will be involved in the dialogue process, and IDS can learn from human intervention through an online learning module. To evaluate our method, we propose a new dataset which simulates unanticipated user needs in the deployment stage. Experiments show that IDS is robust to unconsidered user actions, and can update itself online by smartly selecting only the most effective training data, and hence attains better performance with less annotation cost.


Introduction
Data-driven task-oriented dialogue systems have been a focal point in both academic and industry research recently. Generally, the first step of building a dialogue system is to clarify what users are allowed to do. Then developers can collect data to train dialogue models to support the defined capabilities. Such systems work well if all possible combinations of user inputs and conditions are considered in the training stage (Paek and Pieraccini, 2008; Wang et al., 2018). However, as shown in Fig. 1, if users have unanticipated needs, the system will give unreasonable responses.
Figure 1: An example of task-oriented dialogue system. The system is designed to guide users to find a suitable product. Thus, when encountering unconsidered user needs such as "how to update the operating system", the system will give unreasonable responses. Utterances shown in the figure: "Hi, I can help you find the most suitable product." / "What should I do to update the operating system?" / "Our products support Android and iOS. Which one do you prefer?"
This phenomenon is mainly caused by a biased understanding of real users. In fact, before system deployment, we do not know what the customers will request of the system. In general, this problem can be alleviated by more detailed user studies. But we can never guarantee that all user needs are considered in the system design. Besides, the user inputs are often diverse due to the complexity of natural language. Thus, it is impossible to collect enough training samples to cover all variants. Consequently, the system trained with biased data will not respond to user queries correctly in some cases, and these errors can only be discovered after the incident.
Since real user behaviors are elusive, it is clearly better to make no assumptions about user needs than to define them in advance.
To that end, we propose the novel Incremental Dialogue System (IDS). Different from the existing training-deployment convention, IDS does not make any assumptions about the user needs and how they express intentions. In this paradigm, all reasonable queries related to the current task are legal, and the system can learn to deal with user queries online.
Specifically, after the user sends a query to our system, we use an uncertainty estimation module to evaluate the confidence that the dialogue model can respond correctly. If there is high confidence, IDS will give its response to the user. Otherwise, a human will intervene and provide a reasonable answer. When humans are involved, they can select a response from the current response candidates or give a new response to the user. If a new answer is provided, we add it to the system response candidates. Then, the context-response pair generated by humans will be fed into the dialogue model to update the parameters by an online learning module. Through continuous interactions with users after deployment, the system will become more and more knowledgeable, and human intervention will become less and less needed.
To evaluate our method, we build a new dataset consisting of five sub-datasets (named SubD1, SubD2, SubD3, SubD4 and SubD5) within the context of customer services. Following the existing work (Bordes et al., 2016), our dataset is generated by complicated and elaborate rules. SubD1 supports the most limited dialogue scenarios. Then each later sub-dataset covers more scenarios than its previous one. To simulate the unanticipated user needs, we train the dialogue models on simpler datasets and test them on the harder ones. Extensive experiments show that IDS is robust to the unconsidered user actions and can learn dialogue knowledge online from scratch. Besides, compared with existing methods, our approach significantly reduces annotation cost.
In summary, our main contributions are threefold: (1) To the best of our knowledge, this is the first work to study the incremental learning framework for task-oriented dialogue systems. In this paradigm, developers do not need to define user needs in advance and can avoid laboriously collecting biased training data. (2) To achieve this goal, we introduce IDS, which is robust to new user actions and can extend itself online to accommodate new user needs. (3) We propose a new benchmark dataset to study the inconsistency of training and testing in task-oriented dialogue systems.

Background and Problem Definition
Existing work on data-driven task-oriented dialogue systems includes generation based methods (Wen et al., 2016; Eric and Manning, 2017) and retrieval based methods (Bordes et al., 2016; Williams et al., 2017; Li et al., 2017). In this paper, we focus on the retrieval based methods, because they always return fluent responses.
In a typical retrieval based system, a user gives an utterance $x_t$ to the system at the t-th turn. Let $(x_{t,1}, \ldots, x_{t,N})$ denote the tokens of $x_t$. Then, the system chooses an answer $y_t = (y_{t,1}, \ldots, y_{t,M})$ from the candidate response set $R$ based on the conditional distribution $p(y_t \mid C_t)$, where $C_t = (x_1, y_1, \ldots, x_{t-1}, y_{t-1}, x_t)$ is the dialogue context consisting of all user utterances and responses up to the current turn.
By convention, the dialogue system is designed to handle predefined user needs, and the users are expected to interact with the system based on a limited number of dialogue actions. However, predefining all user demands is impractical, and unexpected queries may be given to the system after the system is deployed. In this work, we mainly focus on handling this problem.

Incremental Dialogue System
As shown in Fig. 2, IDS consists of three main components: dialogue embedding module, uncertainty estimation module and online learning module. In the context of customer services, when the user sends an utterance to the system, the dialogue embedding module will encode the current context into a vector. Then, the uncertainty estimation module will evaluate the confidence of giving a correct response. If there is high confidence, IDS will give its response to the user. Otherwise, the hired customer service staff will be involved in the dialogue process and provide a reasonable answer, which gives us a new ground truth context-response pair. Based on the newly added context-response pairs, the system will be updated via the online learning module.

Dialogue Embedding
Given dialogue context $C_t$ in the t-th turn, we first embed each utterance in $C_t$ using bidirectional recurrent neural networks (bi-RNNs) based on the Gated Recurrent Unit (GRU) (Chung et al., 2014).
To better encode a sentence, we use the self-attention layer (Lin et al., 2017) to capture information from critical words. For each element $h_n$ in the bi-RNN outputs, we compute a scalar self-attention score as follows:
$$a_n = \mathrm{MLP}(h_n), \qquad p_n = \frac{\exp(a_n)}{\sum_{m=1}^{N} \exp(a_m)}$$
The final utterance representation $E(x)$ is the weighted sum of the bi-RNN outputs:
$$E(x) = \sum_{n=1}^{N} p_n h_n$$
After getting the encoding of each sentence in $C_t$, we input these sentence embeddings to another GRU-based RNN to obtain the context embedding $E(C_t)$ as follows:
$$E(C_t) = \mathrm{GRU}\big(E(x_1), E(y_1), \ldots, E(y_{t-1}), E(x_t)\big)$$
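The attention pooling described above can be illustrated with a small sketch. It is not the implementation used in our experiments: the function names, the toy hidden states and the scoring function (a stand-in for the learned scoring network) are all hypothetical, and plain Python lists are used instead of a deep learning library.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, score_fn):
    """Return the attention-weighted sum of hidden states and the weights."""
    weights = softmax([score_fn(h) for h in hidden_states])
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states))
              for d in range(dim)]
    return pooled, weights

# Hypothetical 2-dimensional bi-RNN outputs for a three-token utterance.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical scorer standing in for the learned scoring network.
pooled, weights = attention_pool(H, score_fn=lambda h: sum(h))
```

In this toy case the third token receives the largest weight because its score is the highest, so it dominates the utterance representation.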

Uncertainty Estimation
In the existing work (Williams et al., 2017; Bordes et al., 2016; Li et al., 2017), after getting the context representation, the dialogue system will give a response $y_t$ to the user based on $p(y_t \mid C_t)$. However, the dialogue system may give unreasonable responses if unexpected queries happen. Thus, we introduce the uncertainty estimation module to avoid such risks.
To estimate the uncertainty, we decompose the response selection process as follows:
$$p(y_t \mid C_t) = \int_z p(y_t \mid z, C_t)\, p(z \mid C_t)\, dz \tag{5}$$
As shown in Fig. 3(a), from the viewpoint of probabilistic graphical models (Koller and Friedman, 2009), the latent variable $z$ can be seen as an explanation of the dialogue process. In an abstract sense, given $C_t$, there is an infinite number of paths $z$ from $C_t$ to $y_t$, and $p(y_t \mid C_t)$ is an expectation of $p(y_t \mid z, C_t)$ over all possible paths. If the system has not seen enough instances similar to $C_t$ before, the encoding of $C_t$ will be located in an unexplored area of the dialogue embedding space. Thus, the entropy of the prior $p(z \mid C_t)$ will be large.
Based on such intuitive analysis, we design the uncertainty measurement for IDS. Specifically, we assume that the latent variable $z$ obeys a multivariate diagonal Gaussian distribution. Following the reparametrization trick (Kingma and Welling, 2014), we sample $\epsilon \sim \mathcal{N}(0, I)$ and reparameterize $z = \mu + \sigma \cdot \epsilon$. The mean and variance of the prior $p(z \mid C_t)$ can be calculated as:
$$\begin{bmatrix} \mu \\ \log(\sigma^2) \end{bmatrix} = \mathrm{MLP}(E(C_t))$$
After sampling a latent variable $z$ from the prior $p(z \mid C_t)$, we calculate the response probability for each element in the current candidate response set $R$. In IDS, $R$ will be extended dynamically. Thus, we address the response selection process with a ranking approach. For each response candidate, we calculate the score as follows:
$$p(y_t \mid z, C_t) = \mathrm{softmax}\big((z \oplus E(C_t))^{\top} W E(y_t)\big)$$
where $E(y_t)$ is the encoding of $y_t \in R$, $\oplus$ denotes vector concatenation, and $W$ is a weight matrix.
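A minimal sketch of the sampling and ranking step follows. It assumes that the prior network outputs a mean and a log-variance, and that candidates are ranked by a bilinear score between the concatenation of $z$ with $E(C_t)$ and the candidate encoding $E(y_t)$; all names and numeric values are hypothetical and chosen only for illustration.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_latent(mu, log_var, rng):
    """Reparameterization: z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def score_candidates(z, context_vec, W, candidate_vecs):
    """Softmax over bilinear scores (z concat E(C))^T W E(y), one per candidate."""
    query = list(z) + list(context_vec)
    projected = [sum(q * W[i][j] for i, q in enumerate(query))
                 for j in range(len(W[0]))]
    logits = [sum(p * e for p, e in zip(projected, cand))
              for cand in candidate_vecs]
    return softmax(logits)

rng = random.Random(0)
mu, log_var = [0.5, -0.5], [-2.0, -2.0]                 # hypothetical prior parameters
context_vec = [1.0, 0.0]                                # hypothetical E(C_t)
W = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.0], [0.2, 0.1]]   # hypothetical 4x2 weight matrix
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]       # hypothetical E(y) for each y in R
z = sample_latent(mu, log_var, rng)
probs = score_candidates(z, context_vec, W, candidates)
```

Because the candidate set is only iterated over, a newly added response can be scored without changing the shape of any parameter, which is the reason for ranking rather than classifying over a fixed label set.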
To estimate the variance of $p(y_t \mid z, C_t)$ under different sampled latent variables, we repeat the above process $K$ times. Assume that the probability distribution over the candidate response set in the k-th repetition is $P_k$ and the average response probability distribution over the $K$ samples is $P_{avg}$. We use the Jensen-Shannon divergence (JSD), which is built from the Kullback-Leibler divergence $\mathrm{KL}(P \| Q)$ between two probability distributions, to measure the distance $\mathrm{JSD}(P_k \| P_{avg})$ between $P_k$ and $P_{avg}$. Then, we get the average JSD as follows:
$$\mathrm{JSD}_{avg} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{JSD}(P_k \,\|\, P_{avg})$$
Because the average JSD can be used to measure the degree of divergence of $\{P_1, P_2, \ldots, P_K\}$, as shown in Fig. 4(a), the system will refuse to respond if $\mathrm{JSD}_{avg}$ is higher than a threshold $\tau_1$. However, the dialogue model tends to give close weights to all response candidates in the early stage of training, as shown in Fig. 4(b). This results in a small average JSD even though the system should refuse to respond. Thus, we add an additional criterion for the uncertainty measurement. Specifically, if the maximum probability in $P_{avg}$ is lower than a threshold $\tau_2$, the system will refuse to respond.
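The two refusal criteria can be sketched as follows. This is an illustration under stated assumptions rather than our implementation: it uses the standard mixture-based definition of the Jensen-Shannon divergence, and the function names and the threshold values in the example calls are hypothetical.

```python
import math

def kl(p, q):
    """KL(P || Q) in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def jsd(p, q):
    """Jensen-Shannon divergence via the mixture M = (P + Q) / 2 (assumed form)."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def should_refuse(samples, tau1, tau2):
    """samples: K probability distributions over the candidate set R."""
    k = len(samples)
    p_avg = [sum(column) / k for column in zip(*samples)]
    jsd_avg = sum(jsd(p_k, p_avg) for p_k in samples) / k
    # Refuse if the samples disagree, or if no candidate clearly stands out.
    refuse = jsd_avg > tau1 or max(p_avg) < tau2
    return refuse, jsd_avg, p_avg

# Samples agree on one candidate: respond.
confident = should_refuse([[0.9, 0.05, 0.05]] * 3, tau1=0.1, tau2=0.34)
# Samples agree but are near uniform: refuse by the second criterion.
flat = should_refuse([[1 / 3, 1 / 3, 1 / 3]] * 3, tau1=0.1, tau2=0.34)
# Samples disagree strongly: refuse by the first criterion.
split = should_refuse([[0.95, 0.05], [0.05, 0.95]], tau1=0.1, tau2=0.34)
```

The near-uniform case shows why the second criterion is needed: its average JSD is zero, so the first criterion alone would let the system answer.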

Online Learning
If the confidence is high enough, IDS will give the response with the maximum score in $P_{avg}$ to the user. Otherwise, the hired customer service staff will be asked to select an appropriate response from the top $T$ response candidates of $P_{avg}$ or propose a new response if there is no appropriate candidate. If a new response is proposed, it will be added to $R$. We denote the human response as $r_t$. Then, we can observe a new context-response pair $d_t = (C_t, r_t)$ and add it to the training data pool.
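The decision flow of a single turn can be sketched as below. The function and parameter names are hypothetical, and the human annotator is represented by a callback; the sketch only mirrors the steps described above and leaves out the parameter update itself.

```python
def handle_turn(context, candidates, p_avg, refuse, ask_human, data_pool, top_t=3):
    """Answer automatically when confident; otherwise defer to a human.

    candidates is the response set R and may grow in place.
    ask_human(context, shortlist) returns the response chosen or written by staff.
    """
    if not refuse:
        best = max(range(len(candidates)), key=lambda i: p_avg[i])
        return candidates[best]
    ranked = sorted(range(len(candidates)), key=lambda i: p_avg[i], reverse=True)
    shortlist = [candidates[i] for i in ranked[:top_t]]
    response = ask_human(context, shortlist)
    if response not in candidates:
        candidates.append(response)          # extend R with the new response
    data_pool.append((context, response))    # new pair d_t = (C_t, r_t)
    return response

R = ["Which product do you mean?", "It supports Android."]
pool = []
auto = handle_turn("ctx-1", R, [0.1, 0.9], False, None, pool)
human = handle_turn("ctx-2", R, [0.5, 0.5], True,
                    lambda context, shortlist: "Please restart the router.", pool)
```

After the second call the candidate set contains three responses and the data pool holds one new context-response pair.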
The optimization objective is to maximize the likelihood of the newly added data $d_t$. However, as shown in Eq. 5, calculating the likelihood requires an intractable marginalization over the latent variable $z$. Fortunately, we can obtain its lower bound (Hoffman et al., 2013; Miao et al., 2016; Sohn et al., 2015) as follows:
$$\mathcal{L} = \mathbb{E}_{q(z \mid d_t)}\big[\log p(r_t \mid z, C_t)\big] - \mathrm{KL}\big(q(z \mid d_t) \,\|\, p(z \mid C_t)\big) \le \log \int_z p(r_t \mid z, C_t)\, p(z \mid C_t)\, dz = \log p(r_t \mid C_t)$$
where $\mathcal{L}$ is called the evidence lower bound (ELBO) and $q(z \mid d_t)$ is called the inference network. The learning process of the inference network is shown in Fig. 3(b). Similar to the prior network $p(z \mid C_t)$, the inference network $q(z \mid d_t)$ approximates the mean and variance of the posterior $p(z \mid d_t)$ as follows:
$$\begin{bmatrix} \mu' \\ \log(\sigma'^2) \end{bmatrix} = \mathrm{MLP}\big(E(C_t), E(r_t)\big)$$
where $E(C_t)$ and $E(r_t)$ denote the representations of the dialogue context and the human response in the current turn, respectively. We use the reparametrization trick to sample $z$ from the inference network and maximize the ELBO by gradient ascent on a Monte Carlo approximation of the expectation.
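Since both the prior and the inference network are diagonal Gaussians, the KL term of the ELBO has a closed form, and only the expectation needs Monte Carlo sampling. The sketch below illustrates this under the assumption that both networks output a mean and a log-variance; the function names are hypothetical and the log-likelihood values are placeholders.

```python
import math

def gaussian_kl(mu_q, log_var_q, mu_p, log_var_p):
    """KL(q || p) between diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, log_var_q, mu_p, log_var_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total

def elbo_estimate(log_likelihoods, kl_term):
    """Monte Carlo ELBO: mean of log p(r_t | z, C_t) over sampled z, minus the KL term."""
    return sum(log_likelihoods) / len(log_likelihoods) - kl_term

# Hypothetical posterior N(1, 1) against prior N(0, 1) in one dimension.
kl_value = gaussian_kl([1.0], [0.0], [0.0], [0.0])
# Placeholder log-likelihoods of the human response under two sampled z.
bound = elbo_estimate([-1.0, -3.0], kl_value)
```

Gradient ascent on this estimate would be carried out by an automatic differentiation library in practice; the sketch only shows how the two terms combine.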
It is worth noting that tricks such as mixing $d_t$ with the instances in the data pool and updating IDS for a small number of epochs (Shen et al., 2017) can be easily adopted to increase the utilization of labeled data. However, in our experiments, we find there is still a great improvement without these tricks. To reduce the computation load, we update IDS with each $d_t$ only once in a stream-based fashion and leave these tricks to future work.

Construction of Experimental Data
To simulate the new unconsidered user needs, one possible method is to delete some question types in the training set of existing datasets (e.g., bAbI tasks (Bordes et al., 2016)) and test these questions in the testing phase. However, the dialogue context plays an important role in the response selection. Simply deleting some turns of a dialogue will result in a different system response. For example, in bAbI Task5, deleting those turns on updating api calls will result in a different recommended restaurant. Thus, we do not modify existing datasets but construct a new benchmark dataset to study the inconsistency of training and testing in task-oriented dialogue systems.
We build this dataset based on the following two principles. First of all, we ensure all interactions are reasonable. To achieve that, we follow the construction process of existing work (Bordes et al., 2016) and generate the dataset by complicated and elaborate rules. Second, the dataset should contain several subsets and the dialogue scenarios covered in each subset are incremental. To simulate the new unconsidered user needs, we train the dialogue system on a smaller subset and test it on a more complicated one.
Specifically, our dataset contains five different subsets within the context of customer services. From SubD1 to SubD5, the user needs become richer in each subset, as described below.
SubD1 includes basic scenarios of the customer services in which users can achieve two primary goals. First, users can look up a product or query some attributes of the products they are interested in. For example, they can ask "Is $entity 5$ still on sale?" to ask about the discount information of $entity 5$. Second, after finding the desired product, users can consult the system about the purchase process and delivery information.
SubD2 contains all scenarios in SubD1. Besides, users can confirm whether a product meets some additional conditions. For example, they can ask "Does $entity 9$ support Android?" to verify the operating system requirement.
SubD3 contains all scenarios in SubD2. In addition, users can compare two different items. For example, they can ask "Is $entity 5$ cheaper than $entity 9$?" to compare the prices of $entity 5$ and $entity 9$.
SubD4 contains all scenarios in SubD3, and there are more user needs related to the after-sale service. For example, users can consult on how to deal with network failure and system breakdown.
SubD5 contains all scenarios in SubD4. Furthermore, users can give emotional utterances. For example, if users think our product is very cheap, they may say "Oh, it's cheap and high-quality. I like it!". The dialogue system is expected to reply emotionally, such as "Thank you for your approval.". If the user utterance contains both emotional and task-oriented factors, the system should consider both. For example, if users say "I cannot stand the old operating system, what should I do to update it?", the dialogue system should respond "I'm so sorry to give you trouble, please refer to this: $api call update system$.".
It is worth noting that it often requires multiple turns of interaction to complete a task. For example, a user may want to compare the prices of $entity 5$ and $entity 9$ but not explicitly give the two items in a single turn. To complete the missing information, the system should ask which two products the user wants to compare. Besides, the context plays an important role in the dialogue. For example, if users keep asking about the same product many times consecutively, they can use subject ellipsis to query this item in the current turn, and the system will not ask users which product they are talking about. In addition, taking into account the diversity of natural language, we design multiple templates to express the same intention. The paraphrasing of queries makes our dataset more diverse. For each sub-dataset, there are 20,000 dialogues for training and 5,000 dialogues for testing. A dialogue example in SubD5 and detailed data statistics are provided in Appendix A.

Data Preprocessing
It is possible for the dialogue model to retrieve responses directly without any preprocessing. However, the fact that nearly all utterances contain entity information would lead to slow model convergence. Thus, to normalize utterances, we replace each entity with the order in which it first appears in the dialogue. For example, if $entity 9$ is the second distinct entity which appears in a dialogue, we rename it to $entity order 2$ in the current episode. After the preprocessing, the number of normalized response candidates on both the training and test sets in each sub-dataset is shown in Table 1.
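The entity normalization can be sketched with a regular expression. The token format is copied from the examples as they appear in this text ("$entity 9$" becomes "$entity order 2$"); the actual surface form in the dataset may differ, and the function name is hypothetical.

```python
import re

# Matches entity tokens as they are written in this text, e.g. "$entity 9$".
ENTITY = re.compile(r"\$entity (\d+)\$")

def normalize_dialogue(utterances):
    """Rename each entity by the order of its first appearance in the dialogue."""
    order = {}

    def rename(match):
        entity = match.group(0)
        if entity not in order:
            order[entity] = len(order) + 1
        return f"$entity order {order[entity]}$"

    return [ENTITY.sub(rename, utterance) for utterance in utterances]

dialogue = [
    "Is $entity 5$ cheaper than $entity 9$?",
    "Does $entity 9$ support Android?",
]
normalized = normalize_dialogue(dialogue)
```

The mapping is rebuilt for every dialogue, so the same product can receive different order indices in different episodes.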

Baselines
We compare IDS with several baselines:
• IR: the basic tf-idf match model used in (Bordes et al., 2016; Dodge et al., 2015).
• IDS⁻: IDS without updating model parameters during testing. That is, IDS⁻ is trained only with human intervention data on the training set, and then we freeze the parameters.

Measurements
Following the work of Williams et al. (2017) and Bordes et al. (2016), we report the average turn accuracy. A turn is correct if the dialogue model selects the correct response, and incorrect otherwise. Because IDS requires human intervention to reduce risks whenever there is low confidence, we calculate the average turn accuracy only over the turns in which IDS chooses to respond without human intervention.
That is, compared with baselines, IDS computes the turn accuracy only on a subset of the test set. To be fair, we also report the rate at which IDS refuses to respond on the test set. The lower the rejection rate is, the better the model performs.
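The two measurements can be computed as in the following sketch. The function name and the toy turn records are hypothetical; each record states whether the system refused to respond and, if it answered, whether the selected response was correct.

```python
def turn_metrics(turns):
    """turns: list of (refused, correct) booleans, one pair per test turn.

    Accuracy is computed only over answered turns; it is None if every
    turn was refused.
    """
    answered = [correct for refused, correct in turns if not refused]
    rejection_rate = 1.0 - len(answered) / len(turns)
    accuracy = sum(answered) / len(answered) if answered else None
    return accuracy, rejection_rate

turns = [(False, True), (False, False), (True, False), (False, True)]
accuracy, rejection_rate = turn_metrics(turns)
```

Reporting both numbers together prevents a model from obtaining a high accuracy simply by refusing most turns.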

Implementation Details
Our word embeddings are randomly initialized. The dimensions of word embeddings and GRU hidden units are both 32. The size of the latent variable $z$ is 20. In uncertainty estimation, the number of repetitions $K$ is 50. In all experiments, the average JSD threshold $\tau_1$ and the response probability threshold $\tau_2$ are both set to 0.34. In online learning, the number of Monte Carlo samples is 50. In all experiments, we use the ADAM optimizer (Kingma and Ba, 2014) with a learning rate of 0.001. We train all models in mini-batches of size 32.

Experimental Results

Robustness to Unconsidered User Actions
To simulate unexpected user behaviors after deployment, we use the hardest test set, SubD5, as the common test set, but train all models on each of the simpler datasets (SubD1-SubD4) individually. The average turn accuracy is shown in Table 2. When trained on SubD1 to SubD4 and tested on SubD5, the existing methods are prone to poor performance because these models are not aware of which instances they can handle. However, equipped with the uncertainty estimation module, IDS⁻ can refuse to respond to uncertain instances and hence achieves better performance. For example, when trained on SubD1 and tested on SubD5, IDS⁻ achieves 78.6% turn accuracy while baselines achieve only 50.5% turn accuracy at most. Moreover, if the model is updated with human intervention data during testing, IDS attains nearly perfect accuracy in all settings.
Due to the uncertainty estimation module, IDS⁻ and IDS will refuse to respond if there is low confidence. Their rejection rates are shown in Table 3. The rejection rate will drop if the training set is similar to the test set. Unfortunately, the rejection rate of IDS is much higher than that of IDS⁻. We conjecture that the reason is catastrophic forgetting (French, 1999; Kirkpatrick et al., 2017). When IDS learns to handle new user needs in SubD5, the knowledge learnt in the training phase will be somewhat lost. Thus, IDS needs more human intervention to re-learn the forgotten knowledge. However, forgetting will not occur if IDS is deployed from scratch and accumulates knowledge online, because the weights of IDS are then optimized alternately on all possible user needs.

Deploying without Initialization
Compared with existing methods, IDS can accumulate knowledge online from scratch. The uncertainty estimation module will guide us to label only valuable data. This is similar to active learning (Balcan et al., 2009; Dasgupta et al., 2005).
To verify this, we train baselines on each of the SubDi training sets with one epoch of back propagation and test these models on the corresponding SubDi test set. (In the online learning process of IDS⁻, each labeled example in the data pool is used only once. For the sake of fairness, we train baselines with only one epoch in this section.) In contrast, for each SubDi training set, IDS⁻ is trained from random initialization. Whenever IDS⁻ refuses to respond, the current context-response pair in the training set will be used to update the model until all training data in SubDi are finished. Hence IDS⁻ is trained on the subset of SubDi where the response confidence is below the threshold. After the training is finished, we freeze the model parameters and test IDS⁻ on the test set of SubDi.
Table 4 shows the average turn accuracy of different models. Table 5 shows the rejection rate of IDS⁻ on each SubDi training set. We see that, compared with all baselines, IDS⁻ achieves better performance with much less training data. This shows the uncertainty estimation module can select the most valuable data to label online.
Table 6 shows the rejection rate of IDS⁻ on each SubDi test set. We can see that the rejection rate is negligible on SubD1, SubD2 and SubD3. It means IDS⁻ can converge to a low rejection rate after deployment. For SubD4 and SubD5, there are still some instances IDS⁻ cannot handle. This is due to the fact that SubD4 and SubD5 are much more complicated than the others. In the next section, we further show that as online learning continues, the rejection rate will continue to drop as well.
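The stream-based training procedure of this section can be sketched as a single pass over the training set in which only refused turns are labeled and used for an update. The function name is hypothetical, and the toy model below, which is confident only on contexts it has already been updated with, merely stands in for the dialogue model and its uncertainty estimation module.

```python
def train_on_stream(stream, refuses, update):
    """Update the model only on turns where it refuses to respond.

    Returns the fraction of the stream that required a human label.
    """
    labeled = 0
    for context, response in stream:
        if refuses(context):
            update(context, response)
            labeled += 1
    return labeled / len(stream)

# Toy stand-in for the dialogue model (hypothetical, for illustration only).
memory = {}
stream = [("c1", "r1"), ("c2", "r2"), ("c1", "r1"), ("c1", "r1")]
rate = train_on_stream(
    stream,
    refuses=lambda context: context not in memory,
    update=lambda context, response: memory.update({context: response}),
)
```

The returned fraction corresponds to the rejection rate on the training set, that is, the share of the data that had to be annotated.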

Frequency of Human Intervention
The main difference between our approach and others is that we introduce humans in the system loop. Therefore, we are interested in the question of how frequently humans intervene over time. The human intervention frequency curves of deploying IDS⁻ without any initialization (i.e., the online learning stage of IDS⁻ described in the previous section) are shown in Fig. 5. As shown, the frequency of human intervention in a batch decreases with time. In the early stage of deployment, IDS⁻ has a large degree of uncertainty because there are only a few context-response pairs in the data pool. Through continuous interactions with users, the labeled data in the data pool become more and more abundant. Thus, humans are not required to intervene frequently.
Besides, the human intervention curves of different datasets have different convergence rates. The curve of SubD1 has the fastest convergence rate. As the dataset covers more and more user needs, the convergence rate becomes slower. However, there is still a trend to converge for SubD4 and SubD5 as long as we continue the online learning. This phenomenon is in line with the intuition that a more complicated dialogue system requires more training data than a simple one.
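The per-batch intervention frequency plotted in such curves can be computed as in the sketch below, assuming that a log of refusal decisions is kept in the order in which turns arrive. The function name and the toy log are hypothetical.

```python
def intervention_frequency(refused_flags, batch_size):
    """Fraction of turns needing human intervention in each consecutive batch."""
    frequencies = []
    for start in range(0, len(refused_flags), batch_size):
        batch = refused_flags[start:start + batch_size]
        frequencies.append(sum(batch) / len(batch))
    return frequencies

# Toy log: frequent interventions early, none later.
log = [True, True, False, True, False, False, False, False]
curve = intervention_frequency(log, batch_size=4)
```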

Visual Analysis of Context Embedding
To better understand the behavior of our approach, we train IDS⁻ on the SubD5 training set until 2,000 batches of online updates are finished, and then freeze the model parameters and test it on the SubD5 test set. As Table 1 shows, there are 137 unique normalized responses. Among these responses, we pick four and draw their context embedding vectors. Each vector is reduced to a 2-dimensional vector via t-SNE (Maaten and Hinton, 2008) for visualization, with one sub-graph per response in Fig. 6. In each figure, the red dots are contexts answered by IDS⁻ with high confidence, while the blue dots are contexts answered by humans because the confidence is low. These graphs show a clear separation of sure vs. unsure contexts. Some blue dots are far away from the red ones. Humans should pay attention to these contexts to avoid risks. Besides, there are only a small number of cases in which the two classes are mingled. We conjecture that these cases are located at the confidence boundary. In addition, there are multiple clusters in each class. This is due to the fact that the same system response can appear in different dialogue scenes. For example, "the system requesting user's phone number" appears in scenes of both exchanging and returning goods. Although these contexts have the same response, their representations should be different if they belong to different dialogue scenes.

Related Work
Task-oriented dialogue systems have attracted numerous research efforts. Data-driven methods, such as reinforcement learning (Williams et al., 2017; Zhao and Eskenazi, 2016; Li et al., 2017) and supervised learning (Wen et al., 2016; Eric and Manning, 2017; Bordes et al., 2016), have been applied to optimize dialogue systems automatically. These advances in task-oriented dialogue systems have resulted in impressive gains in performance. However, prior work has mainly focused on building task-oriented dialogue systems in a closed environment. Due to the biased assumptions of real users, such systems will break down when encountering unconsidered situations.
Several approaches have been adopted to address this problem. Gašić et al. (2014) explicitly defined kernel functions between belief states from different domains to extend the domain of dialogue systems. But it is difficult to define an appropriate kernel function when the ontology has changed drastically. Shah et al. (2016) proposed to integrate turn-level and task-level reward signals to learn how to handle new user intents. Lipton et al. (2018) proposed to use BBQ-Networks to extend the domain. However, Shah et al. (2016) and Lipton et al. (2018) have reserved a few bits in the dialogue state for the domain extension. To relax this assumption, Wang et al. (2018) proposed the teacher-student framework to maintain dialogue systems. In their work, the dialogue system can only be extended offline after finding errors, and it requires hand-crafted rules to handle new user actions. In contrast, we can extend the system online in an incremental way with the help of hired customer service staff.
Our proposed method is inspired by cumulative learning (Fei et al., 2016), which is a form of lifelong machine learning (Chen and Liu, 2016). This learning paradigm aims to build a system that learns cumulatively. The major challenges of cumulative learning are finding unseen classes in the test set and updating the system efficiently to accommodate new concepts (Fei et al., 2016). To find new concepts, the heuristic uncertainty estimation methods (Tong and Koller, 2001; Culotta and McCallum, 2005) in active learning (Balcan et al., 2009; Dasgupta et al., 2005) can be adopted. When learning new concepts, the cumulative learning system should avoid retraining the whole system and catastrophic forgetting (French, 1999; Kirkpatrick et al., 2017). But catastrophic forgetting does not happen if the dialogue system is trained with all possible user needs alternately from scratch.
The uncertainty estimation and online learning methods in our work are inspired by the variational inference approach (Rezende et al., 2014; Kingma and Welling, 2014). In the existing work, this approach was used to generate diverse machine responses in both open domain dialogue systems (Zhao et al., 2017; Serban et al., 2016) and task-oriented dialogue systems (Wen et al., 2017). In contrast, our work makes use of the Bayesian nature of variational inference to estimate the uncertainty and learn from humans. Specifically, we sample variables from the prior network as the random perturbation to estimate the model uncertainty following the idea of Query-By-Committee (Seung et al., 1992) and optimize model parameters by maximizing the ELBO.

Conclusion
This paper presents a novel incremental learning framework to design dialogue systems, which we call IDS. In this paradigm, users are not expected to follow any definition, and IDS has the potential to handle new situations. To simulate new user actions after deployment, we propose a new dataset consisting of five different subsets. Experiments show that IDS is robust to new user actions. Importantly, with humans in the loop, IDS requires no data for initialization and can update itself online by selecting the most valuable data. As the usage grows, IDS will accumulate more and more knowledge over time.

Figure 2: An overview of the proposed IDS.

Figure 3: Graphical models of (a) response selection, and (b) online learning. The gray and white nodes represent the observed and latent variables respectively.

Figure 4: A toy example to show the uncertainty estimation criteria. (a) shows a large variance in the response probability under different sampled latent variables. (b) shows close weights given to all response candidates in the early stage of online learning.

Figure 5: The intervention frequency curves after deploying IDS⁻ without any initialization.

Figure 6: t-SNE visualization on the context representations of four different system responses. Red dots are contexts responded by IDS⁻ with high confidence, while blue dots are contexts with low confidence.
If we sample the latent variable $z$ based on $p(z \mid C_t)$ multiple times and calculate $p(y_t \mid z, C_t)$, we find that $p(y_t \mid z, C_t)$ has a large variance under different sampled latent variables $z$.

Table 1: The number of normalized response candidates in each sub-dataset after entity replacement, both training and test data included.

Table 2: The average turn accuracy of different models. Models are trained on SubD1-SubD4 respectively, but all tested on SubD5. Note that, unlike the existing methods, IDS⁻ and IDS give responses only if there is a high degree of confidence.

Table 3: The rejection rate on the test set of SubD5.

Table 4: The average turn accuracy of different systems on the SubDi test set. Note that each baseline is trained on the entire SubDi training set, but IDS⁻ is trained only on the low-confidence subset of the SubDi training set. The parameters of all systems are frozen during testing.

Table 5: The rejection rate of IDS⁻ on the SubDi training set.

Table 6: The rejection rate of IDS⁻ on the SubDi test set.