A Teacher-Student Framework for Maintainable Dialog Manager

Reinforcement learning (RL) is an attractive solution for task-oriented dialog systems. However, extending RL-based systems to handle new intents and slots requires a system redesign. The high maintenance cost makes it difficult to apply RL methods to practical systems on a large scale. To address this issue, we propose a practical teacher-student framework to extend RL-based dialog systems without retraining from scratch. Specifically, the “student” is an extended dialog manager based on a new ontology, and the “teacher” is existing resources used for guiding the learning process of the “student”. By specifying constraints held in the new dialog manager, we transfer knowledge of the “teacher” to the “student” without additional resources. Experiments show that the performance of the extended system is comparable to the system trained from scratch. More importantly, the proposed framework makes no assumption about the unsupported intents and slots, which makes it possible to improve RL-based systems incrementally.


Introduction
With the flourishing development of virtual personal assistants (e.g., Amazon Alexa and Google Assistant), task-oriented dialog systems, which help users accomplish tasks naturally, have become a focal point in both academic and industry research. In early work, a task-oriented dialog system was merely a set of hand-crafted mapping rules defined by experts. This is referred to as a rule-based system. Although rule-based systems often have acceptable performance, they are inconvenient and difficult to optimize. Recently, reinforcement learning approaches have been applied to optimize dialog systems through interaction with a user simulator or with real users online (Gašić et al., 2011; Su et al., 2016a; Li et al., 2016, 2017b). It has been shown that RL-based dialog systems can abandon the hand-crafted dialog manager and achieve more robust performance than rule-based systems (Young et al., 2013).
Figure 1: An example of a task-oriented dialog after the system comes online. The user is confused because the "confirm" intent has not been considered in the deployed system. Dialog rules should be embedded in a new system to handle such situations.
Typically, the first step of building RL-based dialog systems is defining a user model and the necessary system actions to complete a specific task (e.g., seeking restaurant information or booking hotels). Based on such an ontology, developers can extract dialog features and train the dialog manager model in an interaction environment. Such systems work well if real users are consistent with the predefined user model. However, as shown in Fig. 1, unanticipated actions of real users lead to a poor user experience.
In this situation, the original system should be extended to support new user actions based on user feedback. However, adding new intents or slots changes the predefined ontology. As a consequence, developers need to extract additional dialog features based on the new ontology. Besides, new system actions may be added to deal with new user actions. The network architecture of the new system then differs from that of the original one, so the new system cannot inherit the parameters of the old one directly and the original dialog manager model becomes invalid. Therefore, developers have to retrain the new system from scratch by interacting with users. Though there are many methods to train an RL-based dialog manager efficiently (Su et al., 2016a, 2017; Lipton et al., 2017; Chen et al., 2017), unmaintainable RL-based dialog systems will still be put on the shelf in real-world applications (Paek and Pieraccini, 2008; Paek, 2006).
To alleviate this problem, we propose a teacher-student framework to maintain the RL-based dialog manager without training from scratch. The idea is to transfer the knowledge of existing resources to a new dialog manager.
Specifically, after the system is deployed, if developers find that some intents and slots were missed, they can define a few simple dialog rules to handle such situations. For example, under the condition shown in Fig. 1, a reasonable strategy is to inform the user of the location of this restaurant. We then encode the information of such hand-crafted logic rules into the new dialog manager model. Meanwhile, user logs and the dialog policy of the original system can guide the new system to complete tasks like the original one. Under the guidance of the "teacher" (logic rules, user logs, and original policy), we can build an extended dialog manager (the "student") without a new interaction environment.
We conduct a series of experiments with simulated and real users in the restaurant domain. The experiments demonstrate that our method can overcome the problem caused by unpredictable user behavior after deployment. Owing to the reuse of existing resources, our framework saves the time needed to design new interaction environments and retrain RL-based systems from scratch. More importantly, our method does not make any assumptions about the unsupported intents and slots, so the system can be incrementally extended once developers find new intents and slots that were not taken into account before. As far as we know, we are the first to systematically discuss the maintainability of deep reinforcement learning based dialog systems.

Related Work
Dialog Manager The dialog manager of task-oriented dialog systems, which consists of a state tracker and a dialog policy module, controls the dialog flow. Recently, deep reinforcement learning (Mnih et al., 2013, 2015) has been applied to optimize the dialog manager in an "end-to-end" way, including deep Q-Networks (Lipton et al., 2017; Li et al., 2017b; Peng et al., 2017; Zhao and Eskenazi, 2016) and policy gradient methods (Williams et al., 2017; Su et al., 2016b; Dhingra et al., 2017). RL methods have shown great potential in building a robust dialog system automatically. However, RL-based approaches are rarely used in real-world applications because of the maintainability problem (Paek and Pieraccini, 2008; Paek, 2006). To extend the domain of dialog systems, Gašić et al. (2014) explicitly defined kernel functions between the belief states that come from different domains. However, defining an appropriate kernel function is nontrivial when the ontology has changed drastically. Shah et al. (2016) proposed to integrate turn-level feedback with a task-level reward signal to learn how to handle new user intents. This approach alleviates the problem that arises from the difference between training and deployment phases, but it still fails when the developers have not considered all user actions in advance. Lipton et al. (2017) proposed to use BBQ-Networks to extend the domain. However, similar to Shah et al. (2016), the BBQ-Networks reserve a few bits in the feature vector for new intents and slots, and system actions for handling new user actions are already considered in the original system design. This assumption is not practical enough. Compared to the existing domain extension methods, our work addresses a more practical problem: new intents and slots are unknown to the original system. If we need to extend the dialog system, we should design a new network architecture to represent new user actions and take new system actions into account.
Knowledge Distillation Our proposed framework is inspired by recent work in knowledge distillation (Bucilu et al., 2006; Ba and Caruana, 2014; Li et al., 2014). Knowledge distillation means training a compact model to mimic a larger teacher model by approximating the function learned by the teacher. Hinton et al. (2015) introduced knowledge distillation to transfer knowledge from a large highly regularized model into a smaller model. The knowledge that can be transferred is not restricted to models. Stewart and Ermon (2017) proposed to distill physics and domain knowledge to train neural networks without labeled data. Hu et al. (2016) enabled a neural network to learn simultaneously from labeled instances as well as logic rules. Zhang et al. (2017) integrated multiple prior knowledge sources into neural machine translation using posterior regularization. Our experiments are based on such insights. By defining appropriate regularization terms, we can distill different kinds of knowledge (e.g., a trained model or prior knowledge) into a newly designed model, alleviating the need for new labeled data or expensive interaction environments.
Figure 2: An overview of the RL-based dialog manager used in our work. In the last turn, the system inquires "Where do you want to go?". In the current turn, the user input is "Find a restaurant in Beihai.".

RL-based Dialog Manager
Before going into the details of our method, we provide some background on the RL-based dialog manager in this section. Fig. 2 shows an overview of such a dialog manager. We describe each of the parts briefly below.
Feature Extraction At the $t$-th turn of a dialog, the user input $u_t$ is parsed into domain-specific intents and slots to form a semantic frame $a^u_t$ by a language understanding (LU) module. $o^u_t$ and $o^s_{t-1}$ are the one-hot representations of such semantic frames for the current user input and the last system output respectively. Alternatively, $o^u_t$ can be built directly from the words of $u_t$, but the vocabulary size is relatively large in real-world applications, which yields slow convergence in the absence of an LU module. Based on the slot-value pair output with the highest probability, a query is sent to a database to retrieve the information requested by the user. $o^{db}_t$ is the one-hot representation of the database result. As a result, the observable information $x_t$ is the concatenation of $o^u_t$, $o^s_{t-1}$ and $o^{db}_t$.
State Representation Based on the extracted feature vector $x_t$ and the previous internal state $s_{t-1}$, recurrent neural networks (RNNs) are used to infer the latent representation of the dialog state $s_t$ at step $t$. The current state $s_t$ can be interpreted as a summary of the dialog history $h_t$ up to the current step.
Dialog Policy Next, the dialog state representation $s_t$ is fed into a policy network. The output $\pi(a|h_t;\theta)$ of the policy network is a probability distribution over a predefined system action set $A^s$. Lastly, the system samples an action $a^s_t \in A^s$ based on $\pi(a|h_t;\theta)$ and receives a new observation $x_{t+1}$ with an assigned reward $r_t$. The policy parameters $\theta$ can be learned by maximizing the expected discounted cumulative rewards:
$$J(\theta) = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right] \quad (1)$$
where $T$ is the maximal step, and $\gamma$ is the discount factor. Usually the parameters $\theta$ can be iteratively updated by the policy gradient (Williams, 1992) approach. The policy gradient can be empirically estimated as:
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T} \nabla_\theta \log \pi(a^s_{i,t}|h_{i,t};\theta)\,(G_{i,t} - b) \quad (2)$$
where $N$ is the number of sampled episodes in a batch, $G_{i,t} = \sum_{k=0}^{T-t} \gamma^{k} r_{i,t+k}$ is the sum of discounted rewards from step $t$ in episode $i$, and $b$ is a baseline that estimates the average reward of the current policy.
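As a minimal sketch of the quantities that enter the policy-gradient estimate, the following Python computes the discounted return for every step of an episode and the weight (return minus baseline) attached to each log-probability gradient. It assumes that the baseline is the mean return over all steps in the batch, which is only one simple choice; the function names are illustrative and not taken from the original system.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Return G_t = sum_{k=0}^{T-t} gamma^k * r_{t+k} for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so that G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def advantages(batch_rewards: List[List[float]], gamma: float) -> List[List[float]]:
    """Compute (G_{i,t} - b) for each episode i and step t.

    Assumption: the baseline b is the mean of all G_{i,t} in the batch.
    """
    batch_returns = [discounted_returns(r, gamma) for r in batch_rewards]
    flat = [g for episode in batch_returns for g in episode]
    baseline = sum(flat) / len(flat)
    return [[g - baseline for g in episode] for episode in batch_returns]
```

In a full implementation each weight would multiply the gradient of the log-probability of the sampled system action at that step.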

Notations and Problem Definition
Let $A^u$ and $A^s$ denote the supported user and system action sets in the original system design respectively. $u_t$ denotes the user input in the $t$-th turn. The LU module converts $u_t$ into a domain-specific intent and associated slots to form a user action $a^u_t \in A^u$. The system returns an action $a^s_t \in A^s$ according to the dialog manager $\pi(\theta)$. Note that not all user actions are taken into account at the beginning of system design. After deployment, the developers may find, based on the human-machine interaction logs $D$, that some user actions $A^u_{new}$ cannot be handled by the original system. Generally speaking, $A^u_{new}$ consists of new intents and slots. Our goal is to extend the original system to support the new user action set $A^{u'} = A^u \cup A^u_{new}$. The extended dialog manager and new system action set are denoted as $\pi(\theta')$ and $A^{s'}$ respectively. To handle new user actions, more system actions may be added to the new system, which means that $A^s$ is a subset of $A^{s'}$. Fig. 3 shows two kinds of strategies to extend the original system. The first strategy requires a new interaction environment. However, building a user simulator or hiring real users every time the system needs to be extended is costly and impractical in real-world applications. By contrast, our method enhances the reuse of existing resources. The basic idea is to use the existing user logs, the original dialog policy model and logic rules (the "teacher") to guide the learning process of a new dialog manager (the "student"). Without an expensive interaction environment, the developers can maintain RL-based dialog systems as efficiently and straightforwardly as rule-based systems.

Distill Knowledge from the Original System
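Although the ontology of the new system is different from the original one, the extended dialog manager can still indirectly reuse the dialog policy of the ill-considered system. Given user logs $D$ and the original dialog manager $\pi(\theta)$, we define a loss $\mathcal{L}(\theta'; D, \theta)$ to minimize the difference between the new dialog manager $\pi(\theta')$ and the old one:
$$\mathcal{L}(\theta'; D, \theta) = \sum_{d \in D} \sum_{t=1}^{|d|} \mathrm{KL}\big(\pi(a|h_t;\theta)\,\|\,\pi(a|h_t;\theta')\big) \quad (3)$$
where $\pi(a|h_t;\theta)$ and $\pi(a|h_t;\theta')$ are the policy distributions over $A^s$ and $A^{s'}$ given the same dialog history $h_t$, and $|d|$ is the number of turns of a specific dialog $d \in D$. To deal with unsupported user actions, $A^s$ will be a subset of $A^{s'}$. As a result, the KL term in equation (3) can be defined as follows:
$$\mathrm{KL}\big(\pi(a|h_t;\theta)\,\|\,\pi(a|h_t;\theta')\big) = \sum_{k=1}^{|A^s|} \pi(a_k|h_t;\theta)\big[\log \pi(a_k|h_t;\theta) - \log \pi(a_k|h_t;\theta')\big] \quad (4)$$
As the original policy parameters $\theta$ are fixed, the first term is a constant with respect to $\theta'$, and the loss function in equation (3) can be rewritten as:
$$\mathcal{L}(\theta'; D, \theta) = -\sum_{d \in D} \sum_{t=1}^{|d|} \sum_{k=1}^{|A^s|} \pi(a_k|h_t;\theta) \log \pi(a_k|h_t;\theta') \quad (5)$$
This objective transfers knowledge of the original system to the "student" at the turn level. Under the guidance of the original system, the extended system will be equipped with the primary strategy to complete a task.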
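The turn-level distillation loss can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the implementation used in the experiments: it assumes that the first $|A^s|$ outputs of the student policy correspond to the original system actions, and it adds a small `eps` for numerical stability. The name `distill_loss` is hypothetical.

```python
import math
from typing import Sequence


def distill_loss(teacher_probs: Sequence[Sequence[float]],
                 student_probs: Sequence[Sequence[float]],
                 eps: float = 1e-12) -> float:
    """Cross-entropy between teacher and student policies, summed over turns.

    teacher_probs[t] has |A^s| entries and student_probs[t] has at least as
    many. Assumption: the first |A^s| student actions are the original ones.
    """
    loss = 0.0
    for p_teacher, p_student in zip(teacher_probs, student_probs):
        for k, p_k in enumerate(p_teacher):
            # Only the actions known to the teacher contribute to the sum.
            loss -= p_k * math.log(p_student[k] + eps)
    return loss
```

Because the teacher assigns no probability to the newly added actions, minimizing this loss alone pushes the student to behave like the original system on the logged turns.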

Distill Knowledge from Logic Rules
It's easy for the developers to give logic rules on the system responses to handle new user actions.
For example, if users ask to confirm a slot, the system should inform the value of that slot immediately. Note that these system actions which handle new user actions may not exist in the old model. It means the architecture of the new system is different from the old one.
We define a set of logic constraints $R = \{(h_l, a_l)\}_{l=1}^{L}$, where $h_l \in H_R$ indicates the dialog context condition in the $l$-th rule, and $a_l \in A^{s'}$ is the corresponding system action. The number of logic rules $L$ is equal to the number of new user actions. These rules can be seen as triggers: if the dialog context $h_t$ in the current turn $t$ meets the context condition $h_l$ defined in the logic rules, then the system should execute $a_l$. In our work, we use the output of the LU module to judge whether the current dialog context meets the condition defined by the logic rules. An alternative method is simple rule matching. To distill the knowledge of rules into a new system, we define a loss function $\mathcal{L}(\theta'; D, R)$ to embed such constraints in the new system:
$$\mathcal{L}(\theta'; D, R) = -\sum_{d \in D} \sum_{t=1}^{|d|} \sum_{l=1}^{L} \mathbb{1}\{h_t = h_l\} \log \pi(a_l|h_t;\theta') \quad (6)$$
where $\mathbb{1}\{\cdot\}$ is the indicator function, which equals 1 when $h_t$ meets the condition $h_l$ and 0 otherwise. Equation (6) suggests that the new dialog manager $\pi(\theta')$ will be penalized if it violates the instructions defined by the dialog rules. Note that, for simplicity, we assume these rules are absolutely correct and mutually exclusive. Although this hypothesis may lead to a non-optimal dialog system, these rules define reasonable system actions for the corresponding dialog contexts. It implies that the new system can be further refined by reinforcement learning once a new interaction environment is available.
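To illustrate how such rule constraints can act as a training signal, here is a small Python sketch. It assumes that a dialog context is available as a dictionary produced by the LU module and that each rule is a pair of a predicate over that dictionary and the index of the required system action; both representations, and the name `rule_loss`, are hypothetical choices made for this example.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

Context = Dict[str, str]                      # hypothetical LU output for one turn
Rule = Tuple[Callable[[Context], bool], int]  # (condition h_l, action index a_l)


def rule_loss(contexts: Sequence[Context],
              student_probs: Sequence[Sequence[float]],
              rules: Sequence[Rule],
              eps: float = 1e-12) -> float:
    """Negative log-likelihood of the rule actions on turns where a rule fires."""
    loss = 0.0
    for context, probs in zip(contexts, student_probs):
        for condition, action in rules:
            if condition(context):  # plays the role of the indicator function
                loss -= math.log(probs[action] + eps)
    return loss
```

Turns on which no rule fires contribute nothing, so this term only shapes the policy in the dialog situations that the original system did not support.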

Extension of Dialog Manager
In the absence of a new training environment, learning is made possible by exploiting structure that holds in the new dialog manager. On the one hand, we expect the new system to complete tasks like the original one. On the other hand, it should satisfy the constraints defined by the dialog rules. So, the learning objective of the new dialog manager $\pi(\theta')$ can be defined as follows:
$$\mathcal{L}(\theta') = \big(1 - \mathbb{1}\{h_t \in H_R\}\big)\,\mathcal{L}(\theta'; D, \theta) + \mathbb{1}\{h_t \in H_R\}\,\mathcal{L}(\theta'; D, R) \quad (7)$$
Here the indicator is evaluated at each turn. When the dialog context $h_t$ in the $t$-th turn satisfies a condition defined in $H_R$, we distill knowledge of the rules into the new system. Otherwise, we distill knowledge of the original system into the new one. Instead of retraining from scratch, developers can extend RL-based systems by reusing existing resources.
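The turn-by-turn switch between the two knowledge sources can be sketched as a single Python function. This is a simplified illustration: it uses the cross-entropy form of the distillation term (which differs from the KL divergence only by a constant when the teacher is fixed), assumes the shared actions occupy the first positions of the student output, and relies on the stated assumption that rules are mutually exclusive. The name `turn_loss` is hypothetical.

```python
import math
from typing import Optional, Sequence


def turn_loss(teacher_probs: Sequence[float],
              student_probs: Sequence[float],
              rule_action: Optional[int],
              eps: float = 1e-12) -> float:
    """Loss for one logged turn.

    rule_action is the index of the action required by the matching rule,
    or None when no rule condition is met (at most one rule can match
    because rules are assumed to be mutually exclusive).
    """
    if rule_action is not None:
        # Rule turn: push the student toward the action the rule prescribes.
        return -math.log(student_probs[rule_action] + eps)
    # Ordinary turn: imitate the original policy on the shared actions.
    return -sum(p * math.log(student_probs[k] + eps)
                for k, p in enumerate(teacher_probs))
```

Summing this value over all turns of all logged dialogs gives a training loss that needs only the logs, the original policy and the rules.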

Experiments
To evaluate our method, we conduct experiments on a dialog system extension task in the restaurant domain.

Domain
The dialog system provides restaurant information in Beijing. The database we use includes 2988 restaurants. This domain consists of 8 slots (name, area, price range, cuisine, rating, number of comments, address and phone number), of which the first four slots (inform slots) can be used for searching for the desired restaurant and all of the slots (request slots) can be asked about by users. In each dialog, the user has a goal containing a set of slots, indicating the constraints and requests of the user. For example, an inform slot, such as "inform(cuisine=Sichuan cuisine)", indicates that the user is looking for a Sichuan restaurant, and a request slot, such as "request(area)", indicates that the user is asking for information from the system (Li et al., 2016, 2017b; Peng et al., 2017).

Measurements
A main advantage of our approach is that the unconsidered user actions can be handled in the extended system. In addition to traditional measurements (e.g., success rate, average turns and average reward), we define an objective measurement called "Satis." (user satisfaction) to verify this feature in the simulated evaluation. "Satis." indicates the rate at which the system takes reasonable actions in unsupported dialog situations. It can be calculated as follows:
$$\text{Satis.} = \frac{\sum_{t} \sum_{l=1}^{L} \mathbb{1}\{h_t = h_l \wedge a^s_t = a_l\}}{\sum_{t} \sum_{l=1}^{L} \mathbb{1}\{h_t = h_l\}} \quad (8)$$
where the outer sums run over all turns of the evaluated dialogs, $h_t$ and $a^s_t$ are the dialog history and system action in the $t$-th turn, and $h_l$ and $a_l$ are the dialog context condition and corresponding system action defined in the $l$-th rule. Intuitively, an unreasonable system reply will frustrate users, and a low "Satis." indicates a poor user experience. Although "Satis." is obtained based on our hand-crafted dialog rules, it approximately measures the subjective experience of real users after system deployment.
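The "Satis." measurement can be sketched in Python as a simple ratio over the evaluated turns. The sketch assumes that each turn is logged as a pair of an LU output dictionary and a system action label, and that each rule is a pair of a predicate and the expected action label; these representations and the name `satis` are hypothetical, and returning 0.0 when no rule is ever triggered is an arbitrary convention chosen here.

```python
from typing import Callable, Dict, Sequence, Tuple

Context = Dict[str, str]                      # hypothetical LU output for one turn
Rule = Tuple[Callable[[Context], bool], str]  # (condition h_l, expected action a_l)


def satis(turns: Sequence[Tuple[Context, str]], rules: Sequence[Rule]) -> float:
    """Fraction of rule-triggering turns where the system action matches the rule."""
    triggered = 0
    satisfied = 0
    for context, system_action in turns:
        for condition, expected_action in rules:
            if condition(context):
                triggered += 1
                if system_action == expected_action:
                    satisfied += 1
    # Assumption: report 0.0 when no turn triggers any rule.
    return satisfied / triggered if triggered else 0.0
```

Only turns that match a rule condition enter the denominator, so the score reflects behavior in the previously unsupported situations rather than overall task success.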

User Simulator
Training RL-based dialog systems requires a large number of interactions with users. It's common to use a user simulator to train RL-based dialog systems in an online fashion (Pietquin and Dutoit, 2006; Scheffler and Young, 2002; Li et al., 2016). We therefore construct an agenda-based user simulator, which we refer to as Sim1, to train the original RL-based system. The user action set of Sim1 is denoted as $A^u$, which includes the following intents 4 : "hello", "bye", "inform", "deny", "negate", "affirm", "request", "reqalts" and "null". The slots of Sim1 are shown in section 6.1. In each turn, the user action consists of an intent and slots, and we append the values of the slots according to the user goal.

Implementation of the Original System
For the original RL-based dialog system, a feature vector $x_t$ of size 191 is extracted. This vector is the concatenation of encodings of LU results, the previous system reply, database results and the current turn number. The LU module is implemented with an SVM 5 for intent detection and a CRF 6 for slot filling. The language generation module is implemented by a rule-based method. The hidden dialog state representation is inferred by a GRU (Chung et al., 2014). We set the hidden state size of the GRU to 120. The policy network is implemented as a Multilayer Perceptron (MLP) with one hidden layer. The size of the hidden layer is 80. The output dimension of the policy network is 15, which corresponds to the number of system actions. To encourage shorter interactions, we set a small per-turn negative reward $R_{turn} = -1$. The maximal number of turns is set to 40. If the user goal is satisfied, the policy is encouraged by a large positive reward $R_{succ} = 10$; otherwise the policy is penalized by a negative reward $R_{fail} = -10$. The discount factor is $\gamma = 0.9$. The baseline $b$ of the current policy is estimated on the sampled episodes in a batch. The batch size $N$ is set to 32. The Adadelta (Zeiler, 2012) method is used to update the model parameters. The original system $S_1$ is trained by interacting with Sim1. After about 2400 interactions, the performance of $S_1$ starts to converge.
4 A detailed explanation of these intents is in DSTC2 (Henderson et al., 2013).
5 We use the publicly available SVM tool at http://scikit-learn.org.
6 We use the publicly available CRF tool at https://pypi.python.org/pypi/sklearn-crfsuite.
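The reward scheme above can be written down as a short Python sketch. The constants come from the text; how the terminal reward combines with the per-turn penalty on the last turn is not specified, so the sketch assumes that the two are simply added. The function name is illustrative.

```python
from typing import List

R_TURN, R_SUCC, R_FAIL, MAX_TURNS = -1.0, 10.0, -10.0, 40


def episode_rewards(num_turns: int, goal_satisfied: bool) -> List[float]:
    """Per-turn rewards for one dialog.

    Assumption: the terminal reward is added to the last turn's penalty,
    and a dialog longer than MAX_TURNS is cut off at MAX_TURNS.
    """
    num_turns = min(num_turns, MAX_TURNS)
    rewards = [R_TURN] * num_turns
    if rewards:
        rewards[-1] += R_SUCC if goal_satisfied else R_FAIL
    return rewards
```

For example, a successful three-turn dialog would receive the rewards -1, -1 and 9 under this assumption.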

Simulated Evaluation
To evaluate our approach, we design another user simulator, which we denote as Sim2, to simulate unpredictable real customers. The user action set of Sim2 is denoted as $A^{u'}$. The difference between $A^u$ and $A^{u'}$ is reflected in the domain-specific intents. Specifically, in addition to the intents of Sim1, $A^{u'}$ includes the "confirm" intent. The difference in user action sets results in different interaction strategies for Sim1 and Sim2. To verify whether a recommended restaurant meets its constraints, Sim1 can only request the value of a specific slot, whereas Sim2 can either request or confirm.
After obtaining the original system $S_1$, we deploy it to interact with Sim1 and Sim2 respectively, under different LU error rates (Li et al., 2017a). In each condition, we simulate 3200 episodes to obtain the performance. Table 1 shows the details of the test performance. Table 2 shows the statistics of turns when $S_1$ interacts with Sim2.
As shown in Table 1, $S_1$ achieves a higher dialog success rate and higher rewards when tested with Sim1. When interacting with Sim2, nearly half of the responses to unsupported user actions are not reasonable. Note that even though Sim2 contains new user actions, some of the new actions might be appropriately handled by $S_1$. This may be due to the robustness of our RL-based system, but it is far from satisfactory. Unpredictable real user behavior in the deployment stage will lead to a poor user experience in real-world applications, which underlines the importance of a maintainable system.
To maintain the original system, we define a few simple logic rules to handle unsupported user actions: if users confirm the value of a slot in the current turn, the system should inform users of that value. These rules 8 are intuitive and reasonable for handling queries such as "Is this restaurant located in Zhongguancun?". There are four slots 9 that can be used for confirmation, so we define four logic rules in all. Due to the change in ontology, we add a new status to the dialog features to represent the "confirm" intent of users. This leads to a change in the model architecture of the extended dialog manager. Then we distill the knowledge of $S_1$ and the logic rules into the extended system. No additional data is used to obtain the extended system.
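The four confirmation rules can be sketched in Python as a small lookup over the LU output. The dictionary keys (`intent`, `slot`) and the action labels of the form `inform(slot)` are hypothetical names chosen for this example; the text only states that one rule is defined per confirmable slot and that a confirmation should be answered by informing the value of that slot.

```python
from typing import Dict, List, Optional, Tuple

CONFIRM_SLOTS = ["name", "area", "price range", "cuisine"]

# One rule per confirmable slot: (context condition, system action).
RULES: List[Tuple[Dict[str, str], str]] = [
    ({"intent": "confirm", "slot": slot}, f"inform({slot})")
    for slot in CONFIRM_SLOTS
]


def rule_action(lu_output: Dict[str, str]) -> Optional[str]:
    """Return the action required by the matching rule, or None if no rule fires."""
    for condition, action in RULES:
        if all(lu_output.get(key) == value for key, value in condition.items()):
            return action
    return None
```

A query such as "Is this restaurant located in Zhongguancun?" would be parsed as a confirmation of the area slot and therefore mapped to the action that informs the area.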
For comparison, we retrain another new system (the contrast system) from scratch by interacting with Sim2. After about 2600 interactions with Sim2, the performance of the contrast system starts to converge. Note that in order to build the contrast system, the developers need to design a new user simulator or hire real users, which is expensive and impractical in industrial applications. Then we simulate 3200 interactions with Sim2 to obtain its performance. Fig. 4 illustrates the performance of the different systems. As can be seen, the extended system performs better than the original system in terms of dialog success rate and "Satis.". This is to a large degree attributed to the consideration of new user actions. Fig. 4(a) shows that the contrast system achieves a higher dialog success rate than the extended system, but the gap is negligible. Moreover, the contrast system is trained from scratch in a new interaction environment, whereas the extended system is trained by transferring knowledge from the original system and the logic rules. To train the contrast system, about 2600 episodes are sampled by interacting with a new interaction environment, while no additional data is used to train the extended system.
8 In a practical dialog system, we can inject more complex logic rules and take dialog history into account. These rules are not limited to question/answer mapping.
9 They are "name", "area", "price range" and "cuisine".
In Fig. 4(b), the "Satis." of the extended system is slightly higher than that of the contrast system. This is due to the fact that the extended system learns how to deal with new user actions from logic rules, whereas the contrast system obtains its dialog policy by exploring the environment. As a result, the contrast system learns a more flexible dialog policy than the extended system. However, "Satis." is biased toward the suboptimal rules rather than the optimal policy gained from the environment. This suggests that the extended system can be further refined by reinforcement learning once a new interaction environment is available.
Table 4: Left column shows the dialog context condition; right column shows the corresponding system action. We define 14 rules in all to handle the newfound intents and slots shown in Table 3.

Human Evaluation
In any case, the developers can't guarantee all user actions are considered. Fortunately, our method makes no assumptions about the new user actions and new dialog model architecture. As a result, the system can be extended over multiple iterations.
To evaluate this characteristic, we deploy the extended system 11 in section 6.5 to interact with real human users. Users are given a goal sampled from our corpus for reference. To elicit more complex situations, they are encouraged to interact with our system using new intents and slots related to the restaurant domain. At the end of each dialog, they are asked to give a subjective rating on a scale from 1 to 5 based on the naturalness of the system (1 is the worst, 5 is the best). After filtering out dialog sessions unrelated to our task, we collect 315 episodes in total. Table 3 shows the details of the user logs. As shown in Table 3, after deployment, there are a few slots and intents unseen before. For example, users may ask for discount information or ask to take a taxi to the restaurant. To represent the new intents and slots, the dimension of the extracted dialog features is extended to 236. Meanwhile, the number of system actions is extended to 29 to handle new user actions. To deal with the newfound user actions, we define 14 rules in total. Table 4 shows the details of the newly defined logic rules. Then we distill the knowledge of the original system and the logic rules into a new system. Fig. 5 shows the comparison of user ratings. The extended system gets a significantly higher subjective rating than the original one. This indicates that the extended system can give reasonable responses in unseen dialog situations. Table 5 shows sample dialogs of the original system and the extended system with real users. We can see that the extended system is much more coherent since it takes new user actions into account. This suggests that a complicated RL-based dialog system can start from a simple one and be improved incrementally based on our framework. This design pattern will be much more efficient if the interaction environment is hard to access.
11 The extended system in the simulated evaluation will be the original system in our human evaluation.

Conclusion
In this paper, we present a practical solution to maintain RL-based dialog systems without retraining from scratch. By reusing existing resources, developers can extend the RL-based dialog manager to deal with unpredictable user actions after deployment. Furthermore, our method places no constraints on the architecture of the new system and retains the key benefit of data-driven methods. It means the new system can be maintained iteratively and further refined by RL once a new interaction environment is available. The simulation and human experiments show our proposed framework guarantees the maintainability and scalability in RL-based systems, which are necessary for any industrial application.

Table 5: Sample dialogs of the original and extended systems in the human evaluation. The original system will collapse when encountering unseen slots and intents. After extension, the new system can handle more user actions.

Original System
User: I want to find a Sichuan restaurant.
System: Where do you want to go?
User: Zhongguancun area.
System: "Hot Charge" is a nice Sichuan restaurant in Zhongguancun area.
User: Can you tell me the discount?
System: The average cost is 42 CNY.
User: Tell me the discount!
System: "Hot Charge" is a nice Sichuan restaurant in Zhongguancun area.
Subjective Rating = 2

Extended System
User: I want to find a Sichuan restaurant in Zhongguancun area.
System: "Hot Charge" is a nice Sichuan restaurant in Zhongguancun area.
User: What's the discount of this restaurant?
System: This restaurant is offering 10% off.
User: Is there a waiting line?
System: Currently there are 5 parties waiting in line.
User: Can you book a table for me?
System: You are welcome.
Subjective Rating = 5