Hybrid Code Networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning

End-to-end learning of recurrent neural networks (RNNs) is an attractive solution for dialog systems; however, current techniques are data-intensive and require thousands of dialogs to learn simple behaviors. We introduce Hybrid Code Networks (HCNs), which combine an RNN with domain-specific knowledge encoded as software and system action templates. Compared to existing end-to-end approaches, HCNs considerably reduce the amount of training data required, while retaining the key benefit of inferring a latent representation of dialog state. In addition, HCNs can be optimized with supervised learning, reinforcement learning, or a mixture of both. HCNs attain state-of-the-art performance on the bAbI dialog dataset (Bordes and Weston, 2016), and outperform two commercially deployed customer-facing dialog systems at our company.


Introduction
Task-oriented dialog systems help a user to accomplish some goal using natural language, such as making a restaurant reservation, getting technical support, or placing a phonecall. Historically, these dialog systems have been built as a pipeline, with modules for language understanding, state tracking, action selection, and language generation. However, dependencies between modules introduce considerable complexity -for example, it is often unclear how to define the dialog state and what history to maintain, yet action selection relies exclusively on the state for input. Moreover, training each module requires specialized labels. * Currently at JPMorgan Chase Recently, end-to-end approaches have trained recurrent neural networks (RNNs) directly on text transcripts of dialogs. A key benefit is that the RNN infers a latent representation of state, obviating the need for state labels. However, end-to-end methods lack a general mechanism for injecting domain knowledge and constraints. For example, simple operations like sorting a list of database results or updating a dictionary of entities can expressed in a few lines of software, yet may take thousands of dialogs to learn. Moreover, in some practical settings, programmed constraints are essential -for example, a banking dialog system would require that a user is logged in before they can retrieve account information.
This paper presents a model for end-to-end learning, called Hybrid Code Networks (HCNs) which addresses these problems. In addition to learning an RNN, HCNs also allow a developer to express domain knowledge via software and action templates. Experiments show that, compared to existing recurrent end-to-end techniques, HCNs achieve the same performance with considerably less training data, while retaining the key benefit of end-to-end trainability. Moreover, the neural network can be trained with supervised learning or reinforcement learning, by changing the gradient update applied. This paper is organized as follows. Section 2 describes the model, and Section 3 compares the model to related work. Section 4 applies HCNs to the bAbI dialog dataset (Bordes and Weston, 2016). Section 5 then applies the method to real customer support domains at our company. Section 6 illustrates how HCNs can be optimized with reinforcement learning, and Section 7 concludes.

Model description
At a high level, the four components of a Hybrid Code Network are a recurrent neural network; domain-specific software; domain-specific action templates; and a conventional entity extraction module for identifying entity mentions in text. Both the RNN and the developer code maintain state. Each action template can be a textual communicative action or an API call. The HCN model is summarized in Figure 1.
The cycle begins when the user provides an utterance, as text (step 1). The utterance is featurized in several ways. First, a bag of words vector is formed (step 2). Second, an utterance embedding is formed, using a pre-built utterance embedding model (step 3). Third, an entity extraction module identifies entity mentions (step 4) -for example, identifying "Jennifer Jones" as a <name> entity. The text and entity mentions are then passed to "Entity tracking" code provided by the developer (step 5), which grounds and maintains entitiesfor example, mapping the text "Jennifer Jones" to a specific row in a database. This code can optionally return an "action mask", indicating actions which are permitted at the current timestep, as a bit vector. For example, if a target phone number has not yet been identified, the API action to place a phone call may be masked. It can also optionally return "context features" which are features the developer thinks will be useful for distinguish-ing among actions, such as which entities are currently present and which are absent.
The feature components from steps 1-5 are concatenated to form a feature vector (step 6). This vector is passed to an RNN, such as a long shortterm memory (LSTM) (Hochreiter and Schmidhuber, 1997) or gated recurrent unit (GRU) (Chung et al., 2014). The RNN computes a hidden state (vector), which is retained for the next timestep (step 8), and passed to a dense layer with a softmax activation, with output dimension equal to the number of distinct system action templates (step 9). 1 Thus the output of step 9 is a distribution over action templates. Next, the action mask is applied as an element-wise multiplication, and the result is normalized back to a probability distribution (step 10) -this forces non-permitted actions to take on probability zero. From the resulting distribution (step 11), an action is selected (step 12). When RL is active, exploration is required, so in this case an action is sampled from the distribution; when RL is not active, the best action should be chosen, and so the action with the highest probability is always selected.
The selected action is next passed to "Entity output" developer code that can substitute in entities (step 13) and produce a fully-formed action -for example, mapping the template "<city>, right?" to "Seattle, right?". In step 14, control branches depending on the type of the action: if it is an API action, the corresponding API call in the developer code is invoked (step 15) -for example, to render rich content to the user. APIs can act as sensors and return features relevant to the dialog, so these can be added to the feature vector in the next timestep (step 16). If the action is text, it is rendered to the user (step 17), and cycle then repeats. The action taken is provided as a feature to the RNN in the next timestep (step 18).

Related work
Broadly there are two lines of work applying machine learning to dialog control. The first decomposes a dialog system into a pipeline, typically including language understanding, dialog state tracking, action selection policy, and language generation (Levin et al., 2000;Singh et al., 2002;Williams and Young, 2007;Williams, 2008;Hori et al., 2009;Lee et al., 2009;Griol et al., 2008;Young et al., 2013;Li et al., 2014). Specifically related to HCNs, past work has implemented the policy as feed-forward neural networks , trained with supervised learning followed by reinforcement learning . In these works, the policy has not been recurrent -i.e., the policy depends on the state tracker to summarize observable dialog history into state features, which requires design and specialized labeling. By contrast, HCNs use an RNN which automatically infers a representation of state. For learning efficiency, HCNs use an external lightweight process for tracking entity values, but the policy is not strictly dependent on it: as an illustration, in Section 5 below, we demonstrate an HCNbased dialog system which has no external state tracker. If there is context which is not apparent in the text in the dialog, such as database status, this can be encoded as a context feature to the RNN.
The second, more recent line of work applies recurrent neural networks (RNNs) to learn "endto-end" models, which map from an observable dialog history directly to a sequence of output words (Sordoni et al., 2015;Shang et al., 2015;Vinyals and Le, 2015;Yao et al., 2015;Serban et al., 2016;Li et al., 2016a,c;Luan et al., 2016;Xu et al., 2016;Li et al., 2016b;Mei et al., 2016;. These systems can be applied to task-oriented domains by adding special "API call" actions, enumerating database output as a sequence of tokens (Bordes and Weston, 2016), then learning an RNN using Memory Networks (Sukhbaatar et al., 2015), gated memory networks (Liu and Perez, 2016), query reduction networks (Seo et al., 2016), and copyaugmented networks (Eric and Manning, 2017). In each of these architectures, the RNN learns to manipulate entity values, for example by saving them in a memory. Output is produced by generating a sequence of tokens (or ranking all possible surface forms), which can also draw from this memory. HCNs also use an RNN to accumulate dialog state and choose actions. However, HCNs differ in that they use developer-provided action templates, which can contain entity references, such as "<city>, right?". This design reduce learning complexity, and also enable the software to limit which actions are available via an action mask, at the expense of developer effort. To further reduce learning complexity in a practical system, entities are tracked separately, outside the the RNN, which also allows them to be substituted into action templates. Also, past end-to-end recurrent models have been trained using supervised learning, whereas we show how HCNs can also be trained with reinforcement learning.

Supervised learning evaluation I
In this section we compare HCNs to existing approaches on the public "bAbI dialog" dataset (Bordes and Weston, 2016). This dataset includes two end-to-end dialog learning tasks, in the restaurant domain, called task5 and task6. 2 Task5 consists of synthetic, simulated dialog data, with highly regular user behavior and constrained vocabulary. Dialogs include a database access action which retrieves relevant restaurants from a database, with results included in the dialog transcript. We test on the "OOV" variant of Task5, which includes entity values not observed in the training set. Task6 draws on human-computer dialog data from the second dialog state tracking challenge (DSTC2), where usability subjects (crowd-workers) interacted with several variants of a spoken dialog system (Henderson et al., 2014a). Since the database from DSTC2 was not provided, database calls have been inferred from the data and inserted into the dialog transcript. Example dialogs are provided in the Appendix Sections A.2 and A.3.
To apply HCNs, we wrote simple domain-specific software, as follows. First, for entity extraction (step 4 in Figure 1), we used a simple string match, with a pre-defined list of entity names -i.e., the list of restaurants available in the database. Second, in the context update (step 5), we wrote simple logic for tracking entities: when an entity is recognized in the user input, it is retained by the software, over-writing any previously stored value. For example, if the price "cheap" is recognized in the first turn, it is retained as price=cheap. If "expensive" is then recognized in the third turn, it over-writes "cheap" so the code now holds price=expensive. Third, system actions were templatized: for example, system actions of the form "prezzo is a nice restaurant in the west of town in the moderate price range" all map to the template "<name> is a nice restaurant in the <location> of town in the <price> price range". This results in 16 templates for Task5 and 58 for Task6. 3 Fourth, when database results are received into the entity state, they are sorted by rating. Finally, an action mask was created which encoded common-sense dependencies. These are implemented as simple if-then rules based on the presence of entity values: for example, only allow an API call if pre-conditions are met; only offer a restaurant if database results have already been received; do not ask for an entity if it is already known; etc. For Task6, we noticed that the system can say that no restaurants match the current query without consulting the database (for an example dialog, see Section A.3 in the Appendix). In a practical system this information would be retrieved from the database and not encoded in the RNN. So, we mined the training data and built a table of search queries known to yield no results. We also added context features that indicated the state of the database -for example, whether there were any restaurants matching the current query. The complete set of context features is given in Appendix Section A.4. Altogether this code consisted of about 250 lines of Python.
We then trained an HCN on the training set, employing the domain-specific software described above. We selected an LSTM for the recurrent layer (Hochreiter and Schmidhuber, 1997), with the AdaDelta optimizer (Zeiler, 2012). We used the development set to tune the number of hid-den units (128), and the number of epochs (12). Utterance embeddings were formed by averaging word embeddings, using a publicly available 300dimensional word embedding model trained using word2vec on web data (Mikolov et al., 2013). 4 The word embeddings were static and not updated during LSTM training. In training, each dialog formed one minibatch, and updates were done on full rollouts (i.e., non-truncated back propagation through time). The training loss was categorical cross-entropy. Further low-level implementation details are in the Appendix Section A.1.
We ran experiments with four variants of our model: with and without the utterance embeddings, and with and without the action mask (Figure 1, steps 3 and 6 respectively).
Following past work, we report average turn accuracy -i.e., for each turn in each dialog, present the (true) history of user and system actions to the network and obtain the network's prediction as a string of characters. The turn is correct if the string matches the reference exactly, and incorrect if not. We also report dialog accuracy, which indicates if all turns in a dialog are correct.
We compare to four past end-to-end approaches (Bordes and Weston, 2016;Liu and Perez, 2016;Eric and Manning, 2017;Seo et al., 2016). We emphasize that past approaches have applied purely sequence-to-sequence models, or (as a baseline) purely programmed rules (Bordes and Weston, 2016). By contrast, Hybrid Code Networks are a hybrid of hand-coded rules and learned models.
Results are shown in Table 1. Since Task5 is synthetic data generated using rules, it is possible to obtain perfect accuracy using rules (line 1). The addition of domain knowledge greatly simplifies the learning task and enables HCNs to also attain perfect accuracy. On Task6, rules alone fare poorly, whereas HCNs outperform past learned models.
We next examined learning curves, training with increasing numbers of dialogs. To guard against bias in the ordering of the training set, we averaged over 5 runs, randomly permuting the order of the training dialogs in each run. Results are in Figure 2. In Task5, the action mask and utterance embeddings substantially reduce the number of training dialogs required (note the horizontal axis scale is logarithmic). For Task6, the bene-   Task5-OOV and Task6. "embed" indicates whether utterance embeddings were included; "mask" indicates whether the action masking code was active.
fits of the utterance embeddings are less clear. An error analysis showed that there are several systematic differences between the training and testing sets. Indeed, DSTC2 intentionally used different dialog policies for the training and test sets, whereas our goal is to mimic the policy in the training set.
Nonetheless, these tasks are the best public benchmark we are aware of, and HCNs exceed performance of existing sequence-to-sequence models. In addition, they match performance of past models using an order of magnitude less data (200 vs. 1618 dialogs), which is crucial in practical settings where collecting realistic dialogs for a new domain can be expensive.

Supervised learning evaluation II
We now turn to comparing with purely handcrafted approaches. To do this, we obtained logs from our company's text-based customer support dialog system, which uses a sophisticated rulebased dialog manager. Data from this system is attractive for evaluation because it is used by real customers -not usability subjects -and because its rule-based dialog manager was developed by customer support professionals at our company, and not the authors. This data is not publicly available, but we are unaware of suitable humancomputer dialog data in the public domain which uses rules.
Customers start using the dialog system by entering a brief description of their problem, such as "I need to update my operating system". They are then routed to one of several hundred domains, where each domain attempts to resolve a particular problem. In this study, we collected humancomputer transcripts for the high-traffic domains "reset password" and "cannot access account".
We labeled the dialog data as follows. First, we enumerated unique system actions observed in the data. Then, for each dialog, starting from the beginning, we examined each system action, and determined whether it was "correct". Here, correct means that it was the most appropriate action among the set of existing system actions, given the history of that dialog. If multiple actions were arguably appropriate, we broke ties in favor of the existing rule-based dialog manager. Example dialogs are provided in the Appendix Sections A.5 and A.6.
If a system action was labeled as correct, we left it as-is and continued to the next system action. If the system action was not correct, we replaced it with the correct system action, and discarded the rest of the dialog, since we do not know how the user would have replied to this new system action. The resulting dataset contained a mixture of complete and partial dialogs, containing only correct system actions. We partitioned this set into training and test dialogs. Basic statistics of the data are shown in Table 2.
In this domain, no entities were relevant to the control flow, and there was no obvious mask logic since any question could follow any question. Therefore, we wrote no domain-specific software for this instance of the HCN, and relied purely on the recurrent neural network to drive the conversation. The architecture and training of the RNN was the same as in Section 4, except that here we did not have enough data for a validation set, so we instead trained until we either achieved 100% accuracy on the training set or reached 200 epochs.
To evaluate, we observe that conventional measures like average dialog accuracy unfairly penalize the system used to collect the dialogs -in our case, the rule-based system. If the system used for collection makes an error at turn t, the labeled dialog only includes the sub-dialog up to turn t, and the system being evaluated off-line is only evaluated on that sub-dialog. In other words, in our case, reporting dialog accuracy would favor the HCN because it would be evaluated on fewer turns than the rule-based system. We therefore  , where C(HCN-win) is the number of test dialogs where the rule-based approach output a wrong action before the HCN; C(rule-win) is the number of test dialogs where the HCN output a wrong action before the rulebased approach; and C(all) is the number of dialogs in the test set. When ∆P > 0, there are more dialogs in which HCNs produce longer continuous sequences of correct actions starting from the beginning of the dialog. We run all experiments 5 times, each time shuffling the order of the training set. Results are in Figure 3. HCNs exceed performance of the existing rule-based system after about 30 dialogs.
In these domains, we have a further source of knowledge: the rule-based dialog managers themselves can be used to generate example "sunnyday" dialogs, where the user provides purely expected inputs. From each rule-based controller, synthetic dialogs were sampled to cover each expected user response at least once, and added to the set of labeled real dialogs. This resulted in 75 dialogs for the "Forgot password" domain, and 325 for the "Can't access account" domain. Training was repeated as described above. Results are also included in Figure 3, with the suffix "sampled". In the "Can't access account" domain, the sampled dialogs yield a large improvement, probably because the flow chart for this domain is large, so the sampled dialogs increase coverage. The gain in the "forgot password" domain is present but smaller.
In summary, HCNs can out-perform  Figure 3: Training dialogs vs. ∆P , where ∆P is the fraction of test dialogs where HCNs produced longer initial correct sequences of system actions than the rules, minus the fraction where rules produced longer initial correct sequences than the HCNs. "embed" indicates whether utterance embeddings were included; "sampled" indicates whether dialogs sampled from the rule-based controller were included in the training set.
production-grade rule-based systems with a reasonable number of labeled dialogs, and adding synthetic "sunny-day" dialogs improves performance further. Moreover, unlike existing pipelined approaches to dialog management that rely on an explicit state tracker, this HCN used no explicit state tracker, highlighting an advantage of the model.

Reinforcement learning illustration
In the previous sections, supervised learning (SL) was applied to train the LSTM to mimic dialogs provided by the system developer. Once a system operates at scale, interacting with a large number of users, it is desirable for the system to continue to learn autonomously using reinforcement learning (RL). With RL, each turn receives a measurement of goodness called a reward; the agent explores different sequences of actions in different situations, and makes adjustments so as to maximize the expected discounted sum of rewards, which is called the return, denoted G. For optimization, we selected a policy gradient approach (Williams, 1992), which has been successfully applied to dialog systems (Jurčíček et al., 2011), robotics (Kohl and Stone, 2004), and the board game Go (Silver et al., 2016). In policy gradient-based RL, a model π is parameterized by w and outputs a distribution from which actions are sampled at each timestep. At the end of a trajectory -in our case, dialog -the return G for that trajectory is computed, and the gradients of the probabilities of the actions taken with respect to the model weights are computed. The weights are then adjusted by taking a gradient step proportional to the return: where α is a learning rate; a t is the action taken at timestep t; h t is the dialog history at time t; G is the return of the dialog; x F denotes the Jacobian of F with respect to x; b is a baseline described below; and π(a|h; w) is the LSTM -i.e., a stochastic policy which outputs a distribution over a given a dialog history h, parameterized by weights w. The baseline b is an estimate of the average return of the current policy, estimated on the last 100 dialogs using weighted importance sampling. 5 Intuitively, "better" dialogs receive a positive gradient step, making the actions selected more likely; and "worse" dialogs receive a negative gradient step, making the actions selected less likely. SL and RL correspond to different methods of updating weights, so both can be applied to the same network. However, there is no guarantee that the optimal RL policy will agree with the SL training set; therefore, after each RL gradient step, we check whether the updated policy reconstructs the training set. If not, we re-run SL gradient steps on the training set until the model reproduces the training set. Note that this approach allows new training dialogs to be added at any time during RL optimization.
We illustrate RL optimization on a simulated dialog task in the name dialing domain. In this system, a contact's name may have synonyms ("Michael" may also be called "Mike"), and a contact may have more than one phone number, such as "work" or "mobile", which may in turn have synonyms like "cell" for "mobile". This domain has a database of names and phone numbers taken from the Microsoft personnel directory, 5 entity typesfirstname, nickname, lastname, phonenumber, and phonetype -and 14 actions, including 2 API call actions. Simple entity logic was coded, which retains the most recent copy of recognized entities. A simple action mask suppresses impossible actions, such as placing a phonecall before a phone number has been retrieved from the database. Example dialogs are provided in Appendix Section A.7.
To perform optimization, we created a simulated user. At the start of a dialog, the simulated user randomly selected a name and phone type, including names and phone types not covered by the dialog system. When speaking, the simulated user can use the canonical name or a nickname; usually answers questions but can ignore the system; can provide additional information not requested; and can give up. The simulated user was parameterized by around 10 probabilities, set by hand.
We defined the reward as being 1 for successfully completing the task, and 0 otherwise. A discount of 0.95 was used to incentivize the system to complete dialogs faster rather than slower, yielding return 0 for failed dialogs, and G = 0.95 T −1 for successful dialogs, where T is the number of system turns in the dialog. Finally, we created a set of 21 labeled dialogs, which will be used for supervised learning.
For the RNN in the HCN, we again used an LSTM with AdaDelta, this time with 32 hidden units. RL policy updates are made after each dialog. Since a simulated user was employed, we did not have real user utterances, and instead relied on context features, omitting bag-of-words and utterance embedding features.
LSTM, and begin RL optimization. After 10 RL updates, we freeze the policy, and run 500 dialogs with the user simulation to measure task completion. We repeat all of this for 100 runs, and report average performance. In addition, we also report results by initializing the LSTM using supervised learning on the training set, consisting of 1, 2, 5, or 10 dialogs sampled randomly from the training set, then running RL as described above.
Results are in Figure 4. Although RL alone can find a good policy, pre-training with just a handful of labeled dialogs improves learning speed dramatically. Additional experiments, not shown for space, found that ablating the action mask slowed training, agreeing with Williams (2008).
Finally, we conduct a further experiment where we sample 10 training dialogs, then add one to the training set just before RL dialog 0, 100, 200, ... , 900. Results are shown in Figure 4. This shows that SL dialogs can be introduced as RL is in progress -i.e., that it is possible to interleave RL and SL. This is an attractive property for practical systems: if a dialog error is spotted by a developer while RL is in progress, it is natural to add a training dialog to the training set.

Conclusion
This paper has introduced Hybrid Code Networks for end-to-end learning of task-oriented dialog systems. HCNs support a separation of concerns where procedural knowledge and constraints can be expressed in software, and the control flow is learned. Compared to existing end-to-end approaches, HCNs afford more developer control and require less training data, at the expense of a small amount of developer effort.
Results in this paper have explored three different dialog domains. On a public benchmark in the restaurants domain, HCNs exceeded performance of purely learned models. Results in two troubleshooting domains exceeded performance of a commercially deployed rule-based system. Finally, in a name-dialing domain, results from dialog simulation show that HCNs can also be optimized with a mixture of reinforcement and supervised learning.
In future work, we plan to extend HCNs by incorporating lines of existing work, such as integrating the entity extraction step into the neural network (Dhingra et al., 2017), adding richer utterance embeddings (Socher et al., 2013), and supporting text generation (Sordoni et al., 2015). We will also explore using HCNs with automatic speech recognition (ASR) input, for example by forming features from n-grams of the ASR n-best results (Henderson et al., 2014b). Of course, we also plan to deploy the model in a live dialog system. More broadly, HCNs are a general model for stateful control, and we would be interested to explore applications beyond dialog systemsfor example, in NLP medical settings or humanrobot NL interaction tasks, providing domain constraints are important for safety; and in resourcepoor settings, providing domain knowledge can amplify limited data.

A.1 Model implementation details
The RNN was specified using Keras version 0.3.3, with back-end computation in Theano version 0.8.0.dev0 (Theano Development Team, 2016;Chollet, 2015). The Keras model specification is given below. The input variable obs includes all features from Figure 1 step 6 except for the previous action (step 18) and the action mask (step 6, top-most vector).