Resource-Constrained Dialog Policy Learning via Differentiable Inductive Logic Programming

Motivated by the needs of resource-constrained dialog policy learning, we introduce dialog policy via differentiable inductive logic (DILOG). We explore the tasks of one-shot learning and zero-shot domain transfer with DILOG on SimDial and MultiWoZ. Using a single representative dialog from the restaurant domain, we train DILOG on the SimDial dataset and obtain 99+% in-domain test accuracy. We also show that the trained DILOG zero-shot transfers to all other domains with 99+% accuracy, demonstrating the suitability of DILOG for slot-filling dialogs. We further extend our study to the MultiWoZ dataset, achieving 90+% inform and success metrics. We also observe that these metrics do not capture some of the shortcomings of DILOG in terms of false positives, prompting us to measure an auxiliary Action F1 score. We show that DILOG is 100x more data efficient than state-of-the-art neural approaches on MultiWoZ while achieving similar performance metrics. We conclude with a discussion on the strengths and weaknesses of DILOG.

However, collecting annotated data for supervised dialog policy learning is an expensive and time-consuming process. Hence, it is desirable to explore approaches that train a dialog policy with limited data and transfer an existing policy to new domains with little or even no additional training data. This practical requirement has motivated the community to research resource-constrained dialog policy learning over the past few decades. Researchers have explored approaches including employing grammar constraints for dialog policy (Eshghi et al., 2017), transfer learning (Shalyminov et al., 2019), and pre-trained language models (Zhao et al., 2020). Few-shot domain adaptation has been researched since the 2000s (Litman and Pan, 2002), both for end-to-end dialog systems (Qian and Yu, 2019; Zhao and Eskenazi, 2018) and for dialog policy learning (Vlasov et al., 2018).
In a traditional modular dialog system, the dialog policy aims to decide a dialog action given a dialog state, while assuming the tasks of language understanding and generation are handled by other components. Under such assumptions, a task-oriented dialog policy mostly follows the slot-filling scheme, which can be described by a set of probabilistic rules. Therefore, we hypothesize that a dialog policy in this limited sense can be constructed by learning the underlying rules. To this end, we draw upon recent advances in differentiable inductive logic programming (DILP) (Evans and Grefenstette, 2018), which uses neural architectures to learn almost rule-based policies. We present DILOG, an adaptation of DILP to dialog policy learning. Briefly, DILOG discerns a set of logical rules from examples by using inductive reasoning. We introduce DILOG in Section 2. We apply DILOG to the SimDial dataset (Zhao and Eskenazi, 2018) (Section 3) and the MultiWoZ dataset (Budzianowski et al., 2018) (Section 4), showing that on the tasks of one-shot dialog policy learning and zero-shot domain transfer, DILOG outperforms several neural baselines. Finally, Section 5 concludes this paper.

DILOG: Dialog Policy Via Differentiable Inductive Logic
Inductive Logic Programming (ILP) is a paradigm which derives a hypothesized first-order logic program given background knowledge, positive examples, and negative examples (Muggleton and De Raedt, 1994). The central components of ILP are known as clauses. A clause is a rule expressed as α ← α1, ..., αn, where α is the head atom and α1, ..., αn are body atoms. An atom p(t1, ..., tm) is composed of an m-ary predicate p and a tuple of terms t1, ..., tm, which can be variables or constants. An atom is ground if it contains only constants. For example, the following clause defines when to perform the action of confirmation: confirm(S) ← user_request(S, T), not_confident(S), which means that if the user requests a slot S in a task T, and the system is not confident about S, then the system should confirm S with the user. Applying the clause above to the ground atoms user_request(contact, calling) and not_confident(contact), we can deduce the action confirm(contact). DILP (Evans and Grefenstette, 2018) combines ILP with a differentiable neural network architecture to make ILP robust to noisy or ambiguous data. In short, DILP generates a collection of clauses based on rule templates and assigns trainable weights to those clauses. Logical deduction is then applied recursively using the weighted sum of the clauses on the valuation vector a ∈ [0, 1]^g, where g is the number of ground atoms and a[i] is the probability that ground atom i is true. With the trainable weights, DILP can be trained using gradient descent.
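As a concrete illustration, the weighted deduction described above can be sketched in a few lines of Python. The atoms, candidate clauses, and weight values below are hypothetical, not taken from a trained model:

```python
import math

# Valuation vector: the probability that each ground atom is true.
# index 0: user_request(contact), 1: not_confident(contact), 2: confirm(contact)
valuation = [1.0, 0.6, 0.0]

# Two candidate clauses for the head confirm(S), generated from a template.
def clause_a(v):  # confirm(S) <- user_request(S), not_confident(S)
    return min(v[0], v[1])        # fuzzy conjunction of the body atoms

def clause_b(v):  # confirm(S) <- user_request(S)   (a weaker candidate)
    return v[0]

# Trainable clause weights, softmax-normalized into a distribution.
weights = [2.0, -1.0]
exps = [math.exp(w) for w in weights]
probs = [e / sum(exps) for e in exps]

# One deduction step: the weighted sum of the per-clause deductions
# updates the valuation of the head atom confirm(contact).
deduced = probs[0] * clause_a(valuation) + probs[1] * clause_b(valuation)
valuation[2] = max(valuation[2], deduced)   # amalgamate step

print(round(valuation[2], 3))
```

Because every operation is differentiable (softmax, min as fuzzy conjunction, weighted sum), gradients flow back to the clause weights during training.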
DILOG is based on DILP with several modifications: (1) We include an option of adding L1 or L2 regularizers to the weight matrix of the clauses to improve generalization. (2) We allow adding clauses as background knowledge, so as to enable continual learning using pre-learned rules. (3) We use elementwise max instead of probabilistic sum as the amalgamate function to update the valuation vector. Because probabilistic sum accumulates the valuation at each step, it is more prone to falling into a local optimum when the inference chain is long. (4) In DILP, a problem is defined as (L, B, P, N), where L is the language frame used to generate potential clauses, B is the background knowledge, P is the set of positive examples, and N is the set of negative examples. In this definition, all positive and negative examples are grounded on the same set of constants C. As the number of clauses scales quadratically with the number of constants, this can be computationally expensive when the sample size is large. In contrast, we define a problem as (L, S), where S is the set of samples and each x ∈ S is a tuple (B, P, N, C). This modification allows each sample to define its own set of constants, which keeps the computation tractable in the dialog setting, where each dialog consists of multiple turns.
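The motivation for modification (3) can be seen numerically. In the sketch below (with a made-up per-step deduction value of 0.3), probabilistic sum saturates toward 1 as the number of inference steps grows, while elementwise max does not:

```python
# Illustrative comparison of the two amalgamate functions over repeated
# deduction steps; the per-step value 0.3 is made up for illustration.
def prob_sum(old, new):
    return old + new - old * new     # probabilistic sum (DILP)

def elem_max(old, new):
    return max(old, new)             # elementwise max (DILOG)

v_sum, v_max = 0.0, 0.0
for _ in range(10):                  # ten inference steps
    v_sum = prob_sum(v_sum, 0.3)
    v_max = elem_max(v_max, 0.3)

print(round(v_sum, 3), v_max)        # the probabilistic sum saturates
```

After ten steps the probabilistic sum reaches 1 - 0.7^10 ≈ 0.972, regardless of whether the underlying clauses deserve that confidence, while the max stays at 0.3.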
In this paper, we demonstrate the application of DILOG to two tasks in dialog policy learning: one-shot learning and zero-shot domain transfer. In a modular dialog system, which consists of automatic speech recognition, natural language understanding, dialog state tracking (DST), dialog management, and natural language generation (NLG), the role of the dialog policy is to map every state s ∈ S (represented by the DST) to a dialog action a ∈ A. In contrast to end-to-end policy learning, which generates a natural language response, we focus only on learning the policy function π : S → A, and leave NLG as a separate problem. Under this setting, DILOG offers several advantages:
• Sample efficiency: DILOG generalizes well from a small number of samples by introducing a language bias through the template used to generate the clauses, under which a set of succinct rules is preferred over a set of complex rules. This makes it useful in one-shot policy learning.
• Interpretability: The dialog policy learned via DILOG consists of a set of (probabilistic) rules. Each of the rules can be manually inspected and understood. This feature is desirable in industrial settings where interpretability and debuggability are key considerations.
• Domain generalizability: The rules learned by DILOG in one domain can be zero-shot transferred to new domains with unseen slots, by assuming symmetry between slots. This helps quickly enable a new domain with no new data collection and annotation.

DILOG on the SimDial Dataset
SimDial (Zhao and Eskenazi, 2018) is a multi-domain dialog generator that can produce conversations in domains including restaurant, movie, bus, and weather. Each domain is defined by a collection of metadata including user slots and system slots. Ignoring the actions of greeting and goodbye, the possible user actions are inform and request, while the system actions include inform, request, and database query. We use SimDial to generate clean dialogs from all four domains, among which a single representative dialog (one that contains all user and system actions) from the restaurant domain is used for training, and 500 dialogs from each of the other domains are used as the test set. To adapt the SimDial problem to DILOG, the delexicalized states and actions are converted into atoms, whose predicate is the action or state and whose term is the slot. For example, request(loc) denotes the action of requesting the location, while unknown(price) indicates that the price is unknown to the system. Each turn is converted to a sample s, where the background knowledge B is the combination of the user actions and the belief state, and the positive examples P are the system actions. DILOG learns a mapping (a set of clauses) from the background B to the positive examples P. See Table 1 for an illustrative example of the adaptation steps. The detailed process is described in Appendix C.
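The turn-to-sample conversion described above can be sketched as follows. The slot names and predicate spellings here are hypothetical; Table 1 and Appendix C give the actual format:

```python
# Convert one delexicalized SimDial turn into a DILOG sample:
# background knowledge B from the user actions and belief state,
# positive examples P from the system actions.
turn = {
    "user_actions": [("inform", "loc")],
    "belief_state": {"loc": "known", "price": "unknown"},
    "system_actions": [("request", "price")],
}

background = [f"usr_{act}({slot})" for act, slot in turn["user_actions"]]
background += [f"{status}({slot})" for slot, status in turn["belief_state"].items()]
positive = [f"sys_{act}({slot})" for act, slot in turn["system_actions"]]
constants = sorted(turn["belief_state"])   # per-sample constant set C

sample = {"B": background, "P": positive, "C": constants}
print(sample)
```

Note that the constant set C is built per sample, matching modification (4) in Section 2.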
Note that the four domains have different slots, and at training time the model is only aware of the slots in the training set. At test time, the learned rules are applied directly to the converted samples in the test set. Additionally, to demonstrate continual learning, we add the pre-trained basic relationships all and member to the background knowledge: all identifies that all items in a list satisfy some property, and member indicates that an item belongs to a list. See Appendix C for a more detailed description of the training process. We compare DILOG with two baselines. The first is Zero-Shot Dialog Generation (ZSDG) (Zhao and Eskenazi, 2018), which learns a cross-domain embedding of actions to enable the policy to zero-shot transfer to new domains. The second is a naive multi-layer perceptron (MLP) model mapping from the encoding of the states to the actions. The models are trained and evaluated on the same datasets. We employ two metrics to quantify the performance of the different models: Intent F1 measures whether the predicted dialog intent matches the ground truth, while Entity F1 measures whether the entity is predicted correctly.
The evaluation results are shown in Figure 1. The In-Domain column shows the performance of one-shot learning (trained with one dialog), while the Out-of-Domain column shows the performance of zero-shot domain transfer. We also include the ZSDG and MLP models trained with 1000 samples (denoted with -1000) from the restaurant domain. On the SimDial dataset, DILOG consistently outperforms the other models.
The better performance of DILOG on one-shot learning comes from the language bias induced by the template used to generate all the clauses, which can be regarded as a form of regularization. The ability to perform zero-shot domain transfer can be attributed to the symmetry assumed by DILOG. For example, the slots parking and price are symmetrical in the sense that the rules applied to parking should be directly applicable to price as well. DILOG only breaks symmetry when necessary (for example, to differentiate between user slots and system slots), while maintaining the symmetry otherwise. However, in the vector-form encoding used by neural networks, it is difficult, if not impossible, to express this symmetry.
The rules learned by DILOG can be extracted and interpreted by humans. For example, the rule learned for when to request a slot is sys_request(V0) ← member_usr(V0), unknown(V0), which reads: if a slot is one of the user slots, and that slot is unknown, the system should request that slot. The full set of learned rules is listed in Appendix C.4. We also analyze an example that results in an error in Appendix C.5, an analysis made possible by the interpretability of the DILOG framework.
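Because the learned rule mentions no concrete slot, it applies unchanged to slots never seen in training. A minimal sketch of this zero-shot behavior, with hypothetical weather-domain slots:

```python
# The learned rule sys_request(V0) <- member_usr(V0), unknown(V0)
# applied to a hypothetical weather-domain state with unseen slots.
user_slots = {"loc", "datetime"}
belief_state = {"loc": "known", "datetime": "unknown"}

def sys_request(slot):
    # body of the learned rule: member_usr(slot) and unknown(slot)
    return slot in user_slots and belief_state[slot] == "unknown"

actions = sorted(s for s in belief_state if sys_request(s))
print(actions)  # the system should request the unknown user slot
```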

DILOG on the MultiWoZ Dataset
MultiWoZ 2.0 (Budzianowski et al., 2018) is a large-scale human-to-human English dialog dataset, which consists of dialogs in domains including restaurant, hotel, attraction, train, and taxi. It also includes dialogs that span multiple domains. The system responses in the MultiWoZ dataset are stochastic, as the human clerks had the freedom to choose among the possible actions. The actions are annotated by human annotators, and these annotations are noisy as well. These features make MultiWoZ a more challenging task than SimDial. We used a variant of MultiWoZ 2.0 provided by ConvLab, which has annotated user actions.
The process of adapting the MultiWoZ dataset to DILOG is similar to that of SimDial (please refer to Appendix D for the details). For training DILOG, we selected a single representative dialog from the restaurant domain whose system actions contain all possible intents (inform, request, offerbooked, nooffer).
(Figure 2 notes: DAMD applied its own heuristics in data preprocessing, so the Act. F1 of DAMD is not directly comparable with the others; DAMD with the same preprocessing steps as the others has worse performance (Table A3). The Inform and Success metrics are not affected by preprocessing and are therefore comparable. All stands for the performance on all domains in the test set (1000 samples, standard error: ±1.6%); Out-of-Domain is the average performance on hotel, attraction, train, and taxi (1480 samples, standard error: ±1.3%). See is the state-of-the-art model for dialog action prediction, and MLP is a naive baseline.)
We calculate the following metrics to evaluate the quality of the models: Inform is a per-dialog metric that measures whether the system finally provides a correct entity; Success measures whether all the requested information is provided; and Action F1 is a per-turn metric that checks whether the predicted dialog action (both intent and entities) matches the ground truth. We use a template-based NLG to convert the dialog actions predicted by DILOG and MLP in order to compute Inform and Success on generated utterances. Three slots (postcode, phone, address) appear only in the system actions but not in the dialog states. We manually add an option to the template-based NLG to inform those slots whenever an entity is informed, and denote that variant as -AM, which significantly improves the success rate. We do not compare the -AM variant against other methods.
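A minimal sketch of the per-turn Action F1 computation, assuming predicted and ground-truth dialog actions are compared as sets of (intent, slot) pairs (the exact matching rules in the evaluation may differ):

```python
# Per-turn Action F1 over (intent, slot) pairs.
def action_f1(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)          # correctly predicted actions
    if tp == 0:
        return 0.0                      # also covers empty predictions
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

pred = [("inform", "name"), ("inform", "phone"), ("request", "area")]
gold = [("inform", "name"), ("request", "area")]
print(round(action_f1(pred, gold), 3))
```

Unlike Inform and Success, this metric penalizes the spurious ("inform", "phone") prediction, which is why it complements them.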
The results are shown in Figure 2. As can be seen, DILOG outperforms the other models on the overall test set, which shows that DILOG is capable of one-shot learning and zero-shot domain transfer even on noisy data. Compared with the MLP policy trained with all 526 samples in the restaurant domain (denoted as -526), DILOG trained with one sample has a higher inform/success rate but a lower Action F1 score. Notably, DILOG with the addition of missing slots in the NLG achieves 91.40 inform and 90.20 success rate overall, which is higher than the state-of-the-art DAMD. This shows that Inform and Success are incomplete metrics that do not penalize false positives. Hence, we add Action F1 as an additional metric to complement Inform and Success. Note that Action F1 cannot fully capture the performance either, since there may be multiple valid actions in a given state.

Conclusion
In this paper, we introduce DILOG for resource-constrained dialog policy learning and zero-shot domain transfer. Empirically, we demonstrate that DILOG outperforms strong neural baselines on the SimDial and MultiWoZ datasets, while offering interpretability. We also provide an intuitive explanation of why DILOG exhibits these properties. On the other hand, the DILOG framework has certain weaknesses. One is that it becomes computationally expensive as the template space grows, at which point distributed training is desirable. Further, the program template used to generate all possible clauses needs to be handcrafted, which is not a straightforward process. Another disadvantage is that real-valued inputs, such as confidence scores, cannot be taken into account automatically. Future work will naturally focus on addressing these shortcomings. One direction might be to predict the templates jointly in a multi-task setting, or to meta-learn them (Minervini et al., 2020).

A Authoring the Program Template
A program template needs to be given to an ILP system in order for it to generate suitable rules. It is a general part of the system that does not need to be designed for each domain. Manual tuning is often required to obtain a good program template. For example, the rule template is one of the important hyperparameters in the program template. A rule template is defined by the number of existentially quantified variables v ∈ N and whether to allow intensional predicates, i ∈ {0, 1}. An existentially quantified variable is a variable interpreted as "there exists ...". An intensional predicate is a predicate defined by a set of clauses. Intuitively, the larger v is, the more candidate rules will be generated, and thus the easier it is for the model to overfit. Similarly, allowing intensional predicates increases the number of rules generated. The hyperparameters can then be manually tuned to achieve the best performance. Alternatively, we can define the complexity of a rule template to be the number of candidate clauses generated under it. One can order all possible rule templates by their complexities and perform an iterative search starting from the least complex rule template. It is also possible to perform black-box optimization over the hyperparameters in order to save computation.
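The iterative search described above can be sketched as follows. The complexity function here is illustrative only (a made-up count that grows with v and with allowing intensional predicates), and train_and_eval stands in for actually training DILOG under a template:

```python
from itertools import product

# Made-up complexity: more existential variables (v) and allowing
# intensional predicates both increase the number of candidate clauses.
def complexity(v, intensional, n_predicates=4):
    return n_predicates ** (v + 1) * (2 if intensional else 1)

# All (v, intensional) rule templates, ordered from least to most complex.
templates = sorted(product(range(3), (0, 1)),
                   key=lambda t: complexity(*t))

def train_and_eval(template):       # stand-in for training DILOG
    v, intensional = template
    return v >= 1                   # pretend only v >= 1 fits the data

for template in templates:          # iterative search, simplest first
    if train_and_eval(template):
        print("selected template:", template)
        break
```

The search stops at the least complex template that fits the data, which directly encodes the preference for succinct rule sets mentioned in Section 2.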

B Learning the Relationship of all
The following example is used to learn the all relationship, which holds when all the items in a list satisfy some property.
The background knowledge B presents the list as linked nodes, where succ(A, B) means that the successor of A is B, terminal(A) means that A is a terminal node, and true(A) specifies that the property under consideration holds for item A. The same linked-list encoding is used to enumerate the user slots.
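Assuming the usual recursive definition of all over this encoding, namely all(A) ← true(A), terminal(A) and all(A) ← true(A), succ(A, B), all(B), the relation can be sketched as:

```python
# Linked-list encoding of a hypothetical user-slot list loc -> price -> area.
succ = {"loc": "price", "price": "area"}   # succ(A, B): successor of A is B
terminal = {"area"}                        # terminal(A): A ends the list
true_for = {"loc", "price", "area"}        # true(A): the property holds for A

def all_holds(a):
    # all(A) <- true(A), terminal(A)
    # all(A) <- true(A), succ(A, B), all(B)
    if a not in true_for:
        return False
    if a in terminal:
        return True
    return all_holds(succ[a])

print(all_holds("loc"))
```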

C.2 Domain Transfer
The model induced from the restaurant domain is tested on the other three domains of movie, weather, and bus. The user slots and system goal slots can be different; for example, the action of the system requesting the country slot can be mapped to the positive example of sys_request(country).

C.3 Results
The evaluation results on the SimDial dataset for different domains are shown in

C.4 Learned rules
The rules learned by DILP are simple and can be extracted and interpreted by humans. In particular, DILP converges to this set of rules with probabilities close to one, except for the predicate sys_inform; pred2 and pred3 are invented predicates used as intermediate states.
As an example, the first rule reads: If a slot is one of the user slots, and that slot is unknown, the system should request that slot.
The construction of a sample (B, P, N, C) is similar to that of SimDial. Note that during training, we ignore all domain information. At inference time, we separate the belief state by domain and run inference on each domain using the same model. Finally, the predicted actions for each domain are combined to yield the final action prediction.
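The per-domain inference scheme can be sketched as follows (the slot names and the stand-in policy are hypothetical):

```python
# Split the belief state by domain, run the same single-domain policy on
# each, and combine the predicted actions.
belief_state = {
    "restaurant": {"food": "known", "area": "unknown"},
    "hotel": {"stars": "unknown"},
}

def single_domain_policy(state):    # stand-in for the learned DILOG rules
    return [("request", slot) for slot, v in state.items() if v == "unknown"]

actions = []
for domain, state in belief_state.items():
    # tag each predicted action with its domain before combining
    actions += [(domain,) + a for a in single_domain_policy(state)]

print(actions)
```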

D.1 Results
The evaluation results for the different domains on the MultiWoZ dataset are shown in Table A3. The performance is similar across domains except for train and taxi; this is because the goals in those two domains are less diverse. DAMD applied its own heuristics in data preprocessing, so the Act. F1 of DAMD is not directly comparable with the others. We also include DAMD with the same data preprocessing as the others (denoted as DAMD'), whose performance is worse.