Semi-Supervised Dialogue Policy Learning via Stochastic Reward Estimation

Dialogue policy optimization often obtains feedback only upon task completion in task-oriented dialogue systems. This is insufficient for training intermediate dialogue turns since supervision signals (or rewards) are only provided at the end of dialogues. To address this issue, reward learning has been introduced to learn from state-action pairs of an optimal policy to provide turn-by-turn rewards. This approach requires complete state-action annotations of human-to-human dialogues (i.e., expert demonstrations), which is labor-intensive. To overcome this limitation, we propose a novel reward learning approach for semi-supervised policy learning. The proposed approach learns a dynamics model as the reward function, which models dialogue progress (i.e., state-action sequences) based on expert demonstrations, either with or without annotations. The dynamics model computes rewards by predicting whether the dialogue progress is consistent with expert demonstrations. We further propose to learn action embeddings for better generalization of the reward function. The proposed approach outperforms competitive policy learning baselines on MultiWOZ, a benchmark multi-domain dataset.


Introduction
Task-oriented dialogue systems complete tasks for users, such as making a restaurant reservation or finding attractions to visit, in multi-turn dialogues (Gao et al., 2018; Sun et al., 2016, 2017). Dialogue policy is a critical component in both the conventional pipeline approach (Young et al., 2013) and recent end-to-end approaches (Zhao et al., 2019). It decides the next action that a dialogue system should take at each turn. Given its nature as a sequential decision-making problem, dialogue policy is usually learned via reinforcement learning (Su et al., Peng et al., 2018; Zhang et al., 2019). Specifically, dialogue policy is learned by maximizing accumulated rewards over interactions with an environment (i.e., actual users or a user simulator). Handcrafted rewards are commonly used for policy learning in earlier work (Peng et al., 2018), which assigns a small negative penalty at each turn and a large positive or negative reward when the task succeeds or fails. However, such a reward setting provides no informative supervision signal at any turn other than the last, which causes the sparse reward issue and may result in poorly learned policies (Takanobu et al., 2019).
To address this problem, reward function learning that relies on expert demonstrations has been introduced (Takanobu et al., 2019; Li et al., 2019b). Specifically, state-action sequences generated by an optimal policy (i.e., expert demonstrations) are collected, and a reward function is learned to give high rewards to state-action pairs that better resemble the behaviors of the optimal policy. In this way, turn-by-turn rewards estimated by the reward function can be provided to learn the dialogue policy. Obtaining expert demonstrations is critical to reward function learning. Since it is impractical to assume that an optimal policy is always available, a common and reasonable approach is to treat the decisions made in human-human dialogues as optimal behaviors. To accommodate the learning of the reward function, human-human dialogues need to be annotated in the form of state-action pairs from textual utterances. Table 1 illustrates an example of a human-human dialogue and its state-action annotation. However, obtaining such annotations requires extensive effort and cost. Besides, a reward function based on state-action pairs might cause unstable policy learning, especially with a limited amount of annotated dialogues (Yang et al., 2018).
To address the above issues, we propose to learn dialogue policies in a semi-supervised setting where the system actions of expert demonstrations only need to be partially annotated. We propose to use an implicitly trained stochastic dynamics model as the reward function to replace the conventional reward function that is restricted to state-action pairs. Dynamics models describe sequential progress using a combination of stochastic and deterministic states in a latent space, which promotes effective tracking and forecasting (Minderer et al., 2019; Sun et al., 2019; Wang et al., 2019a). In our scenario, we train the dynamics model to describe the dialogue progress of expert demonstrations. The main rationale is that the reward function should give high rewards to actions that lead to dialogue progress similar to that in expert demonstrations. This is because dialogue progress at the early stage highly influences subsequent progress, and the latter directly determines whether the task can be completed. Since the dynamics model learns to map observations to latent states and then reasons over the latent states, we are no longer restricted to fully annotated dialogues. Using the dynamics model as the reward function also promotes more stable policy learning.
Learning the dynamics model in the text space is, however, prone to compounding errors due to the complexity and diversity of language. We tackle this challenge by learning the dynamics model in an action embedding space that encodes the effect of system utterances on dialogue progress. We achieve action embedding learning by incorporating an embedding function into a generative-model framework for semi-supervised learning (Kingma et al., 2014). We observe that system utterances with comparable effects on dialogue progress lead to similar state transitions (Huang et al., 2019a). Therefore, we formulate the generative model to describe the state transition process. Using the generative model, we enrich the expert dialogues (either fully or partially annotated) with action embeddings to learn the dynamics model. Moreover, we also consider the scenarios where both state and action annotations are absent in most expert dialogues, referred to as unlabeled dialogues. To expand the proposed approach to such scenarios, we further propose to model dialogue progress using action sequences and reformulate the generative model accordingly.
Our contributions are summarized as follows:
• To the best of our knowledge, we are the first to approach semi-supervised dialogue policy learning.
• We propose a novel reward estimation approach to dialogue policy learning which relieves the requirement of extensive annotations and promotes stable learning of dialogue policy.
• We propose an action embedding learning technique to effectively train the reward estimator from either partially labeled or unlabeled dialogues.
• We conduct extensive experiments on the benchmark multi-domain dataset. Results show that our approach consistently outperforms strong baselines coupled with semi-supervised learning techniques.

Preliminaries
For task-oriented dialogues, a dialogue policy π(a|s) decides an action a ∈ A based on the dialogue state s ∈ S at each turn, where A and S are the predefined sets of all actions and states, respectively. Reinforcement learning is commonly applied to dialogue policy learning, where the dialogue policy model is trained to maximize accumulated rewards through interactions with environments (i.e., users): where τ_i = {(s_t, a_t) | 0 ≤ t ≤ n_τ} represents a sampled dialogue, and r(τ_i) is the numerical reward obtained in this dialogue. Instead of determining r(τ_i) via heuristics, recent reward learning approaches train a reward function r_θ to assign a numerical reward to each state-action pair. The reward function is learned from expert demonstrations D_demo, which are dialogues sampled from an optimal policy in the form of state-action pairs. Adversarial learning is usually adopted to enforce higher rewards for state-action pairs from expert demonstrations and lower rewards for those sampled from the learning policy (Fu et al., 2017): where π is the current dialogue policy, and q is the distribution of dialogues generated with π. In this way, the dialogue policy and reward function are iteratively optimized, which requires great training effort and might lead to unstable learning results (Yang et al., 2018). Moreover, such a reward learning approach requires complete dialogue state and system action annotations of expert demonstrations, which are expensive to obtain.

Figure 1: Overall framework of the proposed approach
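To make the sparse-reward issue of heuristic rewards concrete, the following minimal sketch builds the per-turn rewards of a handcrafted scheme (a small penalty every turn, one large terminal reward). The function names and the numeric values (-1, +40, -20) are illustrative assumptions, not values taken from the paper.

```python
def handcrafted_rewards(num_turns, success, turn_penalty=-1.0,
                        success_bonus=40.0, failure_penalty=-20.0):
    """Per-turn rewards of a handcrafted scheme (all values are assumed).

    Every turn receives the same small penalty; only the last turn also
    receives a large reward whose sign depends on task success.
    """
    if num_turns < 1:
        raise ValueError("a dialogue needs at least one turn")
    rewards = [turn_penalty] * num_turns
    rewards[-1] += success_bonus if success else failure_penalty
    return rewards


def dialogue_return(rewards, discount=1.0):
    """Discounted sum of the per-turn rewards of one dialogue."""
    return sum(r * discount ** t for t, r in enumerate(rewards))


rewards = handcrafted_rewards(num_turns=3, success=True)
total = dialogue_return(rewards)
```

All turns before the last receive an identical signal, so the return alone cannot tell the policy which intermediate action helped; this is the gap that a learned turn-level reward function is meant to fill.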

Overview
We study the problem of semi-supervised dialogue policy learning. Specifically, we consider the setting where expert demonstrations D_demo consist of a small number of fully labeled dialogues D_F and partially labeled dialogues D_P. For each fully annotated dialogue τ_i in D_F, complete annotations are available: τ_i = {(s_t, a_t, u_t) | 1 ≤ t ≤ n_τ}, where u_t is the system utterance at turn t. Meanwhile, each partially labeled dialogue τ_j in D_P only has state annotations and system utterances: τ_j = {(s_t, u_t) | 1 ≤ t ≤ n_τ}.
Figure 1 illustrates the overall framework of the proposed approach. Rewards are estimated by a dynamics model that consumes action embeddings e(a_t). Every action in the set A is mapped to a fixed-length embedding via a learnable embedding function f_E. To obtain the action embeddings for D_P, which has no action annotations, we first predict the action via a prediction model f_A and then transform the predicted actions to embeddings. To obtain effective action embeddings, we design a state-transition based objective to jointly optimize f_E and f_A via variational inference (Sec. 3.2). After obtaining the action embeddings, the dynamics model is learned by fitting the expert demonstrations enriched by action embeddings. Rewards are then estimated as the conditional probability of the action given the current dialogue progress encoded in latent states (Sec. 3.3). We also extend the above approach to unlabeled dialogues where both state and action annotations are absent (Sec. 3.4).

Action Learning via Generative Models
We aim to learn the prediction model f_A and action embeddings using both D_F and D_P. We formulate the action prediction model as f_A(a | u_t, s_t, s_{t+1}), which takes as input the system utterance u_t and its corresponding state transition (s_t, s_{t+1}). We then introduce a mapping function f_E : A → E, where E ⊆ R^d is the action embedding space later used for learning the dynamics model.
We train the prediction model with a variational inference approach based on a semi-supervised variational autoencoder (Semi-VAE) (Kingma et al., 2014). Semi-VAE describes the data generation process of feature-label pairs {(x_i, y_i) | 1 ≤ i ≤ N} via latent variables z as: where p_θ is a generative model parameterised by θ, and the class label y is treated as a latent variable for unlabeled data. Since the log-likelihood in Eqn. 3 is intractable, its variational lower bound for unlabeled data is instead optimized as: where q_φ(z|x, y) and q_ψ(y|x) are inference models for the latent variables z and y respectively, which have a factorised form q_{φ,ψ}(y, z|x) = q_φ(z|x, y) q_ψ(y|x); H(•) denotes causal entropy; L(x, y) is the variational bound for labeled data, and is formulated as: where KL is the Kullback-Leibler divergence, and p(y), p(z) are the prior distributions of y and z.
The generative model p_θ and the inference models q_φ and q_ψ are optimized on both the labeled subset p_l and the unlabeled subset p_u with the objective:
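The way the unlabeled bound is assembled from the labeled bound can be sketched numerically. The sketch below follows the standard Semi-VAE construction of Kingma et al. (2014): the labeled bound is averaged over the predicted label distribution and the entropy of that distribution is added. It assumes the bounds are written as lower bounds to be maximized; the function names are hypothetical, and the per-label bound values would come from a trained model rather than being supplied by hand.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def unlabeled_bound(label_probs, labeled_bounds):
    """U(x) = sum_y q(y|x) * L(x, y) + H(q(y|x)).

    label_probs:    q(y|x) for every candidate label y
    labeled_bounds: L(x, y), the labeled lower bound evaluated at each y
    """
    if len(label_probs) != len(labeled_bounds):
        raise ValueError("one bound value is needed per label")
    expected = sum(q * b for q, b in zip(label_probs, labeled_bounds))
    return expected + entropy(label_probs)


def semi_vae_objective(labeled_bound_values, unlabeled_items):
    """Sum of labeled bounds plus unlabeled bounds over the two subsets."""
    labeled_part = sum(labeled_bound_values)
    unlabeled_part = sum(unlabeled_bound(q, b) for q, b in unlabeled_items)
    return labeled_part + unlabeled_part
```

When the label distribution is one-hot, the entropy term vanishes and the unlabeled bound reduces to the labeled bound of the predicted label, which is why confident predictions on unlabeled data behave like additional labeled examples.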

Semi-Supervised Action Prediction
We now describe the learning of the action prediction model f_A using semi-supervised expert demonstrations. We extend the semi-supervised VAE by modeling the generation process of state transitions.
State transition information is indicative of the action taken and is available in both fully and partially labeled demonstrations. Thus we choose to describe the generation process of state transitions, and the optimization objective is formulated as: For partially labeled dialogues, we treat action labels as latent variables and use the action prediction model f_A(a | u_t, s_t, s_{t+1}) to infer their values (denoted as f_A(a|•) below for simplicity). The variational bound of Eqn. 7 is derived as: where L(s_{t+1}, s_t, a_t) is the variational bound for demonstrations with action labels and is derived as: where q_φ(z | u_t, a) is the inference model for the latent variable z. Lastly, we use fully annotated samples to form a classification loss: The overall objective includes the losses of fully and partially labeled demonstrations:

Action Embeddings Learning
We then incorporate the action embedding function f_E into the developed semi-supervised action prediction approach. The reason to introduce action embeddings is to make the learning of the reward estimator more efficient and robust. Specifically, prediction errors of the action prediction model might impair the learning of the reward estimator, especially in our semi-supervised scenarios where fully labeled dialogues are limited. By mapping actions to an embedding space, partially labeled demonstrations with wrongly predicted actions can still provide useful knowledge, since a mispredicted action that lies close to the correct one in the embedding space yields a similar input to the reward estimator. We can thus achieve better generalization over actions for reward estimation.
To this end, we consider the inference steps in the semi-supervised learning process and utilize the ones that involve action labels, i.e., the inference models for the latent variables z and a. We first specify how the action prediction model is modified to include action embeddings. Inspired by Chandak et al. (2019), we model the action selection using a Boltzmann distribution for stability during training: where γ is a temperature parameter, and g(•) is a function that maps the input into hidden states of the same dimension as the action embeddings. We also modify the inference model for the latent variable by incorporating action embeddings: After optimizing the action prediction model f_A and the action embedding function f_E jointly using the objective function in Eqn. 11, we use action embeddings to enrich the expert demonstrations. For fully labeled dialogues, we map the given system action labels to the corresponding embeddings and obtain τ_i = {(s_t, e(a_t)) | 1 ≤ t ≤ n_τ}. For partially labeled dialogues, we first infer the action using the prediction model, ã_t = f_A(u_t, s_t, s_{t+1}), and map the inferred action to its embedding to obtain:
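A Boltzmann action distribution over embeddings can be sketched as follows. This is a minimal illustration that assumes the score of an action is the dot product between its embedding e(a) and the hidden vector g(•), divided by the temperature γ; the exact scoring function used by the authors is not shown in this text, and all names below are hypothetical.

```python
import math


def boltzmann_action_probs(hidden, action_embeddings, temperature=1.0):
    """Softmax over actions of <e(a), g(.)> / temperature (assumed form).

    hidden:            output of g(.), same dimension as the embeddings
    action_embeddings: one embedding vector per action in the action set
    """
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    scores = [
        sum(h * e for h, e in zip(hidden, embedding)) / temperature
        for embedding in action_embeddings
    ]
    # Subtract the maximum score for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def predict_action(hidden, action_embeddings, temperature=1.0):
    """Index of the most probable action under the Boltzmann distribution."""
    probs = boltzmann_action_probs(hidden, action_embeddings, temperature)
    return max(range(len(probs)), key=probs.__getitem__)


def enrich_with_embeddings(states, action_ids, action_embeddings):
    """Pair every state with the embedding of its (given or inferred) action."""
    return [(s, action_embeddings[a]) for s, a in zip(states, action_ids)]
```

Because probability mass is assigned through embedding similarity, two actions with nearby embeddings receive nearly equal probabilities, so confusing one for the other changes the enriched demonstration only slightly. A lower temperature sharpens the distribution toward the best-scoring action.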

Reward Estimation by Dynamics Model
We aim to learn a reward estimator based on action representations obtained from the action learning module. To achieve more stable reward estimation than adversarial reward learning, we propose a reward estimator based on dialogue progress. Dialogue progress describes how user goals are achieved through multi-step interactions and can be modeled as dialogue state transitions. We argue that an action should be given a higher reward when it leads to dialogue progress (i.e., state transitions) similar to that of expert demonstrations. To this end, we learn a model that explicitly describes dialogue progress without the negative sampling required by adversarial learning, and rewards can be estimated as the local probabilities assigned to the taken actions.
To model dialogue progress, we use a variational recurrent neural network (VRNN) (Chung et al., 2015). We use a stochastic dynamics model because of the 'one-to-many' nature of task-oriented dialogues. Specifically, both the user and the dialogue system have multiple feasible options to proceed with the dialogue, which requires the modeling of uncertainty. Thus, by adding latent random variables to an RNN architecture, VRNN can provide better modeling of dialogue progress than deterministic dialogue state tracking.
VRNN has three types of variables: the observations (here, the action embeddings), the stochastic state z, and the deterministic hidden state h, which summarizes previous stochastic states z_{≤t} and previous observations a_{≤t}. We formulate the prior stochastic states to be conditioned on previous timesteps through the hidden state h_{t−1}: We obtain posterior stochastic states by incorporating the observation at the current step, i.e., the action embedding e(a_t): Predictions are made by decoding latent states, including both the stochastic and deterministic ones: Lastly, the deterministic states are updated as: where the functions ϕ are all implemented as neural networks.
Note that we also condition the prediction and recurrence steps on the dialogue state s_t to provide more information.
We train the VRNN by optimizing the evidence lower bound (ELBO) as: The rewards are estimated as the conditional probability given the hidden state of the VRNN, which encodes the current dialogue progress: where p_{ϕ_dec} is the probability given to the selected action based on the decoding step of the VRNN (Eqn. 16). The larger this conditional probability is, the more closely the dialogue progress induced by this action resembles the expert demonstrations. The proposed reward estimation is agnostic to the choice of policy, and various approaches (e.g., Deep Q-learning, Actor-Critic) can be optimized by plugging the estimated rewards into the policy learning objective (Eqn. 1).
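For orientation, the four VRNN steps and the training objective described above can be written in the standard form of Chung et al. (2015), with the action embedding e(a_t) as the observation and the dialogue state s_t added to the prediction and recurrence steps as the text states. This is a sketch of the generic VRNN, not a transcription of the authors' equations: the Gaussian parameterisation, the exact argument lists, and the omission of separate feature extractors are assumptions.

```latex
\begin{align*}
\text{prior:}      \quad & z_t \sim \mathcal{N}\big(\mu^{p}_t, \operatorname{diag}((\sigma^{p}_t)^2)\big),
                           \quad [\mu^{p}_t, \sigma^{p}_t] = \varphi^{\mathrm{prior}}(h_{t-1}) \\
\text{posterior:}  \quad & z_t \mid e(a_t) \sim \mathcal{N}\big(\mu^{q}_t, \operatorname{diag}((\sigma^{q}_t)^2)\big),
                           \quad [\mu^{q}_t, \sigma^{q}_t] = \varphi^{\mathrm{enc}}(e(a_t), h_{t-1}) \\
\text{prediction:} \quad & e(a_t) \mid z_t \sim p_{\varphi^{\mathrm{dec}}}\big(\cdot \mid z_t, h_{t-1}, s_t\big) \\
\text{recurrence:} \quad & h_t = \varphi^{\mathrm{rnn}}\big(e(a_t), z_t, h_{t-1}, s_t\big) \\
\text{ELBO:}       \quad & \mathbb{E}_{q}\Big[\sum_{t}
                           \log p_{\varphi^{\mathrm{dec}}}\big(e(a_t) \mid z_t, h_{t-1}, s_t\big)
                           - \mathrm{KL}\big(q(z_t \mid e(a_t), h_{t-1}) \,\|\, p(z_t \mid h_{t-1})\big)\Big]
\end{align*}
```

In this form the reward of a taken action is read off the prediction step: it is the probability the decoder assigns to that action given the latent states that summarize the dialogue so far.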

Expanding to Unlabeled Corpus
We further describe how to extend the proposed model, including the action learning and reward estimation modules, to utilize unlabeled expert demonstrations. Formally, we consider the setting where we have fully labeled dialogues D_F and unlabeled dialogues D_U. For each dialogue in D_U, only textual conversations are provided and neither state nor action labels are available: , where c_t is the context and consists of the dialogue history of both user and system utterances.
In the absence of dialogue state information, we formulate the action prediction model as f_A(a | u_t, u_{t−1}, u_{t+1}). This formulation can be considered an application of Skip-Thought (Kiros et al., 2015), which originally utilizes contextual sentences as supervision signals. In our scenarios, we instead utilize the previous and next system utterances to provide more indicative information for action prediction.
We also build the joint learning of the action prediction model and the action embeddings on the semi-supervised VAE framework. Instead of modeling state transitions, we choose the process of response generation to fully utilize unlabeled dialogues: System action labels are treated as latent variables for unlabeled dialogues, and the variational bound is derived as: where L(u_t, a) is the variational bound for fully labeled dialogues: The objective to jointly train the prediction model and action embeddings is the same as Eqn. 11, where the terms for fully and partially labeled dialogues are replaced with the ones in Eqn. 22 and 21, respectively. This extension also enables effective semi-supervised learning when expert demonstrations include all types of labeled dialogues: D_F, D_P and D_U. We notice that the posterior approximation q_φ(z | u_t, a) and the action embedding function f_E can be shared between the processes of state transition and response generation. Thus, by treating semi-supervised learning in D_F and D_P as auxiliary constraints, the learning over the unlabeled corpus can also benefit from dialogue state information.

Experiments
To show the effectiveness of the proposed model (denoted as Act-VRNN), we experiment on a multi-domain dialogue environment under the semi-supervised setting (Sec. 4.1). We compare against state-of-the-art approaches and their variants enhanced by semi-supervised learning techniques (Sec. 4.2). We analyze the effectiveness of action learning and reward estimation of Act-VRNN under different supervision ratios (Sec. 4.3).

Settings
We use MultiWOZ (Budzianowski et al., 2018), a multi-domain human-human conversational dataset, in our experiments. It contains in total 8438 dialogues spanning seven domains, and each dialogue has 13.7 turns on average. MultiWOZ also contains a larger dialogue state and action space compared to former datasets such as movie-ticket booking dialogues (Li et al., 2017), and thus it is a much more challenging environment for policy learning. To use MultiWOZ for policy learning, a user simulator that initializes a user goal at the beginning and interacts with the dialogue policy is required. For a fair comparison, we adopt the same procedure as Takanobu et al. (2019) to train the user simulator based on auxiliary user action annotations provided by ConvLab (Lee et al., 2019).
To simulate semi-supervised policy learning, we remove system action annotations and dialogue state annotations to obtain partially labeled and unlabeled expert demonstrations, respectively. Fully labeled expert demonstrations are randomly sampled from all training dialogues with different ratios (5%, 10%, and 20% in our experiments). Note that the absence of action or state annotations only applies to expert demonstrations, while interactions between the policy and the user simulator are at the dialogue-act level as in Takanobu et al. (2019) and are not affected by the semi-supervised setting.
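The construction of the fully and partially labeled subsets can be sketched as below. The dialogue representation (a list of per-turn dictionaries with "state", "action", and "utterance" keys), the function name, and the fixed seed are assumptions made for illustration; the authors' data pipeline is not described at this level of detail.

```python
import random


def make_semi_supervised(dialogues, labeled_ratio, seed=0):
    """Split dialogues into fully labeled and partially labeled subsets.

    A random fraction `labeled_ratio` keeps all annotations; the remaining
    dialogues have their system action annotations removed.
    """
    if not 0.0 <= labeled_ratio <= 1.0:
        raise ValueError("labeled_ratio must lie in [0, 1]")
    rng = random.Random(seed)
    indices = list(range(len(dialogues)))
    rng.shuffle(indices)
    num_full = round(labeled_ratio * len(dialogues))
    full_ids = set(indices[:num_full])

    fully_labeled, partially_labeled = [], []
    for i, dialogue in enumerate(dialogues):
        if i in full_ids:
            fully_labeled.append([dict(turn) for turn in dialogue])
        else:
            # Drop only the action label; states and utterances are kept.
            partially_labeled.append(
                [{k: v for k, v in turn.items() if k != "action"}
                 for turn in dialogue]
            )
    return fully_labeled, partially_labeled
```

Unlabeled dialogues would be obtained the same way by additionally dropping the "state" key, leaving only the utterances.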
We use Entity-F1 and Success Rate to evaluate dialogue task completion. Entity-F1 computes the F1 score based on whether the requested information and indicated constraints from users are satisfied. Compared to the inform rate and match rate used by Budzianowski et al. (2018), Entity-F1 considers both informed and requested entities at the same time and balances recall and precision. Success Rate indicates the ratio of successful dialogues, where a dialogue is regarded as successful only if all informed and requested entities in the dialogue are matched. We use Turns to evaluate the cost of task completion, where a lower number indicates that the policy performs tasks more efficiently.
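The two task-completion metrics can be illustrated with a small set-based sketch. It assumes that entities are compared by exact match and that a dialogue succeeds when every reference entity appears among the entities the system provided; the precise matching rules of the evaluation toolkit are not given in this text, and the function names are hypothetical.

```python
def entity_f1(predicted, reference):
    """Harmonic mean of precision and recall over matched entities."""
    predicted, reference = set(predicted), set(reference)
    matched = len(predicted & reference)
    if matched == 0:
        return 0.0
    precision = matched / len(predicted)
    recall = matched / len(reference)
    return 2 * precision * recall / (precision + recall)


def success_rate(dialogues):
    """Fraction of dialogues in which all reference entities are matched.

    dialogues: iterable of (predicted_entities, reference_entities) pairs
    """
    dialogues = list(dialogues)
    if not dialogues:
        return 0.0
    successes = sum(
        1 for predicted, reference in dialogues
        if set(reference) <= set(predicted)
    )
    return successes / len(dialogues)
```

Under these assumptions Success is the stricter metric: a dialogue that matches two of three required entities still earns partial Entity-F1 credit but counts as a failure.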
We compare Act-VRNN with three policy learning baselines: (1) PPO (Schulman et al., 2017) using a handcrafted reward setting; (2) ALDM (Liu and Lane, 2018); (3) GDPL (Takanobu et al., 2019). We further consider using semi-supervised techniques to enhance the baselines under the semi-supervised setting, and denote them as SS-PPO, SS-ALDM, and SS-GDPL. Specifically, we first train a prediction model based on semi-supervised VAE (Kingma et al., 2014), and use the prediction results as action annotations for expert demonstrations (see footnote 1). We also compare the full model Act-VRNN with its two variants: (1) SS-VRNN uses a VRNN that consumes predicted action labels instead of action embeddings; (2) Act-GDPL feeds expert demonstrations enriched by action embeddings to the same reward function as GDPL.

Overall Results
Table 2 shows that our proposed model consistently outperforms other models in the setting that uses fully and partially annotated dialogues (D_F and D_P). Act-VRNN improves task completion (measured by Entity-F1 and Success) while requiring less cost (measured by Turns). For example, Act-VRNN (81.8) outperforms SS-GDPL (60.4) by a relative 35.4% under Success with 10% fully annotated dialogues, and requires the fewest turns. Meanwhile, we find that both action learning and the dynamics model are essential to the superiority of Act-VRNN. For example, Act-VRNN achieves 19.8% and 11.2% improvements over SS-VRNN and Act-GDPL, respectively, under Success with 20% fully annotated dialogues. This validates that the learned action embeddings capture similarities among actions well, and that VRNN is able to exploit such similarities for reward estimation.
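The quoted 35.4% is consistent with a relative (not absolute) improvement computed from the two Success scores given in the text, as the short check below shows. The helper name is hypothetical.

```python
def relative_improvement(new_score, baseline_score):
    """Relative improvement of new_score over baseline_score, in percent."""
    if baseline_score == 0:
        raise ValueError("baseline score must be non-zero")
    return (new_score - baseline_score) / baseline_score * 100.0


# Success of Act-VRNN (81.8) vs. SS-GDPL (60.4) with 10% fully annotated dialogues.
gain = relative_improvement(81.8, 60.4)
print(f"{gain:.1f}%")
```

The absolute gap between the two scores is 21.4 points; dividing by the baseline score of 60.4 gives the relative figure.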
We further find that the improvements brought by the Semi-VAE enhancement are limited for the baselines, especially when the ratio of fully annotated dialogues is low. For example, SS-PPO and SS-GDPL achieve 6% and 7% improvements over their counterparts under Success with 5% fully annotated dialogues. Similar results are also observed for the pseudo-label approach. In general, the pseudo-label methods are outperformed by their Semi-VAE counterparts and are even worse than the baselines without enhancement when the ratio of fully annotated dialogues is low. For example, in setting D_F + D_P, pseudo-label enhanced PPO performs worse than PPO under Entity-F1 when the ratio of fully annotated dialogues is 5% and 10% (37.2 vs 41.8, 39.2 vs 45.3), and only achieves a slight gain when the ratio is 20% (51.0 vs 50.6). This is largely because the prediction accuracy of Semi-VAE and the pseudo-label approach might be low with a small amount of fully annotated dialogues, and the expert dialogues with mispredicted actions impair the reward function learning of the baselines. Act-VRNN overcomes this challenge with the generalization ability brought by modeling dialogue progress in an action embedding space for reward estimation.

Footnote 1: We also experimented with the pseudo-label approach (Lee, 2013), and the empirical results were worse than Semi-VAE. Thus, we only report the Semi-VAE enhancement results in the table for simplicity.
The results for policy learning using unlabeled dialogues (D_U) are shown in Table 3. We consider two settings: (1) having fully labeled and unlabeled dialogues, i.e., D_F + D_U; (2) having all three types of dialogues, i.e., D_F + D_P + D_U. We can see that Act-VRNN significantly outperforms the baselines in both settings. For example, in setting D_F + D_U, Act-VRNN outperforms SS-GDPL by 43% and 44% under Entity-F1 and Success, respectively. Similar results are also observed in setting D_F + D_P + D_U. We further find that SS-VRNN outperforms Act-GDPL in these two settings while the results are opposite in setting D_F + D_P; we discuss this in detail in the following section. By comparing the results of Act-VRNN and the baselines across these two settings, we can see that Act-VRNN better exploits the additional partially labeled dialogues. For example, SS-GDPL improves by only 2.3% under Success, while Act-VRNN improves by more than 5%.

Discussions
We first study the effects of the action learning module in Act-VRNN. We compare Act-VRNN with SS-VRNN, and with their counterparts that do not use the state transition based objective in semi-supervised learning (i.e., optimizing Eqn. 3 instead of Eqn. 7). These two variants are denoted as Act-VRNN (no state) and SS-VRNN (no state). For a thorough investigation, under each setting, we further show the performance on dialogues spanning different numbers of domains. Dialogues spanning more domains are considered more difficult. The results under two supervision ratio settings are shown in Fig. 2(a) and Fig. 2(b). We can see that Act-VRNN outperforms the other variants in each configuration, especially in the dialogues that include more than one domain. This is largely because the learned action embeddings effectively discover the similarities between actions across domains, and thus lead to better generalization of reward estimation. We further find that the state transition based objective we formulated fits well with the VRNN-based reward estimator. Both Act-VRNN and SS-VRNN achieve performance gains when optimized with state transitions.
Last, we study the effects of the dynamics model based reward function in Act-VRNN. We consider four different models as the reward function: (1) our full dynamics model VRNN; (2) a dynamics model having only deterministic states (Eqn. 17), denoted as VRNN (deterministic only); (3) a dynamics model having only stochastic states, denoted as VRNN (stochastic only); (4) the reward function of GDPL. We can see that both stochastic and deterministic states in VRNN are important, since VRNN outperforms its two variants and GDPL in each configuration. We further find that the contribution of stochastic and deterministic states may vary across settings. For example, VRNN (stochastic only) consistently outperforms VRNN (deterministic only) in D_F + D_U, while opposite results are observed in D_F + D_P when the ratio of D_F is over 20%. This is largely because modeling dialogue progress using stochastic states can provide more stable estimation with fewer supervision signals, while the incorporation of deterministic states can lead to more precise estimation when more information about expert demonstrations is available.

Related Work
Reward learning aims to provide more effective and sufficient supervision signals for dialogue policy. Early studies focus on learning the reward function from external evaluations, e.g., user experience feedback (Gašić et al., 2013), objective ratings (Su et al., 2015; Ultes et al., 2017), or a combination of multiple evaluations (Su et al., 2016; Chen et al., 2019). These approaches often assume a human-in-the-loop setting where interactions with real users are available during training, which is expensive and difficult to scale. As more large-scale, high-quality dialogue corpora become available (e.g., MultiWOZ (Budzianowski et al., 2018)), recent years have seen a growing interest in learning the reward function from expert demonstrations. Most recent approaches apply inverse reinforcement learning techniques for dialogue policy learning (Takanobu et al., 2019; Li et al., 2019b). These all require complete state-action annotations for expert demonstrations. We aim to overcome this limitation in this study.

Conclusions
We study the problem of semi-supervised policy learning and propose Act-VRNN to provide more effective and stable reward estimation. We formulate a generative model to jointly infer action labels and learn action embeddings. We design a novel reward function that first models dialogue progress and then estimates action rewards by determining whether the action leads to progress similar to that of expert dialogues. The experimental results confirm that Act-VRNN achieves better task completion than the state of the art in two settings that consider partially labeled or unlabeled dialogues. For future work, we will explore the scenarios where annotations are absent for all expert dialogues.

Figure 2: Effects of action learning (D_F and D_P)
Figure 3:

Table 1: State-Action Annotation and Utterance Example

Table 2: Semi-Supervised Policy Learning Results (D_F and D_P)

Table 3: Semi-Supervised Policy Learning Results (D_F, D_P, and D_U)
* Note that PPO and GDPL achieve the same results as D_F (10%) + D_P (90%) in Table 2 since they can only utilize dialogues in D_F.