Actor-Double-Critic: Incorporating Model-Based Critic for Task-Oriented Dialogue Systems

In order to improve the sample-efficiency of deep reinforcement learning (DRL), we implemented the imagination augmented agent (I2A) in spoken dialogue systems (SDS). Although I2A achieves a higher success rate than baselines by augmenting the policy network with predicted future states, its complicated architecture introduces unwanted instability. In this work, we propose actor-double-critic (ADC) to improve the stability and overall performance of I2A. ADC simplifies the architecture of I2A to reduce excessive parameters and hyper-parameters. More importantly, a separate model-based critic shares parameters between actions and makes back-propagation explicit. In our experiments on the Cambridge Restaurant Booking task, ADC enhances success rates considerably and shows robustness to imperfect environment models. In addition, ADC is more stable and sample-efficient: it significantly reduces the standard deviation of success rates relative to the baseline and reaches the 80% success rate with half of the training data.


Introduction
Spoken Dialogue Systems (SDS) enable human-computer interaction via natural language. The core of SDS, dialogue management, can be formulated as an RL problem (Levin et al., 1997; Young et al., 2013; Williams, 2008). Great advancements can be achieved with deep RL algorithms (Dhingra et al., 2016; Chang et al., 2017; Takanobu et al., 2019; Wu et al., 2020). Yet, deep RL methods are notoriously expensive in terms of the number of interactions they require. Even relatively simple tasks can require thousands of labelled dialogues, and modelling complex behaviour such as a multi-domain application might need substantially more (Gašić et al., 2011; Su et al., 2016).
Model-based reinforcement learning (MBRL) is one way of improving sample-efficiency in RL (Tamar et al., 2016; Silver et al., 2016; Gu et al., 2016; Nagabandi et al., 2018; Oh et al., 2017). By learning the environment model, we can predict the future states after taking a certain action. In a dialogue system, that means the system can predict the user's behaviour. In contrast, model-free RL algorithms only learn the mapping from belief states to Q-values and do not make use of the user behaviour patterns in the training data. In other words, model-free RL wastes interactions by going through similar transitions multiple times to obtain accurate return estimates. Dyna-Q (Sutton, 1990; Sutton et al., 2012) has achieved some success in SDS (Peng et al., 2018; Su et al., 2018; Wu et al., 2019; Zhang et al., 2019) by generating training data for agents while continually improving its environment model from real interactions between agents and users. Nevertheless, the noisy data generated by inaccurate environment models could adversely affect the experience replay buffer and result in convergence toward suboptimal performance. This problem is even more critical in real-world tasks such as real-world dialogue systems, where training a good environment model is challenging.
I2A (Weber et al., 2017) addresses this problem by augmenting model-based information into the input of policy networks in order to filter out the noise generated by poor environment models. However, I2A introduces unwanted instability when we applied it to a dialogue system due to its complex architecture and excessive hyper-parameters. The unstable performance makes it even harder to tune the parameters.
In this paper, we propose Actor-Double-Critic (ADC), a new architecture to augment model-based information into the policy network. By training two critics from model-free and model-based data separately and combining them in an ensemble, we reduce the number of redundant parameters and make back-propagation more efficient. In the Cambridge Restaurant dialogue system task, experimental results show a substantial improvement in success rates. Regarding sample efficiency, ADC needs only half of the baseline's training data to achieve the 80% success rate. In addition, ADC is the most stable approach among all considered baselines. Compared to a model-free actor-critic algorithm, ACER (Wang et al., 2016), it reduces the standard deviation of success rates from 7.7 to 1.2. It also proves more stable than a Bayesian model-free algorithm, GP-SARSA (Gašić et al., 2010).
Figure 1: ADC architecture. Green blocks indicate predicted belief states. a) The environment model predicts the next belief state $b_{t+1,a_i}$ conditioned on an action $a_i$. b) The actor outputs the policy $\pi$ as in a standard actor-critic architecture. c) The two critics estimate Q-values based on the current belief state and the predicted next belief states, respectively. Final Q-values are the weighted sum of the outputs of the two critics. Note that the model-based critic predicts the i-th Q-value based on $b_{t+1,a_i}$, so this process is repeated for all actions $a_i \in A$ to obtain all of the Q-values.

Dialogue management through reinforcement learning
Dialogue management can be cast as a continuous MDP (Young et al., 2013) composed of a continuous multivariate belief state space $B$, a finite set of actions $A$ and a reward function $R(b_t, a_t)$. The belief state $b$ is a probability distribution over all possible (discrete) states. At a given time $t$, the agent (policy) observes the belief state $b_t \in B$ and executes an action $a_t \in A$. The agent then receives a reward $r_t \in \mathbb{R}$ drawn from $R(b_t, a_t)$. The policy $\pi$ is defined as a function $\pi : B \times A \to [0, 1]$ that takes an action $a$ in a state $b$ with probability $\pi(b, a)$. For any policy $\pi$ and $b \in B$, the value function $V^{\pi}$ corresponding to $\pi$ is defined as:
$$V^{\pi}(b) = \mathbb{E}\left[ r_t + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \dots \mid b_t = b, \pi \right] \quad (1)$$
where $0 \le \gamma \le 1$ is a discount factor and $r_t$ is a one-step reward. The objective of reinforcement learning is to find an optimal policy $\pi^{*}$, i.e. a policy that maximizes the value function in each belief state. Equivalently, the goal is to find an optimal policy $\pi^{*}$ that maximises the discounted total return
$$R = \sum_{t=0}^{T-1} \gamma^{t} r_t(b_t, a_t) \quad (2)$$
over a dialogue with $T$ turns, where $r_t(b_t, a_t)$ is the reward when taking action $a_t$ in dialogue state $b_t$ at turn $t$ and $\gamma$ is the discount factor.
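As a minimal sketch of the discounted total return described above, the following Python function accumulates the per-turn rewards of one dialogue. The function name and the example reward sequence (−1 per turn, plus 20 on the final turn of a successful dialogue) are illustrative assumptions rather than part of the formulation itself.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Return sum_t gamma**t * rewards[t] for a single dialogue."""
    total = 0.0
    # Accumulate from the last turn backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical successful 3-turn dialogue: -1 per turn, +20 added on the last turn.
rewards = [-1.0, -1.0, -1.0 + 20.0]
print(discounted_return(rewards, gamma=1.0))  # 17.0
```

With a discount factor below 1, the same success bonus contributes less the later it arrives, which is one reason shorter successful dialogues are preferred.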

Imagination Augmented Agent (I2A)
I2A (Weber et al., 2017) manages to implicitly incorporate all the possible future information into the policy network. Basically, it can be divided into three hierarchies:
• Imagination core. An environment model is trained to predict future states and rewards conditioned on an action. By interacting with a baseline actor, the environment model is used to simulate potential trajectories.
• Single imagination roll-out. To efficiently use these simulated trajectories, the agent learns an encoder that extracts information from these imaginations, including both states and rewards. The encoder is designed to select useful information and ignore the noise generated by imperfect models.
• Augmentative architecture. For each possible action, simulated trajectories are generated. All the information extracted from the trajectories is concatenated and provided as additional context to a policy network.
However, we found that I2A's hierarchical architecture is not stable enough when applied to SDS tasks. This architecture contains several fragile components which have a strong impact on the performance, such as the environment model and the roll-out policy network. Excessive hyperparameters, like the roll-out depth and the embedded feature sizes for the encoder, also make it hard to tune parameters and to apply I2A to real-world applications.

Actor-Double-Critic (ADC)
To increase the stability of the augmenting-style approaches, we simplify the previous architecture and propose a key component: the model-based critic. As illustrated in Figure 1, we train two critics based on model-free and model-based information respectively and combine their outputs through a weighted sum in an ensemble.
In this section, we explain why we simplify the architecture in these ways and the benefits of using a model-based critic.

Simplified architecture
To reduce the model complexity, we simplify the architecture in the following three ways:
• Our environment model predicts only the next belief state $b_{t+1,a_i}$ conditioned on an action $a_i$; it does not predict rewards, because reward signals in the SDS domain are sparse and hard to predict.
• In I2A, the pre-trained environment model is not updated while learning the policy, since the policy network is robust to imperfect models. Besides, obtaining pre-training data is not challenging in a simulated game. However, in the real world, pre-training data for SDS is hard to collect. In our approach, in order to improve the sample efficiency, the environment model is updated during policy learning.
• We discard the roll-out policy network. Since the policy keeps changing during training, the predicted action sequences change as well. Because we aim to reduce the uncertainties in our framework, the roll-out length is set to 1, which removes the need for a roll-out policy network.

Model-based critic
By definition, a Q-value can be decomposed as:
$$Q^{\pi}(b_t, a_i) = \mathbb{E}\left[ r_t + \gamma V^{\pi}(b_{t+1,a_i}) \right] \quad (3)$$
In our experimental setting, $r_t$ is set to −1 for each turn to penalize lengthy dialogues. At the end of a dialogue, $r_t$ varies depending on the result, yet we do not need to predict Q-values at that time. Hence, $r_t$ is a constant in Equation 3 for dialogue system tasks. Given that $r_t$ and $\gamma$ are constants, we can train an estimator for $Q^{\pi}_i(b_t)$ based on the next belief state $b_{t+1,a_i}$, which is predicted by the environment model. We call this estimator the model-based critic in the actor-critic framework, while the original one is the model-free critic. Compared to previous approaches, adopting the model-based critic has the following three benefits:

Parameter sharing
Note that given $b_{t+1,a_i}$, the model-based critic of ADC predicts only one value $Q_i$. To obtain all of the Q-values, we first predict the next belief states $b_{t+1,a_i}$ for all $a_i \in A$ using the environment model, and then map each of them to $Q_i$ with the model-based critic. The parameters of the model-based critic are thus shared between actions, which reduces the model complexity.
In I2A, the predicted belief states $b_{t+1,a_i}$ for all $a_i \in A$ are concatenated into one large input vector. This means that the number of parameters in the model-based path of I2A grows with the number of actions, which is not the case in ADC. In practice, I2A has 1.4 million parameters, around five times more than ADC (240 thousand).
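To make the scaling argument concrete, the sketch below counts the parameters of two fully connected networks: one that maps a single predicted belief state to one Q-value (shared across actions) and one that takes all predicted belief states concatenated together. The belief size (268), number of actions (16) and hidden sizes (300 and 100) are taken from the experimental setup; the exact layer layouts are assumptions for illustration and are not expected to reproduce the parameter counts reported above.

```python
from typing import List


def mlp_param_count(layer_sizes: List[int]) -> int:
    """Number of weights plus biases in a fully connected network."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


BELIEF_SIZE = 268     # belief state size from the experimental setup
NUM_ACTIONS = 16      # action space size from the experimental setup
HIDDEN = [300, 100]   # hidden layer sizes from the experimental setup

# Shared critic: one predicted belief state in, one Q-value out,
# and the same weights are reused for every action.
shared = mlp_param_count([BELIEF_SIZE] + HIDDEN + [1])

# Concatenated input: all predicted belief states in, one Q-value per action out.
# The first layer grows linearly with the number of actions.
concat = mlp_param_count([BELIEF_SIZE * NUM_ACTIONS] + HIDDEN + [NUM_ACTIONS])

print(shared, concat)
```

In this toy layout, almost all of the extra parameters sit in the first layer of the concatenated network, which is the layer whose input width depends on the number of actions.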

No redundant connections
As shown in Equation 3, $Q_i$ does not depend on the other predicted belief states $b_{t+1,a_j}$ with $i \neq j$; it results only from the predicted belief state $b_{t+1,a_i}$. I2A, however, concatenates all of the predicted belief states and the current belief state to predict the Q-values. That is, most of the connections in I2A should be driven to zero weights during training. Using the model-based critic eliminates these redundant connections and predicts one $Q_i$ at a time, which improves the stability of the algorithm.
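The per-action computation described above can be sketched as a simple loop: predict the next belief state for each action, then score each prediction with the same critic. The function name, the callable interfaces and the toy stand-ins for the environment model and critic are all hypothetical; in practice both would be trained networks.

```python
from typing import Callable, List, Sequence

Belief = Sequence[float]


def model_based_q_values(
    belief: Belief,
    num_actions: int,
    env_model: Callable[[Belief, int], Belief],
    critic: Callable[[Belief], float],
) -> List[float]:
    """Compute one Q-value per action from one-step predicted belief states."""
    q_values = []
    for action in range(num_actions):
        # Q_i depends only on the belief state predicted for action i.
        next_belief = env_model(belief, action)
        q_values.append(critic(next_belief))
    return q_values


# Toy stand-ins (assumptions, not trained models).
def toy_env_model(belief: Belief, action: int) -> Belief:
    return [x + action for x in belief]


def toy_critic(next_belief: Belief) -> float:
    return float(sum(next_belief))


print(model_based_q_values([0.0, 1.0], 3, toy_env_model, toy_critic))  # [1.0, 3.0, 5.0]
```

Because the critic only ever sees one predicted belief state, there is no path through which the prediction for action j can influence the Q-value of action i.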

Explicit update signals
We can also predict $Q^{\pi}(b_t)$ through the model-free critic. The final Q-values are the weighted sum of the two critics, combined as an ensemble to lower the variance:
$$Q^{\pi}(b_t, a_i) = w \, Q^{\pi}_{MF}(b_t, a_i) + (1 - w) \, Q^{\pi}_{MB}(b_{t+1,a_i}) \quad (4)$$
where $Q^{\pi}_{MF}(b_t, a_i)$ is the output of the model-free critic, $Q^{\pi}_{MB}(b_{t+1,a_i})$ is the output of the model-based one, and $w$ is a weight parameter. We abbreviate them as $Q^{\pi}_{MF}$ and $Q^{\pi}_{MB}$ to keep the expressions succinct. The ensemble relies on the model-free path ($w = 1$) when the environment model is noisy, and on the model-based path ($w = 0$) when the model provides more accurate information. During training, $w$ is a hyperparameter and we compute a separate loss for each critic against the same target (Equation 5), where $Q^{ret}$ is the target of $Q^{\pi}$ computed with the Retrace algorithm.
Note that in each training iteration, we update both critics at the same time. In I2A, we cannot identify whether errors come from the model-based path or the model-free path. In our approach, the information flows clearly from two separate sources instead of an ambiguous one. We also tried back-propagating the loss from $Q^{\pi}$ through the whole network, but the results are better when we back-propagate the loss defined in Equation 5. This result again supports the use of a two-critic architecture.
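The difference between the two update styles can be illustrated with a small numerical sketch. It assumes a squared-error loss for each critic, which is an assumption made here for illustration; the function names are hypothetical and the numbers are made up.

```python
def ensemble_q(q_mf: float, q_mb: float, w: float = 0.5) -> float:
    """Weighted sum of the model-free and model-based critic outputs."""
    return w * q_mf + (1.0 - w) * q_mb


def separate_critic_loss(q_ret: float, q_mf: float, q_mb: float) -> float:
    """Each critic is penalized against the target on its own (assumed squared error)."""
    return (q_ret - q_mf) ** 2 + (q_ret - q_mb) ** 2


def ensemble_only_loss(q_ret: float, q_mf: float, q_mb: float, w: float = 0.5) -> float:
    """Only the combined output is penalized against the target."""
    return (q_ret - ensemble_q(q_mf, q_mb, w)) ** 2


# Made-up values: the two critics err in opposite directions.
q_ret, q_mf, q_mb = 1.0, 0.0, 2.0
print(ensemble_only_loss(q_ret, q_mf, q_mb))    # 0.0
print(separate_critic_loss(q_ret, q_mf, q_mb))  # 2.0
```

In this toy case the ensemble output matches the target exactly, so a loss on the ensemble alone gives neither critic any signal, while the separate loss still pushes each critic toward the target.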

Setup
Experiments are conducted on the Cambridge restaurant domain from the PyDial toolkit with a goal-driven user simulator operating on the semantic level (Schatzmann et al., 2007; Schatzmann and Young, 2009), an LSTM-based NLU model (Mrkšić et al., 2016), and an NLG model (Wen et al., 2015). During training, the agent is updated each time a dialogue terminates; one such update constitutes an iteration. Every 200 training dialogues, the agent is tested on 500 dialogues. Ten random seeds were run for each approach to analyze the variance arising from different initializations. The mean ± standard deviation is depicted as the shaded area in Figures 2 and 3. The x-axes of Figures 2 and 3 are in log scale to put emphasis on both the early stage and the final performance of the training process.
User simulator. To account for ASR errors, a 15% semantic error rate (SER) is included in the user simulator. The maximum dialogue length is set to 25 turns and the discount factor γ is 0.99. The reward is defined as 20 for a successful dialogue minus the number of turns it took to complete the dialogue.
Implementation details. The input for all models is the full dialogue belief state b of size 268 and the output action space consists of 16 possible actions. For NN-based algorithms, the size of a mini-batch is 64. ε-greedy exploration is used, with ε linearly reduced from 0.3 down to 0 over the training process. The actor and the critic each have two hidden layers of size 300 and 100. The Adam optimiser was used with an initial learning rate of 0.001 (Kingma and Ba, 2014). For algorithms employing experience replay, the replay memory has a capacity of 2000 interactions.
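The reward definition and the exploration schedule above can be written down as a short sketch. The function names are hypothetical, and decaying ε once per training dialogue is an assumption; the text only states that ε decreases linearly from 0.3 to 0 over training.

```python
def dialogue_reward(success: bool, num_turns: int, success_bonus: int = 20) -> int:
    """20 for a successful dialogue, minus the number of turns taken."""
    return (success_bonus if success else 0) - num_turns


def linear_epsilon(
    dialogue_index: int,
    total_dialogues: int,
    start: float = 0.3,
    end: float = 0.0,
) -> float:
    """Linearly interpolate epsilon from start to end over training."""
    fraction = min(max(dialogue_index / total_dialogues, 0.0), 1.0)
    return start + (end - start) * fraction


print(dialogue_reward(True, 5))    # 15
print(dialogue_reward(False, 25))  # -25
print(linear_epsilon(0, 4000))     # 0.3
print(linear_epsilon(4000, 4000))  # 0.0
```

A failed dialogue that runs to the 25-turn limit therefore scores −25, while a successful 5-turn dialogue scores 15.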

Dialogue agents for comparison
• GP-SARSA is a Bayesian baseline, which provides a stable performance by utilising uncertainty estimates.
• ACER is the model-free actor-critic baseline and can be perceived as a model-free counterpart of the proposed method. According to the benchmark results, it performs better than other actor-critic methods such as A2C (Fatemi et al., 2016) and eNAC. Since ADC can be applied to any model-free actor-critic method, we do not report the performance of every RL algorithm here. In this paper, we focus on the gap between ACER and ADC rather than the absolute performance. For a fair comparison, the pre-training data used by the model-based approaches were put into the experience buffer of ACER at the beginning of training.
• I2A is the model-based baseline. The environment model is pre-trained with 400 dialogues generated by interactions between a simulated user and an agent.
• ADC is the proposed method. The ensemble weight w is 0.5 for each critic. The environment model setting is the same as I2A.
Table 2: Final performance of each agent after training with 4000 dialogues. Over 10 runs, we report for each algorithm 1) the average success rate, 2) the standard deviation of success rates and 3) the average amount of data required to reach the 80% success rate. The latter two metrics are used to evaluate stability and sample-efficiency, respectively.

Comparison with baselines
Success rate. As shown in the left part of Figure 2 and Table 2, ADC outperforms the other methods considerably in terms of sample-efficiency, stability, and success rate. I2A performs better than ACER but is still sensitive to the initialization, as shown by the shaded areas. Compared to I2A, ADC roughly halves the standard deviation of final success rates, from 2.3 to 1.2. In contrast, GP-SARSA is quite stable due to its Bayesian nature. While the standard deviation of the final success rate of I2A is smaller than that of GP-SARSA, I2A is more unstable in the early stage of the training process. It is worth noticing that ADC is even more stable than GP-SARSA, and reaches higher performance in the end. In terms of sample efficiency, ADC needs only half of the data (600 dialogues) that ACER needs (1200 dialogues) to reach the 80% success rate.
Average turns per dialogue. As shown in the right part of Figure 2, GP-SARSA takes more turns than the other algorithms, and its turn count decreases only slightly during training. We found that GP-SARSA tends to take more turns to confirm the user's intention in order to stabilize its performance, although some of these confirmations are not necessary. The other approaches steadily reduce the number of turns during training.
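The three quantities used in this comparison can be computed from per-seed learning curves as in the sketch below. The helper names are hypothetical, the numbers are made up for illustration and are not results from our experiments, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev
from typing import List, Optional, Tuple


def dialogues_to_threshold(
    curve: List[Tuple[int, float]], threshold: float = 80.0
) -> Optional[int]:
    """First evaluation point (in training dialogues) at or above the threshold."""
    for num_dialogues, success_rate in curve:
        if success_rate >= threshold:
            return num_dialogues
    return None  # threshold never reached


def summarize_final_success(final_rates: List[float]) -> Tuple[float, float]:
    """Mean and standard deviation of final success rates across seeds."""
    return mean(final_rates), pstdev(final_rates)


# Made-up learning curve for one seed, evaluated every 200 training dialogues.
toy_curve = [(200, 41.0), (400, 72.5), (600, 81.0), (800, 86.0)]
# Made-up final success rates for three seeds.
toy_final = [90.0, 92.0, 94.0]

print(dialogues_to_threshold(toy_curve))
print(summarize_final_success(toy_final))
```

Averaging `dialogues_to_threshold` over seeds gives the sample-efficiency figure, while the standard deviation of the final success rates serves as the stability measure.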

Different back-propagation styles
In the left part of Figure 3, the red line is the learning curve of the agent that back-propagates a single loss from the ensemble output Q, while the brown line is the agent that updates each critic separately, with the loss back-propagated from the ensemble output passing only through the ensemble weight w.
We can note that the agent with the separate loss function (as in Equation 5) is more stable than the other one. This is because even when the ensemble Q is close to $Q^{ret}$, $Q_{MF}$ and $Q_{MB}$ are not necessarily close to the target $Q^{ret}$. In contrast, the separate update ensures that each output value is accurate.

Robustness to imperfect models
In order to examine the impact of the environment model on ADC, we propose another baseline, actor-model-based-critic (AMC). AMC uses only the model-based critic to predict Q-values, without the model-free critic, so the quality of the environment model is critical to AMC. In the experiment, a good environment model is pre-trained with 400 dialogues, and a poor environment model is pre-trained with only 200 dialogues.
In the right part of Figure 3, we can observe that ADC maintains its performance with a poor model, while AMC's performance drops considerably. This might be because a poor environment model cannot lead to accurate value prediction. It also indicates that the aid from the model-free critic is substantial.

Comparison in different environment settings
To further investigate the properties of ADC, we test it on 6 different environment (simulated user) settings. For each setting, we report the final performance of each agent after training it with 4000 dialogues. Semantic error rate (SER) models the noise from the ASR and NLU channel (Thomson et al., 2012). In addition to the standard user, an unfriendly one is defined, where the user barely provides any extra information to the system. The action masking mechanism is used in environments 1 & 3 to reduce the action space. The setting of each simulated user is listed in Table 3. The results are shown in Table 4. In clean environments (1 & 3), ACER learns well after 4000 dialogues. Yet, in noisy environments (2 & 4), ADC outperforms ACER significantly. In environment 5, an unfriendly user was used. However, this does not affect the algorithms much because the action mask is used, which reduces the number of available actions and therefore makes the task less difficult. It is worth noting that in environment 6, ADC outperforms the hand-crafted policy (89.6%) and demonstrates the flexibility of reinforcement learning, which can learn from its environment. Overall, ADC demonstrates its robustness in all environments, especially those without action masks.

Case study
To further investigate the behaviour of different agents during the training process, we sampled a dialogue session in environment 4 (the setting of the environment is mentioned in section 4.6) after 500 epochs. At that point, the ACER and ADC agents have success rates of 57% and 88%, respectively. As shown in Table 5, ACER informs the user of a restaurant at an early stage, while ADC is more conservative and takes more turns. ADC asks more questions before recommending a restaurant and sometimes confirms the booking to make sure that the one it provides fulfills all the requirements. Besides, ACER keeps asking the same question and sometimes gives a completely wrong reply. That is because, without the aid of an environment model, ACER cannot predict that the next belief state will be the same, and thus cannot foresee the unwanted repetitive conversation that leads to the failure of the dialogue.

Conclusions
The policy optimisation algorithm presented in this paper provides a model-based augmentation for actor-critic agents and improves their performance in spoken dialogue systems (SDS). Our contributions are two-fold: 1) We applied I2A, a model-based reinforcement learning approach, to SDS and demonstrated that it can exploit the rich information generated by environment models. 2) Our proposed algorithm further reduces instability by introducing a simple architecture to augment model-based information into the policy network. We used ACER as an actor-critic model-free baseline, but this method can augment any deep actor-critic algorithm.
One interesting topic for future research is model-based actors. In our experiments, incorporating a model-based actor did not work as effectively as ADC. We plan to address the problems that hinder model-based actors and make this algorithm applicable to policy learning approaches (Schulman et al., 2017; Takanobu et al., 2019).