Modeling Multi-Action Policy for Task-Oriented Dialogues

Dialogue management (DM) plays a key role in the quality of the interaction with the user in a task-oriented dialogue system. In most existing approaches, the agent predicts only one DM policy action per turn. This significantly limits the expressive power of the conversational agent and introduces unwanted turns of interactions that may challenge users’ patience. Longer conversations also lead to more errors and the system needs to be more robust to handle them. In this paper, we compare the performance of several models on the task of predicting multiple acts for each turn. A novel policy model is proposed based on a recurrent cell called gated Continue-Act-Slots (gCAS) that overcomes the limitations of the existing models. Experimental results show that gCAS outperforms other approaches. The datasets and code are available at https://leishu02.github.io/.


Introduction
In a task-oriented dialogue system, the dialogue manager policy module predicts actions, usually in terms of dialogue acts and domain-specific slots. It is a crucial component that influences the efficiency (e.g., the conciseness and smoothness) of the communication between the user and the agent. Both supervised learning (SL) (Stent, 2002; Williams et al., 2017a; Williams and Zweig, 2016; Henderson et al., 2005, 2008) and reinforcement learning (RL) approaches (Walker, 2000; Young et al., 2007; Gasic and Young, 2014; Williams et al., 2017b; Su et al., 2017) have been adopted to learn policies. SL learns a policy to predict acts given the dialogue state. Recent work (Wen et al., 2017; Liu and Lane, 2018) also used SL as pre-training for RL to mitigate the sample inefficiency of RL approaches and to reduce the number of interactions. Sequence2Sequence (Seq2Seq) (Sutskever et al., 2014) approaches have also been adopted in user simulators to produce user acts (Gur et al., 2018). These approaches typically assume that the agent can only produce one act per turn through classification. Generating only one act per turn significantly limits what an agent can do in a turn and leads to lengthy dialogues, making tracking of state and context throughout the dialogue harder. An example in Table 1 shows how the agent can produce both an inform and a multiple choice act, reducing the need for additional turns. Multiple actions have previously been used in interaction managers that keep track of the floor (who is speaking right now) (Raux and Eskénazi, 2007; Khouzaimi et al., 2015; Hastie et al., 2013, among others), but the option of generating multiple acts simultaneously at each turn for dialogue policy has been largely ignored, and has only been explored in simulated scenarios without real data (Chandramohan and Pietquin, 2010).
This task can be cast as a multi-label classification problem (if the sequential dependency among the acts is ignored) or as a sequence generation one as shown in Table 2.
In this paper, we introduce a novel policy model to output multiple actions per turn (called multi-act), generating a sequence of tuples and expanding agents' expressive power. Each tuple is defined as (continue, act, slots), where continue indicates whether to continue or stop producing new acts, act is an act type (e.g., inform or request), and slots is a set of slots (names) associated with the current act type. Correspondingly, a novel decoder (Figure 1) is proposed to produce such sequences. Each tuple is generated by a cell called gated Continue Act Slots (gCAS, as in Figure 2), which is composed of three sequentially connected gated units handling the three components of the tuple. This decoder can generate multi-acts in a double recurrent manner (Tay et al., 2018). We compare this model with baseline classifiers and sequence generation models and show that it consistently outperforms them.
Figure 1: CAS decoder: at each step, a tuple of (continue, act, slots) is produced. The KB vector k regarding the queried result from knowledge base is not shown for brevity.
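The two framings of the task (multi-label classification versus a sequence of (continue, act, slots) tuples) can be illustrated with a minimal sketch. The helper names `to_multilabel` and `to_cas_tuples` are hypothetical, and the layout of the terminating tuple (`stop`, `pad`, empty slot set) is an assumption made for illustration rather than the encoding used by the authors.

```python
from typing import List, Set, Tuple

# One annotated agent turn: an ordered list of (act, slot names).
# Slot values are omitted because the tuple holds slot names only.
Turn = List[Tuple[str, List[str]]]


def to_multilabel(turn: Turn) -> Set[str]:
    """Multi-label view: an unordered set of act+slot labels (act order is lost)."""
    return {f"{act}+{slot}" for act, slots in turn for slot in slots}


def to_cas_tuples(turn: Turn) -> List[Tuple[str, str, Set[str]]]:
    """Sequence view: one (continue, act, slots) tuple per act, then a stop tuple."""
    tuples = [("continue", act, set(slots)) for act, slots in turn]
    tuples.append(("stop", "pad", set()))  # assumed encoding of the end of the turn
    return tuples


example = [("inform", ["moviename", "genre"]), ("multiple_choice", ["moviename"])]
labels = to_multilabel(example)
cas = to_cas_tuples(example)
```

The multi-label view discards the order of the acts, whereas the tuple sequence keeps it and groups each act with its own slot set.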

Methodology
The proposed policy network adopts an encoder-decoder architecture (Figure 1). The input to the encoder is the current-turn dialogue state, which follows the definition of Li et al. (2018). It contains policy actions from the previous turn, user dialogue acts from the current turn, user requested slots, user informed slots, agent requested slots and agent proposed slots. We treat the dialogue state as a sequence and adopt a GRU (Cho et al., 2014) to encode it. The encoded dialogue state is a sequence of vectors $E = (e_0, \ldots, e_l)$ and the last hidden state is $h_E$. The CAS decoder recurrently generates one tuple at each step. It takes $h_E$ as its initial hidden state $h_0$. At each decoding step, the input contains the previous (continue, act, slots) tuple $(c_{t-1}, a_{t-1}, s_{t-1})$. An additional vector $k$ containing the number of results from the knowledge base (KB) query and the current turn number is given as input. The output of the decoder at each step is a tuple $(c, a, s)$, where $c \in \{\text{continue}, \text{stop}, \text{pad}\}$, $a \in A$ (one act from the act set), and $s \subset S$ (a subset of the slot set).
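The decoding procedure described above can be sketched as a simple greedy loop. This is an illustration under stated assumptions, not the authors' implementation: `cell` is a hypothetical stand-in for a trained decoder cell, the start tuple and the `max_steps` cap are assumptions, and `scripted_cell` is a toy that replays a fixed script so the loop can be run without a model.

```python
from typing import Callable, List, Sequence, Set, Tuple

CasTuple = Tuple[str, str, Set[str]]  # (continue, act, slots)


def decode(cell: Callable, h_E, k: Sequence[float], max_steps: int = 5) -> List[CasTuple]:
    """Greedy CAS decoding loop.

    `cell(prev_tuple, k, h)` is a hypothetical stand-in for a trained cell;
    it returns (next_tuple, next_hidden_state).
    """
    prev: CasTuple = ("continue", "pad", set())  # assumed start-of-sequence tuple
    h = h_E                                      # h_0 is the encoder's last hidden state
    acts: List[CasTuple] = []
    for _ in range(max_steps):                   # assumed cap on acts per turn
        prev, h = cell(prev, k, h)
        if prev[0] != "continue":                # a stop (or pad) decision ends the turn
            break
        acts.append(prev)
    return acts


def scripted_cell(prev, k, h):
    """Toy cell replaying a fixed script; the 'hidden state' is just a step index."""
    script = [
        ("continue", "inform", {"moviename", "genre"}),
        ("continue", "multiple_choice", {"moviename"}),
        ("stop", "pad", set()),
    ]
    return script[h], h + 1


# k holds the KB result count and the turn number, as in the text (values are made up).
turn = decode(scripted_cell, h_E=0, k=[3.0, 1.0])
```

Because each step emits a whole act with its slot set, the loop runs once per act plus one stop step, which keeps output sequences short.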

gCAS Cell
As shown in Figure 2, the gated CAS cell contains three sequentially connected units that output continue, act, and slots respectively.
The Continue unit maps the previous tuple $(c_{t-1}, a_{t-1}, s_{t-1})$ and the KB vector $k$ into $x^c_t$. The hidden state from the previous step $h_{t-1}$ and $x^c_t$ are inputs to a $\mathrm{GRU}^c$ unit that produces output $g^c_t$ and hidden state $h^c_t$. Finally, $g^c_t$ is used to predict $c_t$ through a linear projection and a softmax.
The Act unit maps the tuple $(c_t, a_{t-1}, s_{t-1})$ and the KB vector $k$ into $x^a_t$. The hidden state from the continue cell $h^c_t$ and $x^a_t$ are inputs to a $\mathrm{GRU}^a$ unit that produces output $g^a_t$ and hidden state $h^a_t$. Finally, $g^a_t$ is used to predict $a_t$ through a linear projection and a softmax.
The Slots unit maps the tuple $(c_t, a_t, s_{t-1})$ and the KB vector $k$ into $x^s_t$. The hidden state from the act cell $h^a_t$ and $x^s_t$ are inputs to a $\mathrm{GRU}^s$ unit that produces output $g^s_t$ and hidden state $h^s_t$. Finally, $g^s_t$ is used to predict $s_t$ through a linear projection and a sigmoid.
Table 2:
  annotation: inform(moviename=The Witch, The Other Side of the Door, The Boy; genre=thriller) multiple choice(moviename)
  classification: inform+moviename, inform+genre, multiple choice+moviename
  sequence:
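The three units described in this section can be written out as equations. The following is a reconstruction from the prose, not the paper's original typesetting: the affine input maps, the concatenation $[\cdot\,;\cdot]$, and the weight names $W$, $b$ are assumed notation, and the sigmoid on the slots output is assumed because $s_t$ is a subset of $S$ (a multi-label prediction).

```latex
\begin{align}
x^c_t &= W^c_x\,[c_{t-1};\, a_{t-1};\, s_{t-1};\, k] + b^c_x \\
g^c_t,\; h^c_t &= \mathrm{GRU}^c(x^c_t,\; h_{t-1}) \\
P(c_t) &= \mathrm{softmax}(W^c_g\, g^c_t + b^c_g) \\[4pt]
x^a_t &= W^a_x\,[c_t;\, a_{t-1};\, s_{t-1};\, k] + b^a_x \\
g^a_t,\; h^a_t &= \mathrm{GRU}^a(x^a_t,\; h^c_t) \\
P(a_t) &= \mathrm{softmax}(W^a_g\, g^a_t + b^a_g) \\[4pt]
x^s_t &= W^s_x\,[c_t;\, a_t;\, s_{t-1};\, k] + b^s_x \\
g^s_t,\; h^s_t &= \mathrm{GRU}^s(x^s_t,\; h^a_t) \\
P(s_t) &= \mathrm{sigmoid}(W^s_g\, g^s_t + b^s_g)
\end{align}
```

Under this reading, the hidden state is threaded through the three units within a step ($h_{t-1} \to h^c_t \to h^a_t \to h^s_t$), and $h^s_t$ would presumably serve as $h_t$ for the next step; each unit also conditions on the decisions already made in the current step.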

Experiments
The experiment dataset comes from Microsoft Research (MSR). It contains three domains: movie, taxi, and restaurant. The total count of dialogues per domain and the train/valid/test split are reported in Table 3. At every turn both user and agent acts are annotated; we use only the agent side as targets in our experiment. The acts are ordered in the dataset (each output sentence aligns with one act). The sizes of the sets of acts, slots, and act-slot pairs are also listed in Table 3. Table 4 shows the count of turns with multiple act annotations, which amounts to 23% of the dataset. We use MSR's dialogue management code and knowledge base to obtain the state at each turn and use it as input to every model.

Evaluation Metrics
We evaluate the performance at the act, frame and task completion levels. For a frame to be correct, both the act and all the slots should match the ground truth. We report precision, recall and $F_1$ score of turn-level acts and frames. For task completion evaluation, Entity $F_1$ score and Success $F_1$ score (Lei et al., 2018) are reported. The Entity $F_1$ score compares the slots requested by the agent with the slots the user informed about and that were used to perform the KB query. We use it to measure agent performance in requesting information. The Success $F_1$ score compares the slots provided by the agent with the slots requested by the user. We use it to measure the agent performance in providing information.
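A minimal sketch of the turn-level act and frame scores follows. It assumes each turn is represented as a set of frames, where a frame is an (act, frozenset of slot names) pair, and it assumes micro-averaging of counts over turns; the text does not state the averaging scheme, so that choice and the helper names `prf` and `acts_only` are illustrative only.

```python
from typing import Hashable, List, Set, Tuple


def prf(pred_turns: List[Set[Hashable]], gold_turns: List[Set[Hashable]]) -> Tuple[float, float, float]:
    """Precision, recall and F1 from counts pooled over turns (assumed micro-average)."""
    tp = fp = fn = 0
    for pred, gold in zip(pred_turns, gold_turns):
        tp += len(pred & gold)   # items predicted and present in the ground truth
        fp += len(pred - gold)   # items predicted but not in the ground truth
        fn += len(gold - pred)   # ground-truth items that were missed
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def acts_only(turns):
    """Project frames (act, slots) down to acts for act-level scoring."""
    return [{act for act, _ in frames} for frames in turns]


gold = [{("inform", frozenset({"moviename", "genre"})),
         ("multiple_choice", frozenset({"moviename"}))}]
pred = [{("inform", frozenset({"moviename"})),
         ("multiple_choice", frozenset({"moviename"}))}]

frame_scores = prf(pred, gold)                      # the inform frame misses 'genre'
act_scores = prf(acts_only(pred), acts_only(gold))  # both acts are correct
```

In this toy turn both acts are right, but the inform frame is counted as wrong because one slot is missing, which is why frame-level scores are stricter than act-level ones.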
Critical slots and non-critical slots: By 'non-critical', we mean slots that the user informs the system about by providing their values, so it is not critical for the system to provide them in the output. Table 1 shows an example, with the genre slot provided by the user and the system repeating it in its answer. Critical slots are slots that the system must provide, like moviename in the Table 1 example. Although non-critical slots do not impact task completion directly, they may influence the output quality by enriching the dialogue state and helping users understand the system's utterance correctly. Furthermore, given the same dialogue state, utterances that offer non-critical slots and utterances that omit them can both be present in the dataset, as these slots are optional. This makes the prediction of those slots more challenging for the system. To provide a more detailed analysis, we report the turn-level precision, recall and $F_1$ score for all slots, critical slots and non-critical slots of the inform act.
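One way to operationalize the distinction is to split the slots of an inform act by whether the user has already informed them. This is a sketch of that reading only; the function name `split_inform_slots` and the use of a plain set of user-informed slot names are assumptions, not the authors' evaluation code.

```python
from typing import Set, Tuple


def split_inform_slots(inform_slots: Set[str], user_informed: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (critical, non_critical) slot names for one inform act."""
    non_critical = inform_slots & user_informed  # the user already supplied these values
    critical = inform_slots - user_informed      # the system must provide these
    return critical, non_critical


# Mirrors the Table 1 example: the user gave the genre, the system supplies the movie name.
critical, non_critical = split_inform_slots({"moviename", "genre"}, {"genre"})
```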

Baseline
We compare five methods on the multi-act task.
Classification replicates the MSR challenge (Li et al., 2018) policy network architecture: two fully connected layers. We change the last activation from softmax to sigmoid in order to predict a probability for each act-slot pair. This is equivalent to binary classification for each act-slot pair, and the loss is the sum of the binary cross-entropy over all of them.
Table 6: Precision (P), Recall (R) and $F_1$ score ($F_1$) of turn-level acts and frames.
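The summed binary cross-entropy for the classification baseline can be sketched in plain Python. The function names and the clamping constant `eps` are assumptions added for numerical safety; a real implementation would normally rely on a deep learning framework's built-in loss.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Logistic function applied independently to each act-slot logit."""
    return 1.0 / (1.0 + math.exp(-z))


def summed_bce(probs: Sequence[float], targets: Sequence[int], eps: float = 1e-12) -> float:
    """Sum of per-label binary cross-entropy terms over all act-slot pairs."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0); eps is an assumption
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total


# Two act-slot pairs with logits of 0 (probability 0.5 each), one positive and one negative.
probs = [sigmoid(0.0), sigmoid(0.0)]
loss = summed_bce(probs, [1, 0])
```

With a sigmoid output each act-slot pair is scored independently, so several pairs can be active in the same turn, unlike with a softmax over pairs.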
Seq2Seq (Sutskever et al., 2014) encodes the dialogue state as a sequence, and decodes agent acts as a sequence with attention (Bahdanau et al., 2015).
Copy Seq2Seq (Gu et al., 2016) adds a copy mechanism to Seq2Seq, which allows copying words from the encoder input.
CAS adopts a single GRU (Cho et al., 2014) for decoding and uses three different fully connected layers for mapping the output of the GRU to continue, act and slots. For each step in the sequence of CAS tuples, given the output of the GRU, the continue, act and slots predictions are obtained by separate heads, each with one fully connected layer. The hidden state of the GRU and the predictions at the previous step are passed to the cell at the next step, connecting them sequentially.
gCAS uses our proposed recurrent cell, which contains separate continue, act and slots units that are sequentially connected.
The classification architecture has two fully connected layers of size 128, and the remaining models have a hidden size of 64 and a teacher-forcing rate of 0.5. Seq2Seq and Copy Seq2Seq use beam search with beam size 10 during inference. CAS and gCAS do not adopt beam search since they require far fewer inference steps than the Seq2Seq methods. All models use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001.
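A teacher-forcing rate of 0.5 can be read as follows: during training, the decoder is fed the ground-truth previous output with probability 0.5 and its own previous prediction otherwise. The sketch below assumes the coin is flipped at every decoding step; the text does not say whether the choice is made per step or per sequence, and `next_input` is a hypothetical helper.

```python
import random
from typing import Any


def next_input(gold_prev: Any, pred_prev: Any, rate: float = 0.5, rng=random) -> Any:
    """Pick the decoder's next input: ground truth with probability `rate`, else the model's own output."""
    # rng.random() lies in [0, 1), so rate=1.0 always teacher-forces and rate=0.0 never does.
    return gold_prev if rng.random() < rate else pred_prev


rng = random.Random(0)  # seeded generator so the sketch is reproducible
choices = [next_input("gold", "pred", rate=0.5, rng=rng) for _ in range(1000)]
```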

Result and Error Analysis
As shown in Table 5, gCAS outperforms all other methods on Entity $F_1$ in all three domains. Compared to Seq2Seq, the performance advantage of gCAS in the taxi and restaurant domains is small, while it is more evident in the movie domain. The reason is that in the movie domain the proportion of turns with multiple acts is higher (52%), while in the other two domains it is lower (30%). gCAS also outperforms all other models in terms of Success $F_1$ in the movie and restaurant domains but is outperformed by the classification model in the taxi domain. The reason is that in the taxi domain, the agent usually informs the user at the last turn, while in all previous turns the agent usually requests information from the user. It is easy for the classification model to overfit this pattern. The advantage of gCAS in the restaurant domain is much more evident: the agent's inform act usually has multiple slots (see example 2 in Table 7), which makes classification and sequence generation harder, but gCAS's multi-label slot decoder handles it easily.
Table 6 shows the turn-level act and frame prediction performance. CAS and gCAS outperform all other models in act prediction in terms of $F_1$ score. The main reason is that CAS and gCAS output a tuple at each recurrent step, which makes for shorter sequences that are easier to generate compared to the long sequences of Seq2Seq (example 2 in Table 7). The classification method has a good precision score but a lower recall score, suggesting it has problems making granular decisions (example 2 in Table 7). At the frame level, gCAS still outperforms all other methods. The performance difference between CAS and gCAS on frames becomes much more evident, suggesting that gCAS is more capable of predicting slots that are consistent with the act. This finding is also consistent with their Entity $F_1$ and Success $F_1$ performance. However, gCAS's act-slot pair performance is far from perfect. The most common failure case is on non-critical slots (like 'genre' in the example in Table 2): gCAS does not predict them, while it predicts the critical ones (like 'moviename' in the example in Table 2).
Table 9: P, R and $F_1$ of turn-level inform critical slots.
Table 7 shows the predictions of all methods on two emblematic examples. Example 1 is a frequent single-act, multi-slot agent act. Example 2 is a complex multi-act example. The baseline classification method can predict frequent pairs in the dataset, but cannot predict any act in the complex example. The generated sequences of Copy Seq2Seq and Seq2Seq show that both models struggle to follow the syntax. CAS cannot predict slots correctly even if the act is common in the dataset. gCAS returns a correct prediction for Example 1, but for Example 2 gCAS cannot predict 'starttime', which is a non-critical slot.
Tables 8 and 9 show the results for all slots, critical slots and non-critical slots under the inform act. gCAS performs better than the other methods on all slots in the movie and restaurant domains. The reason why classification performs best in the taxi domain is the same as for Success $F_1$: in the taxi domain, the agent usually informs the user at the last turn. The non-critical slots are also repeated frequently in the taxi domain, which makes their prediction easier. gCAS's performance is close to that of the other methods on critical slots. The reason is that the inform act is mostly the first act in a multi-act turn and critical slots are usually frequent in the data, so all methods can predict them well.
In the movie and restaurant domains, the inform act usually appears during the dialogue and there are many optional non-critical slots that can appear (see Table 3: the movie and restaurant domains have more slots and pairs than the taxi domain). gCAS predicts the non-critical slots better than the other methods. However, the overall performance on non-critical slots is much worse than on critical slots, since their appearances are optional and inconsistent in the data.

Conclusion and Future Work
In this paper, we introduced a multi-act dialogue policy model motivated by the need for a richer interaction between users and conversation agents. We studied classification and sequence generation methods for this task, and proposed a novel recurrent cell, gated CAS, which allows the decoder to output a tuple at each step. Experimental results showed that gCAS is the best performing model for multi-act prediction. The CAS decoder and the gCAS cell can also be used in a user simulator and gCAS can be applied in the encoder. A few directions for improvement have also been identified: 1) improving the performance on non-critical slots, 2) tuning the decoder with RL, 3) text generation from gCAS. We leave them as future work.