Multi-Domain Dialogue Acts and Response Co-Generation

Generating fluent and informative responses is of critical importance for task-oriented dialogue systems. Existing pipeline approaches generally predict multiple dialogue acts first and use them to assist response generation. There are at least two shortcomings with such approaches. First, the inherent structures of multi-domain dialogue acts are neglected. Second, the semantic associations between acts and responses are not taken into account for response generation. To address these issues, we propose a neural co-generation model that generates dialogue acts and responses concurrently. Unlike those pipeline approaches, our act generation module preserves the semantic structures of multi-domain dialogue acts and our response generation module dynamically attends to different acts as needed. We train the two modules jointly using an uncertainty loss to adjust their task weights adaptively. Extensive experiments are conducted on the large-scale MultiWOZ dataset and the results show that our model achieves very favorable improvement over several state-of-the-art models in both automatic and human evaluations.


Introduction
Task-oriented dialogue systems aim to assist people with services such as hotel reservation and ticket booking through natural language conversations. Recent years have seen a rapid growth of interest in this task from both academia and industry (Bordes et al., 2017; Budzianowski et al., 2018; Wu et al., 2019). A standard architecture of these systems generally decomposes the task into several subtasks, including natural language understanding, dialogue state tracking (Zhong et al., 2018) and natural language generation (Su et al., 2018). They can be modeled separately and combined into a pipeline system. Figure 1 shows a dialogue example, from which we can see that the natural language generation subtask can be further divided into dialogue act prediction and response generation (Zhao et al., 2019). While the former is intended to predict the next action(s) based on the current conversational state and database information, the latter is used to produce a natural language response based on the action(s).
Figure 1: An example of dialogue from the MultiWOZ dataset, where the dialogue system needs to generate a natural language response according to current belief state and related database records.
In order for dialogues to be natural and effective, responses should be fluent, informative, and relevant. Nevertheless, current sequence-to-sequence models often generate uninformative responses like "I don't know" (Li et al., 2016a), which prevents the dialogue from continuing or even leads to failure. Some researchers (Pei et al., 2019; Mehri et al., 2019) sought to combine multiple decoders into a stronger one to avoid such responses, while others (Wen et al., 2015; Zhao et al., 2019) represent dialogue acts in a global, static vector to assist response generation.
As pointed out in prior work, dialogue acts can be naturally organized in hierarchical structures, a property that has yet to be explored seriously. Take two acts, station-request-stars and restaurant-inform-address, as an example. While the first act rarely appears in real-world dialogues, the second occurs far more often. Moreover, a single dialogue turn can mention multiple dialogue acts, which requires the model to attend to different acts for different sub-sequences. Thus, a global vector can neither capture the inter-relationships among acts nor offer enough flexibility for response generation, especially when more than one act is mentioned.
To overcome the above issues, we treat dialogue act prediction as another sequence generation problem like response generation and propose a co-generation model to generate them concurrently. Unlike those classification approaches, act sequence generation not only preserves the interrelationships among dialogue acts but also allows close interactions with response generation. By attending to different acts, the response generation module can dynamically capture salient acts and produce higher-quality responses. Figure 2 demonstrates the difference between the classification and the generation approaches for act prediction.
As for training, most joint learning models rely on hand-crafted weights or weights tuned on development sets (Liu and Lane, 2017; Mrkšić et al., 2017; Rastogi et al., 2018). The challenge here is to combine two sequence generators with different vocabularies and sequence lengths: training is sensitive to the task weight, and finding an optimal weight is nontrivial. To address this issue, we opt for an uncertainty loss (Kendall et al., 2018) to adaptively adjust the weight according to task-specific uncertainty. We conduct extensive studies on a large-scale task-oriented dataset to evaluate the model. The experimental results confirm the effectiveness of our model with very favorable performance over several state-of-the-art methods.
The contributions of this work include:
• We model dialogue act prediction as a sequence generation problem, which allows act structures to be exploited for the prediction.
• We propose a co-generation model to generate act and response sequences jointly, with an uncertainty loss used for adaptive weighting.
• Experiments on MultiWOZ verify that our model outperforms several state-of-the-art methods in automatic and human evaluations.

Related Work
Dialogue act prediction and response generation are generally treated as closely related in dialogue systems research (Zhao et al., 2019), where dialogue act prediction is conducted first and its output is used for response generation. Each dialogue act can be treated as a triple (domain-action-slot) and all acts together are represented in a one-hot vector (Wen et al., 2015; Budzianowski et al., 2018). Such a sparse representation makes the act space very large. To overcome this issue, act structures have been taken into account by representing the dialogue acts with level-specific one-hot vectors, where each dimension of the vectors is predicted by a binary classifier.
To improve response generation, Pei et al. (2019) proposed to learn different expert decoders for different domains and acts, and combined them with a chair decoder. Mehri et al. (2019) applied a cold-fusion method (Sriram et al., 2018) to combine their response decoder with a language model. Zhao et al. (2019) treated dialogue acts as latent variables and used reinforcement learning to optimize them. Reinforcement learning was also applied to find optimal dialogue policies in task-oriented dialogue systems (Su et al., 2017; Williams et al., 2017) or to obtain higher dialog-level rewards in chatting (Li et al., 2016b; Serban et al., 2017). In addition, acts have been predicted explicitly with a compact act graph representation, with hierarchical disentangled self-attention employed to control response text generation.
Unlike those pipeline architectures, joint learning approaches try to explore the interactions between act prediction and response generation. A large body of research in this direction uses a shared user utterance encoder and trains natural language understanding jointly with dialogue state tracking (Mrkšić et al., 2017; Rastogi et al., 2018). Liu and Lane (2017) proposed to train a unified network for two subtasks of dialogue state tracking, i.e., knowledge base operation and response candidate selection. Jiang et al. (2019) showed that joint learning of dialogue act and response benefits representation learning. These works generally demonstrate that jointly learning the subtasks of dialogue systems allows them to improve each other and the overall system performance.

Architecture
Let $T = \{U_1, R_1, \ldots, U_{t-1}, R_{t-1}, U_t\}$ denote the dialogue history in a multi-turn conversational setting, where $U_i$ and $R_i$ are the $i$-th user utterance and system response, respectively. $D = \{d_1, d_2, \ldots, d_n\}$ includes the attributes of related database records for the current turn. The objective of a dialogue system is to generate a natural language response $R_t = y_1 y_2 \ldots y_n$ of $n$ words based on the current belief state and database attributes.
In our framework, dialogue acts and response are co-generated based on the transformer encoder-decoder architecture (Vaswani et al., 2017). A standard transformer includes a multi-head attention layer that encodes a value $V$ according to the attention weights from query $Q$ to key $K$, followed by a position-wise feed-forward network ($G_f$). In what follows we use $F(Q, K, V)$ to denote the transformer.
Encoder We use $E = \mathrm{Emb}([T; D])$ to represent the concatenated word embeddings of dialogue history $T$ and database attributes $D$. The transformer $F(Q, K, V)$ is then used to encode $E$ and output its hidden state $H^e$.
Decoder At each time step $t$ of response generation, the decoder first computes a self-attention $h^r_t$ over the already-generated words $y_{1:t-1}$ (Equation 3), where $e^r_{t-1}$ is the embedding of the $(t-1)$-th generated word and $e^r_{1:t-1}$ is the embedding matrix of $e^r_1$ to $e^r_{t-1}$. Cross-attention from $h^r_t$ to the dialogue history $T$ is then executed (Equation 4). The resulting vectors of Equations 3 and 4, $h^r_t$ and $c^r_t$, are concatenated and mapped to a distribution of vocabulary size to predict the next word.
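Under the definitions above, the transformer block and the encoder and decoder steps can be written out as follows. This is a sketch reconstructed from the surrounding prose: $\mathrm{MultiHead}$ stands for the standard multi-head attention of Vaswani et al. (2017), residual connections and layer normalization are left implicit, and $W_r$ is an assumed name for the output projection to the vocabulary.

```latex
\begin{align}
F(Q, K, V) &= G_f\big(\mathrm{MultiHead}(Q, K, V)\big) \tag{1}\\
H^e &= F(E, E, E) \tag{2}\\
h^r_t &= F\big(e^r_{t-1},\, e^r_{1:t-1},\, e^r_{1:t-1}\big) \tag{3}\\
c^r_t &= F\big(h^r_t,\, H^e,\, H^e\big) \tag{4}\\
p(y_t \mid y_{1:t-1}) &= \mathrm{softmax}\big(W_r\,[h^r_t;\, c^r_t]\big) \tag{5}
\end{align}
```

In each call the first argument is the query, so Equation 3 attends from the latest generated word to the generated prefix, and Equation 4 attends from the decoder state to the encoded history and database attributes.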

The MARCO Approach
Based on the above encoder-decoder architecture, our model is designed to consist of three components, namely, a shared encoder, a dialogue act generator, and a response generator. As shown in Figure 3, instead of predicting each act token individually and separately from response generation, our model aims to generate act sequence and response concurrently in a joint model which is optimized by the uncertainty loss (Kendall et al., 2018).

Dialogue Acts Generation
Dialogue acts can be viewed as a semantic plan for response generation. As shown in Figure 2, they can be naturally organized in hierarchical structures, including domain level, action level, and slot level. Most existing methods treat dialogue acts as triples represented in one-hot vectors and predict the vector values with binary classifiers (Wen et al., 2015; Budzianowski et al., 2018). Such representations ignore the inter-relationships and associations among acts, domains, actions and slots. For example, the slot area may appear in more than one domain. Unlike them, we model the prediction of acts as a sequence generation problem, which takes into consideration the structures of acts and generates each act token conditioned on its previously generated tokens. In this approach, different domains are allowed to share common slots and the search space of dialogue acts is greatly reduced. The act generation starts from a special token "SOS" and produces dialogue acts $A = a_1 a_2 \ldots a_n$ sequentially. During training, the act sequence is organized by domain, action and slot, while items at each level are arranged in dictionary order, where identical items are merged. When decoding each act token, we first represent the current belief state with an embedding vector $v_b$ and add it to each act word embedding $e^a_t$ (Equation 6). Finally, the decoder of Section 3.2 is used to generate hidden states $H^a$ and act tokens accordingly.
Figure 3: Architecture of the proposed model for act and response co-generation, where act and response generators share the same encoder. The response generator is allowed to attend to different act hidden states as needed using dynamic act attention. The two generators are trained jointly and optimized by the uncertainty loss.
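The construction of the training act sequence described above can be sketched as follows. This is one plausible reading of the description, not the released implementation: the function name `linearize_acts`, the tuple layout, and the example acts are hypothetical, and the sequence is assumed to list all domains first, then all actions, then all slots.

```python
from typing import Iterable, List, Tuple

# A dialogue act as a (domain, action, slot) triple.
Act = Tuple[str, str, str]


def linearize_acts(acts: Iterable[Act], sos: str = "SOS") -> List[str]:
    """Turn a set of act triples into one flat target sequence.

    Items are grouped by level (domain, then action, then slot). Within
    each level, identical items are merged and the rest are sorted in
    dictionary order.
    """
    acts = list(acts)
    sequence = [sos]
    for level in range(3):  # 0 = domain, 1 = action, 2 = slot
        unique_items = {act[level] for act in acts}
        sequence.extend(sorted(unique_items))
    return sequence


# Hypothetical turn with three acts; "restaurant", "inform" and "area"
# each occur more than once and are merged.
example_acts = [
    ("restaurant", "inform", "address"),
    ("hotel", "request", "area"),
    ("restaurant", "inform", "area"),
]
print(linearize_acts(example_acts))
```

Because the slot `area` appears once in the output regardless of how many domains use it, the target vocabulary stays small compared with one class per full triple.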

Acts and Response Co-Generation
Dialogue acts and responses are closely related in dialogue systems. On one hand, system responses are generated based on dialogue acts. On the other, their shared information can improve each other through joint learning.
Shared Encoder Our dialogue act generator and response generator share the same encoder and input, but use different masking strategies on the input so that each focuses on different information. In particular, only the current utterance is kept for act generation, while the entire dialogue history is used for response generation.
Dynamic Act Attention A response usually corresponds to more than one dialogue act in multi-domain dialogue systems. Nevertheless, existing methods mostly use a static act vector to represent all the acts, and add the vector to each response token representation. They ignore the fact that different subsequences of a response may need to attend to different acts. To address this issue, we compute dynamic act attention $o^r_t$ from the response to the acts when generating a response word (Equation 7), where $h^r_t$ is the current hidden state produced by Equation 3. Then, we combine $o^r_t$ and $h^r_t$ with the response-to-history attention $c^r_t$ (Equation 4) to estimate the probabilities of the next word.
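The contrast between a static act vector and dynamic act attention can be illustrated with a small sketch. It is a simplification under stated assumptions: a single attention head with scaled dot-product scores, no learned projections and no feed-forward layer, whereas the model uses the full transformer block $F$. The function names and the toy vectors are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def act_attention(h_t: Vector, act_states: List[Vector]) -> Tuple[Vector, List[float]]:
    """Attend from one response state h_t to all act hidden states.

    Returns the attended act vector (playing the role of o_t) and the
    attention weights over the act tokens.
    """
    scale = math.sqrt(len(h_t))
    scores = [sum(q * k for q, k in zip(h_t, a)) / scale for a in act_states]
    weights = softmax(scores)
    dim = len(act_states[0])
    o_t = [sum(w * a[i] for w, a in zip(weights, act_states)) for i in range(dim)]
    return o_t, weights


def static_act_vector(act_states: List[Vector]) -> Vector:
    """Static alternative: one average vector shared by every response step."""
    n = len(act_states)
    return [sum(a[i] for a in act_states) / n for i in range(len(act_states[0]))]


# Two toy act states and two response states that each resemble one of them.
acts = [[1.0, 0.0], [0.0, 1.0]]
o_first, w_first = act_attention([4.0, 0.0], acts)
o_second, w_second = act_attention([0.0, 4.0], acts)
print(w_first, w_second, static_act_vector(acts))
```

The two response states receive different attended act vectors, while the static average is identical for every step. In the full model the attended vector would then be concatenated with $h^r_t$ and $c^r_t$ before the output softmax.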

Uncertainty Loss
The cross-entropy function is used to measure the generation losses, $L_a(\theta)$ and $L_r(\theta)$, of dialogue acts and responses, respectively (Equations 9 and 10), where the ground-truth tokens of acts and response of each turn are represented by $A^*$ and $Y^*$, and the predicted tokens by $A$ and $Y$. To optimize the above functions jointly, a general approach is to compute a weighted sum of the two losses with a weight $\alpha$ (Equation 11). However, dialogue acts and responses differ substantially in sequence length and vocabulary size, making the weight $\alpha$ difficult to tune. Instead, we opt for an uncertainty loss (Kendall et al., 2018) to adjust it adaptively (Equation 12), where $\sigma_1$ and $\sigma_2$ are two learnable parameters. The advantage of this uncertainty loss is that it models the homoscedastic uncertainty of each task and provides a task-dependent weight for multi-task learning (Kendall et al., 2018). Our experiments also confirm that it leads to more stable weighting than the traditional approach (Section 6.3).
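The losses described above can be written out as a sketch. Several details are assumptions rather than statements from the text: the cross-entropy terms are shown in teacher-forced form with the conditioning on encoder inputs omitted, the weighted sum is assumed to use complementary weights $\alpha$ and $1 - \alpha$ (consistent with sweeping the weight from 0 to 1), and the constants in the last line follow the general form of the homoscedastic uncertainty loss in Kendall et al. (2018).

```latex
\begin{align}
L_a(\theta) &= -\sum_{j} \log p_\theta\big(a_j = a^*_j \mid a^*_{1:j-1}\big) \tag{9}\\
L_r(\theta) &= -\sum_{t} \log p_\theta\big(y_t = y^*_t \mid y^*_{1:t-1}\big) \tag{10}\\
L(\theta) &= \alpha\, L_a(\theta) + (1 - \alpha)\, L_r(\theta) \tag{11}\\
L(\theta, \sigma_1, \sigma_2) &= \frac{1}{2\sigma_1^2}\, L_a(\theta) + \frac{1}{2\sigma_2^2}\, L_r(\theta) + \log \sigma_1^2 \sigma_2^2 \tag{12}
\end{align}
```

In this form a task with a large $\sigma$ is down-weighted, while the $\log$ term penalizes letting both $\sigma_1$ and $\sigma_2$ grow without bound.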

Dataset and Metrics
MultiWOZ 2.0 (Budzianowski et al., 2018) is a large-scale multi-domain conversational dataset consisting of thousands of dialogues in seven domains. For fair comparison, we use the same validation set and test set as previous studies (Zhao et al., 2019; Budzianowski et al., 2018), each set including 1000 dialogues (see footnote 2). We use the Inform Rate and Request Success metrics to evaluate dialogue completion, with the former measuring whether a system has provided an appropriate entity and the latter assessing whether it has answered all requested attributes. Besides, we use BLEU (Papineni et al., 2002) to measure the fluency of generated responses. To measure the overall system performance, we compute a combined score, (Inform Rate + Request Success) × 0.5 + BLEU, as in previous work (Budzianowski et al., 2018; Mehri et al., 2019; Pei et al., 2019).
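The combined score is simple arithmetic and can be computed as below. The function name and the input numbers are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from any system.

```python
def combined_score(inform_rate: float, request_success: float, bleu: float) -> float:
    """Combined score: (Inform Rate + Request Success) * 0.5 + BLEU.

    All three inputs are expected on the same 0-100 scale.
    """
    return (inform_rate + request_success) * 0.5 + bleu


# Hypothetical values: (90 + 80) * 0.5 + 20 = 105
print(combined_score(inform_rate=90.0, request_success=80.0, bleu=20.0))
```

Since the two completion metrics are averaged while BLEU is added in full, a one-point BLEU change moves the combined score as much as a two-point change in either completion metric.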

Implementation Details
The implementation (see footnote 3) runs on a single Tesla P100 GPU with a batch size of 512. The dimension of word embeddings and the hidden size are both set to 128. We use a 3-layer transformer with 4 heads for the multi-head attention layer. For decoding, we use a beam size of 2 to search for optimal results, and apply trigram avoidance (Paulus et al., 2018) to reduce trigram-level repetition. During training, we first train the act generator for 10 epochs as a warm-up and then optimize the uncertainty loss with the Adam optimizer (Kingma and Ba, 2015).
2 There are only five domains (restaurant, hotel, attract, taxi, train) of dialogues in the test set as the other two (hospital, police) have insufficient dialogues.
3 https://github.com/InitialBug/MarCo-Dialog
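Trigram avoidance during decoding can be sketched as a filter on candidate tokens: a candidate is rejected if appending it would reproduce a trigram that already occurs in the hypothesis. The helper names below are hypothetical, and the sketch filters a plain candidate list rather than masking probabilities inside a beam search.

```python
from typing import List, Sequence


def creates_repeated_trigram(prefix: Sequence[str], candidate: str) -> bool:
    """Return True if appending `candidate` repeats a trigram already in `prefix`."""
    if len(prefix) < 2:
        return False  # no trigram can be formed yet
    new_trigram = (prefix[-2], prefix[-1], candidate)
    seen = {tuple(prefix[i:i + 3]) for i in range(len(prefix) - 2)}
    return new_trigram in seen


def allowed_candidates(prefix: Sequence[str], candidates: Sequence[str]) -> List[str]:
    """Keep only the candidates that do not introduce a repeated trigram."""
    return [c for c in candidates if not creates_repeated_trigram(prefix, c)]


# "the hotel is" already occurs, so "is" is blocked after "... the hotel".
hypothesis = "the hotel is in the hotel".split()
print(allowed_candidates(hypothesis, ["is", "has"]))
```

In a beam search, the same check would be applied per hypothesis before scores are compared, so each beam keeps its own set of seen trigrams.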

Baselines
A few mainstream models are used as baselines for comparison with our neural co-generation model (MARCO), grouped into three categories:
• Sequential Act. Since our model does not rely on BERT, to make a fair comparison with HDSA, we design the experiments from two aspects to ensure they have the same dialogue act inputs for response generation. First, the act sequences produced by our co-generation model are converted into one-hot vectors and fed to HDSA. Second, the one-hot act vectors predicted by BERT are transformed into act sequences and passed to our model as inputs.

Overall Results
The overall results are shown in Table 1. The results confirm the success of MARCO in modeling act prediction as a generation problem and training it jointly with response generation. Another observation is that despite its strong overall performance, MARCO shows inferior BLEU performance to the two HDSA models. The reason behind this is analyzed in the human evaluation (Section 7), which shows that our model often generates responses that differ from the references but are favored by human judges.
The performance of our model across different domains is also compared against HDSA. The average number of turns is 8.93 for single-domain dialogues and 15.39 for multi-domain dialogues (Budzianowski et al., 2018). We also report results on an updated version of MultiWOZ 2.0. As shown in Table 2, the overall results are consistent with those on MultiWOZ 2.0.

Further Analysis
More thorough studies and analysis are conducted in this section, trying to answer three questions: (1) How is the performance of our act generator in comparison with existing classification methods?
(2) Can our joint model successfully build semantic associations between acts and responses? (3) How does the uncertainty loss contribute to our co-generation model?

Dialogue Act Prediction
To evaluate the performance of our act generator, we compare it with several baseline methods from prior work, including BiLSTM, Word-CNN, and a 3-layer Transformer. We use MARCO to represent our act generator trained jointly with the response generator, and Transformer (GEN) to denote our act generator without joint training. From Table 3, we notice that the separate generator, Transformer (GEN), performs much better than BiLSTM and Word-CNN, but is only comparable to Transformer. After being trained jointly with the response generator, however, MARCO shows the best performance, confirming the effect of co-generation.
Table 4: Results of response generation by joint and pipeline models, where Pipeline 1 and Pipeline 2 represent two pipeline approaches without and with dynamic act attention, respectively. The performance of HDSA, as the best pipeline model, is provided for comparison.

Joint vs. Pipeline
To study the influence of joint training and dynamic act attention on response generation, we implement two pipeline approaches for comparison. We first train our act generator separately from response generation. Then, we keep its parameters fixed and train the response generator. The first baseline (Pipeline 1) is created by replacing the dynamic act attention (Equation 7) with an average of the act hidden states, while the second baseline (Pipeline 2) uses the dynamic act attention. As shown in Table 4, Pipeline 2 with dynamic act attention is superior to Pipeline 1 without it in all metrics, but inferior to the joint approach. Our joint model also surpasses the current state-of-the-art pipeline system HDSA, even though HDSA uses BERT. We find that by utilizing sequential acts, the dynamic act attention mechanism helps the response generator capture local information by attending to different acts. An illustrative example is shown in Figure 5, where the response generator attends to local information such as "day" and "stay" as needed when generating a response asking about picking a different day or a shorter stay. We reckon that by utilizing sequential acts, response generation benefits in two ways. First, the dynamic act attention allows the generator to attend to different acts when generating a subsequence. Second, the joint training makes the two stages interact with each other, easing the error propagation of pipeline systems.
Figure 5: An illustrative example of the dynamic act attention mechanism. A response (row) subsequence can attend to the act (column) token "day" or "stay" as needed when generating a response asking about picking a different day or a shorter stay.

Uncertainty Loss
We opt for an uncertainty loss to optimize our joint model, rather than a traditional weighted-sum loss. To illustrate their difference, we conduct an experiment on the development set. For the traditional loss (Equation 11), we run one experiment for each weight from 0 to 1 in steps of 0.1. Note that since the weights in the uncertainty loss, $\sigma_1$ and $\sigma_2$, are not hyperparameters but are learned internally at each batch, we only record the best score within each round without giving the values of $\sigma_1$ and $\sigma_2$. As shown in Figure 6, the uncertainty loss can learn adaptive weights with consistently superior performance.
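The difference between the two weighting schemes can be sketched numerically. The sketch assumes the uncertainty loss takes the form $\frac{1}{2\sigma_1^2} L_a + \frac{1}{2\sigma_2^2} L_r + \log \sigma_1^2 \sigma_2^2$, following the general formulation of Kendall et al. (2018), and parameterizes each task by $s = \log \sigma^2$ for convenience. The function names and loss values are hypothetical, and in actual training the parameters would be updated by gradient descent together with the model rather than solved in closed form.

```python
import math


def weighted_sum_loss(loss_act: float, loss_resp: float, alpha: float) -> float:
    """Traditional joint loss with a hand-set weight alpha in [0, 1]."""
    return alpha * loss_act + (1.0 - alpha) * loss_resp


def uncertainty_loss(loss_act: float, loss_resp: float,
                     log_var_act: float, log_var_resp: float) -> float:
    """Assumed uncertainty loss, with log_var = log(sigma^2) per task."""
    weight_act = 0.5 * math.exp(-log_var_act)    # 1 / (2 * sigma_1^2)
    weight_resp = 0.5 * math.exp(-log_var_resp)  # 1 / (2 * sigma_2^2)
    return weight_act * loss_act + weight_resp * loss_resp + log_var_act + log_var_resp


def optimal_log_var(loss: float) -> float:
    """Minimizer of 0.5 * exp(-s) * loss + s over s, for a fixed loss value."""
    return math.log(loss / 2.0)


# The grid that a weighted-sum sweep would cover: 0.0, 0.1, ..., 1.0.
alphas = [i / 10 for i in range(11)]

# Hypothetical per-task losses of different magnitude.
loss_act, loss_resp = 4.0, 1.0
s_act, s_resp = optimal_log_var(loss_act), optimal_log_var(loss_resp)
print(0.5 * math.exp(-s_act), 0.5 * math.exp(-s_resp))  # effective task weights
```

Under this assumed form, the effective weight of each task at the optimum is the inverse of its current loss, so the task with the larger loss is automatically down-weighted and the weighting adapts as the losses change during training.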

Human Evaluation
We conduct a human study to evaluate our model by crowd-sourcing. For this purpose we randomly selected 100 sample dialogues (742 turns in total) from the test dataset and constructed two groups of systems for comparison: MARCO vs. HDSA and MARCO vs. Human Response, where Human Response means the reference (ground-truth) responses. Responses generated by each group were randomly assigned in pairs to 3 judges, who ranked them according to their completion and readability (Zhang et al., 2019). Completion measures whether the response correctly answers a user query, including relevance and informativeness. Readability reflects how fluent, natural and consistent the response is. The results of this study are shown in Figure 7, where "Win", "Tie" or "Lose" indicate the proportions of cases in which our MARCO system wins over, ties with or loses to its counterpart, respectively. From the results we note that MARCO outperforms HDSA and Human Response in completion, and ties with HDSA in 94% of cases in readability while underperforming Human Response. Overall, MARCO is superior to HDSA and comparable with Human Response. We further analyzed the bad cases of our model in readability and found that our model suffers slightly from token-level repetition, a problem that can be addressed by methods like the coverage mechanism (Mi et al., 2016; Tu et al., 2016). In completion, our model can understand the users' needs and tends to provide them with more relevant information, so that they can finish their goals in fewer turns.
We present two examples in Figure 8. In the first example, the user requests the hotel type while HDSA ignores it. The user requests to book one ticket in the second example, yet both HDSA and Human Response ask about the number once again.
In contrast, our model directly answers the questions with correct information. To sum up, MARCO successfully improves the dialogue system by generating relevant and informative responses.
Figure 8: Two examples to show that MARCO successfully improves the dialogue system by generating relevant and informative responses.

Conclusion
In this paper, we presented a novel co-generation model for dialogue act prediction and response generation in task-oriented dialogue systems. Unlike previous approaches, we modeled act prediction as a sequence generation problem to exploit the semantic structures of acts and trained it jointly with response generation via dynamic attention from response generation to act prediction. To train this joint model, we applied an uncertainty loss for adaptive weighting of the two tasks. Extensive studies were conducted on a large-scale task-oriented dataset to evaluate the proposed model, and the results confirm its effectiveness with very favorable performance over several state-of-the-art methods.