Goal-Embedded Dual Hierarchical Model for Task-Oriented Dialogue Generation

Hierarchical neural networks are often used to model the inherent structure within dialogues. For goal-oriented dialogues, however, these models lack a mechanism for adhering to dialogue goals and neglect the distinct conversational patterns of the two interlocutors. In this work, we propose the Goal-Embedded Dual Hierarchical Attentional Encoder-Decoder (G-DuHA), which centers on goals and captures interlocutor-level disparity while modeling goal-oriented dialogues. Experiments on dialogue generation, response generation, and human evaluations demonstrate that the proposed model generates higher-quality, more diverse, and goal-centric dialogues. Moreover, we apply goal-oriented dialogue generation as data augmentation for task-oriented dialog systems and achieve better performance.


Introduction
Modeling a probability distribution over word sequences is a core topic in natural language processing, with language modeling as a flagship problem, mostly tackled via recurrent neural networks (RNNs) (Mikolov and Zweig, 2012; Melis et al., 2017; Merity et al., 2018).
Recently, dialogue modeling has drawn much attention, with applications to response generation (Serban et al., 2016a; Li et al., 2016b; Asghar et al., 2018) and data augmentation (Yoo et al., 2019). It is inherently different from language modeling, as a conversation is conducted turn by turn. Serban et al. (2016b) imposed a hierarchical structure on the encoder-decoder to model these utterance-level and dialogue-level structures, followed by (Serban et al., 2016c; Chen et al., 2018; Le et al., 2018a).
However, when modeling dialogues in which two interlocutors center around one or more goals, these systems generate utterances with the greatest likelihood but without a mechanism for sticking to the goals. This makes them go off the rails and fail to model context switching of goals. Most of the generated conversations become off-goal dialogues, with utterances that are irrelevant or contradictory to the goals, rather than on-goal dialogues. The differences are illustrated in Figure 1.

[Figure 1: An example on-goal dialogue — a user books a train to Boston leaving tomorrow and arriving before 5 pm, asks only for the price ($100), then inquires about a hotel and a museum, with the agent responding to each goal in turn.]

Besides, the two interlocutors in a goal-oriented dialogue often play distinct roles: one has requests or goals to achieve, and the other provides the necessary support. When modeled by a single hierarchical RNN, this interlocutor-level disparity is neglected, and constant context switching of roles can reduce the capacity for tracking conversational flow and long-term temporal structure.
To resolve the aforementioned issues in modeling goal-oriented dialogues, we propose the Goal-Embedded Dual Hierarchical Attentional Encoder-Decoder (G-DuHA), which tackles the problems via three key features. First, the goal embedding module summarizes one or more goals of the current dialogue as goal contexts for the model to focus on across the conversation. Second, the dual hierarchical encoder-decoders naturally capture interlocutor-level disparity and represent the interactions of the two interlocutors. Finally, attention is introduced at the word and dialogue levels to learn temporal dependencies more easily.
In this work, our contributions are as follows: we propose the goal-embedded dual hierarchical attentional encoder-decoder (G-DuHA), the first model able to focus on goals and capture interlocutor-level disparity while modeling goal-oriented dialogues. With experiments on dialogue generation, response generation, and human evaluations, we demonstrate that our model generates higher-quality, more diverse, and goal-focused dialogues. In addition, we leverage goal-oriented dialogue generation as data augmentation for task-oriented dialogue systems, achieving better performance.

Related Work
Dialogues are sequences of utterances, which are themselves sequences of words. For modeling or generating dialogues, hierarchical architectures are usually used to capture their conversational nature. Traditionally, language models are also used for modeling and generating word sequences. Once goal-oriented dialogues are generated, they can be used for data augmentation in task-oriented dialogue systems. We review related work in these fields.
Dialogue Modeling. To model the conversational context and turn-by-turn structure of dialogues, Serban et al. (2016b) devised the hierarchical recurrent encoder-decoder (HRED). Reinforcement and adversarial learning were then adopted to improve naturalness and diversity (Li et al., 2016b, 2017a). Integrating HRED with latent-variable models such as the variational autoencoder (VAE) (Kingma and Welling, 2014) forms another line of advances (Serban et al., 2016c; Zhao et al., 2017; Park et al., 2018; Le et al., 2018b). However, these systems are not designed for task-oriented dialogue modeling, as goal information is not considered. Moreover, they capture conversations between two interlocutors with a single encoder-decoder.
Task-Oriented Dialogue Systems. Conventional task-oriented dialog systems entail a sophisticated pipeline (Raux et al., 2005; Young et al., 2013) with components including spoken language understanding (Chen et al., 2016; Mesnil et al., 2015; Gupta et al., 2019), dialog state tracking (Henderson et al., 2014; Mrksic et al., 2017), and dialog policy learning (Su et al., 2016; Gašić and Young, 2014). Building a task-oriented dialogue agent via end-to-end approaches has been explored recently (Li et al., 2017b; Wen et al., 2017). Although several conversational datasets have been published recently (Gopalakrishnan et al., 2019; Henderson et al., 2019), the scarcity of annotated conversational data remains a key problem when developing a dialog system. This motivates us to model task-oriented dialogues with goal information in order to achieve controlled dialogue generation for data augmentation.

Model Architecture
Given a set of goals and the seed user utterance, we want to generate a goal-centric or on-goal dialogue that follows the domain contexts and corresponding requests specified in goals. In this section, we start with the mathematical formulation, then introduce our proposed model, and describe our model's training objective and inference.
At training time, K dialogues {D_1, ..., D_K} are given, where each D_i is associated with N_i goals g_i = {g_i1, g_i2, ..., g_iN_i}. A dialogue D_i consists of M turns of utterances between a user u and a system agent s, (w_u1, w_s1, w_u2, w_s2, ...), where w_u1 is a word sequence w_u1,1, w_u1,2, ..., w_u1,N_u1 denoting the user's first utterance.
Task-oriented dialogue modeling aims to approximate the conditional probability of the user's or agent's next utterance given the previous turns and the goals. This can be further decomposed over generated words, e.g. for the user's t-th utterance:

p(w_ut | w_u<t, w_s<t, g_i) = ∏_n p(w_ut,n | w_ut,<n, w_u<t, w_s<t, g_i)
To model goal-oriented dialogues between two interlocutors, we propose the Goal-Embedded Dual Hierarchical Attentional Encoder-Decoder (G-DuHA), as illustrated in Fig. 2. Our model comprises a goal embedding module, dual hierarchical RNNs, and attention mechanisms, detailed below.

Goal Embedding Module
We represent each goal in {g_i1, g_i2, ...} using a simple binary (multi-hot) encoding followed by a feed-forward network (FFN). A goal can specify a domain such as hotel or a request such as price or area; Table 2 shows a few examples. Our goal embedding module, an FFN learned during training, converts the binary encoding of each goal into a goal embedding. If multiple goals are present, all goal embeddings are summed element-wise to form the final goal embedding. The output of the goal embedding module has the same dimensionality as the context RNN's hidden state and is used as the initial state for all layers of all context RNNs, informing the model about the set of goals to focus on.
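To make the computation concrete, here is a minimal NumPy sketch of the goal embedding module, assuming a toy goal vocabulary and a 2-layer FFN; the vocabulary, names, and dimensions are illustrative, not the paper's actual implementation.

```python
import numpy as np

# Toy goal vocabulary of domains and request slots (illustrative only).
GOAL_VOCAB = ["hotel", "train", "attraction", "price", "area", "book"]

def encode_goal(goal_slots):
    """Multi-hot encoding of a single goal over the goal vocabulary."""
    v = np.zeros(len(GOAL_VOCAB))
    for slot in goal_slots:
        v[GOAL_VOCAB.index(slot)] = 1.0
    return v

def goal_embedding(goals, W1, b1, W2, b2):
    """Map each goal through a 2-layer FFN, then sum element-wise.

    The result has the context RNN's hidden size and serves as the
    initial state of every context RNN layer.
    """
    embs = [np.tanh(encode_goal(g) @ W1 + b1) @ W2 + b2 for g in goals]
    return np.sum(embs, axis=0)
```

By construction, the embedding of a goal set is the element-wise sum of the per-goal embeddings, so adding a goal shifts the initial context state additively.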

Dual Hierarchical Architecture
The hierarchical encoder-decoder structure (Serban et al., 2016b) is designed for utterance-level and context-level modeling. With a single encoder, context RNN, and decoder, the same module processes input utterances, tracks contexts, and generates responses for both interlocutors.
In task-oriented dialogues, however, the roles are distinct: the user requests information or makes reservations to achieve the goals in mind, while the system agent provides the necessary help. To model this interlocutor-level disparity, we extend the architecture into a dual one involving two hierarchical RNNs, each serving one role in the dialogue.

Attention Mechanisms
At the utterance level, as the importance of words can be context-dependent, our model uses the first hidden layer's state of the context RNN as the query for attention mechanisms (Bahdanau et al., 2015; Xu et al., 2015) to build an utterance representation. A feed-forward network computes the attention scores, taking the concatenation of the query and an encoder output as input.
At the dialogue level, for faster training, our model has a skip connection that adds the context RNN's raw output to its input to form the final output c_t. To model long-term dependencies, an attention module summarizes all previous contexts into a global context vector. Specifically, a feed-forward network takes the current context RNN's output c_t as the query and all previous context outputs from both context RNNs as keys and values to compute attention scores. The global context vector is then concatenated with c_t to form the final context vector for the decoder to consume.
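The skip connection and dialogue-level attention described above can be sketched as follows; for brevity a dot-product scorer stands in for the paper's FFN scorer, and all dimensions are toy values.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def final_context(raw_output, rnn_input, prev_contexts):
    """Skip connection plus dialogue-level attention.

    raw_output, rnn_input: the context RNN's output and its input (same size).
    prev_contexts: earlier context outputs from both context RNNs.
    A dot-product scorer replaces the paper's FFN scorer for brevity.
    """
    c_t = raw_output + rnn_input                     # residual/skip connection
    scores = np.array([c_t @ k for k in prev_contexts])
    weights = softmax(scores)
    global_ctx = weights @ np.stack(prev_contexts)   # weighted sum of contexts
    return np.concatenate([c_t, global_ctx])         # consumed by the decoder
```

Note that if all previous contexts are identical, the global context collapses to that shared vector regardless of the attention weights.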

Objective
For predicting the end of a dialogue, we apply a feed-forward network over the final context vector for a binary prediction. Our training objective maximizes the log-likelihood of the utterances together with the end-of-dialogue labels:

L(θ) = Σ_{i=1}^{K} Σ_{t=1}^{M_i} [ log p_θ(w_it | w_i<t, g_i) + log p_θ(e_it | w_i≤t, g_i) ]   (3)

where our model p_θ has parameters θ, M_i is the number of turns, and e_it is 0 if dialogue D_i continues and 1 if it terminates at turn t.

Generation
At dialogue generation time, a set of goals {g_i1, g_i2, ...} and a seed user utterance w_u1 are given. Our model then generates a conversation simulating the interactions between a user and an agent seeking to complete all given goals. The generation process terminates when the end-of-dialogue prediction outputs a positive or the maximum number of turns is reached.
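The generation loop can be sketched as follows; `user_step` and `agent_step` are hypothetical stand-ins for the two role-specific hierarchical encoder-decoders, each returning an utterance and an end-of-dialogue flag.

```python
def generate_dialogue(user_step, agent_step, seed_utterance, max_turns=22):
    """Alternate the two role-specific models until the end-of-dialogue
    classifier fires or the turn cap is reached.

    user_step / agent_step: fn(history) -> (utterance, end_flag).
    The agent responds first, since the user's seed utterance is given.
    """
    history = [seed_utterance]
    for turn in range(max_turns):
        step = agent_step if turn % 2 == 0 else user_step
        utterance, end_of_dialogue = step(history)
        history.append(utterance)
        if end_of_dialogue:
            break
    return history
```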

Experiments
We evaluate our approach on dialogue generation and response generation as well as by humans. Ablation studies and an extrinsic evaluation that leverages dialogue generation as a data augmentation method are reported in the subsequent section.

Dataset
Experiments are conducted on MultiWOZ (Budzianowski et al., 2018), a task-oriented human-human written conversation dataset and the largest publicly available one in the field. Dialogues in the dataset span diverse topics, one or more goals, and multiple domains such as restaurant, hotel, and train. It consists of 8423 training dialogues and 1000 validation and 1000 test dialogues, with on average 15 turns per dialogue and 14 tokens per turn.

Baselines
We compare our approach against four baselines: (i) LM+G: As long-established methods for language generation, we adopt an RNN language model (LM) with 3-layer 200-hidden-unit GRU (Cho et al., 2014) incorporating our goal embedding module as a baseline, which has goal information but no explicit architecture for dialogues.
(ii) LM+G-XL: To show the possible impact of model size, a larger LM with a 3-layer, 450-hidden-unit GRU is adopted as another baseline.
(iii) Hierarchical recurrent encoder-decoder (HRED) (Serban et al., 2016b): As the prominent model for dialogues, we use HRED as the baseline that has a dialogue-specific architecture but no goal information. The encoder, decoder, and context RNN are 2-layer 200-hidden-unit GRUs.
(iv) HRED-XL: We also use a larger HRED with 350 hidden units for all GRUs as a baseline to show the impact of model size.

Implementation Details
In all experiments, we adopt the delexicalized form of dialogues as shown in Table 2, with a vocabulary size of 4258, including slots and special tokens. The maximum number of turns and the maximum sequence length are capped at 22 and 36, respectively. G-DuHA uses 2-layer, 200-hidden-unit GRUs for all encoders, decoders, and context RNNs. All feed-forward networks have 2 layers with nonlinearity. The FFNs for encoder attention and end-of-dialogue prediction have 50 hidden units; the FFNs for context attention and goal embedding have 100 and 200 hidden units, respectively. We simply use greedy decoding for utterance generation.
All models initialize embeddings with pretrained fastText vectors trained on wiki-news (2018) and are trained with the Adam optimizer (2015), using early stopping to prevent overfitting. To mitigate the discrepancy between training and inference, during training we pick either the predicted or the ground-truth utterance as the current input, uniformly at random.
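This training-time input selection (a uniform 50/50 variant of scheduled sampling) can be sketched as below; the function name and interface are illustrative.

```python
import random

def pick_decoder_input(ground_truth_utt, predicted_utt, rng=random):
    """Choose the ground-truth or model-predicted utterance uniformly at
    random as the next turn's input, reducing train/inference mismatch."""
    return ground_truth_utt if rng.random() < 0.5 else predicted_utt
```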

Evaluation Metrics
We employ a number of automatic metrics as well as human evaluations to benchmark the competing models on quality, diversity, and goal focus.
Quality. BLEU (Papineni et al., 2002), BLEU-4 by default, is a word-overlap measure against references commonly used in dialogue generation work to evaluate quality (2015; 2016b; 2016a; 2017; 2018). Lower-order n-gram scores B1, B2, and B3 are also reported.
Diversity. D-1, D-2, D-U: the distinctiveness denotes the number of unique unigrams, bigrams, and utterances, each normalized by its total count (Li et al., 2016a; Xu et al., 2018). These metrics are commonly used to evaluate dialogue diversity.
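The distinct-n diversity metrics reduce to simple ratios; a small sketch over a corpus of generated utterances (whitespace tokenization assumed for illustration):

```python
def distinct_n(utterances, n):
    """D-n: number of unique n-grams divided by total n-gram count."""
    total, unique = 0, set()
    for utt in utterances:
        tokens = utt.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

def distinct_utterances(utterances):
    """D-U: fraction of utterances that are unique strings."""
    return len(set(utterances)) / len(utterances) if utterances else 0.0
```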
Goal Focus. A set of slots, such as address, is extracted from the reference dialogues as multi-label targets. The slots generated in a model's output dialogues are the predictions. We use multi-label precision, recall, and F1-score as surrogates for measuring goal focus and achievement.
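Treating the slot sets as multi-label targets, these surrogates reduce to set-based precision, recall, and F1; a sketch (slot extraction itself is assumed to be done upstream):

```python
def slot_metrics(reference_slots, generated_slots):
    """Precision (goal deviation), recall (goal achievement), and F1
    (overall focus) over extracted slot sets."""
    true_pos = len(reference_slots & generated_slots)
    precision = true_pos / len(generated_slots) if generated_slots else 0.0
    recall = true_pos / len(reference_slots) if reference_slots else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```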
Human Evaluation. The side-by-side human preference study evaluates dialogues on goal focus, grammar, natural flow, and non-redundancy.

Dialogue Generation Results
For dialogue generation (Li et al., 2016b), a model is given one or more goals and one user utterance as seed inputs to generate an entire dialogue auto-regressively. Table 1 summarizes the evaluation results. On quality measures, G-DuHA significantly outperforms the baselines, implying that it can carry out higher-quality dialogues. Moreover, the goal-embedded LMs perform better than the HREDs, showing the benefits of our goal embedding module. No significant performance difference is observed across the model-size variants.
On diversity, G-DuHA is on par with the goal-embedded LMs, and both outperform HRED significantly. Of 1000 generated dialogues, HRED delivers highly repetitive outputs with only 4 to 6% distinct utterances, whereas 25% of G-DuHA's utterances are unique.
For recovering the slots in reference dialogues, precision indicates the degree of goal deviation, recall reflects goal achievement, and F1 measures overall focus. The goal-embedded LM is the best on precision and F1, with G-DuHA performing comparably. However, even though the LM can better mention the slots in dialogue generation, its utterances are often associated with the wrong role. That is, role confusions are common, such as the user making reservations for the agent, as in Table 2. The reason may be that the LM handles the task like paragraph generation, without an explicit design for the conversational hierarchy.
Overall, compared to the baselines, G-DuHA generates high-quality dialogues with sufficient diversity while still adhering to the goals.
Qualitative Comparison. Table 2 compares generated dialogues from different models given one to three goals to focus on. Clearly, models with the goal embedding module adhere to the given goals, such as "book" or "no book" and requesting "price" or "entrance fee", while HRED fails to do so. They also correctly cover all required domain contexts without diversion, such as switching from attraction inquiry to restaurant booking and then to taxi-calling. Without goals, HRED's generated dialogues often detour to an irrelevant domain context, such as shifting to train booking when only a hotel inquiry is required.
For the goal-embedded LM, a serious issue is role confusion, as the LM often wrongly switches between the user and the agent, as shown in Table 2. The issue stems from a single wrong end-of-utterance prediction but affects the rest of the dialogue and degrades the overall quality. More generated dialogues are reported in the appendix.

                Wins    Losses  Ties
Goal Focus      82.33%  6.00%   11.67%
Grammar         6.00%   5.00%   89.00%
Natural Flow    26.00%  15.00%  59.00%
Non-redundancy  35.34%  6.33%   58.33%

Table 6: Human evaluations, G-DuHA vs. HRED. 100 pairs of generated dialogues along with goals are given to three domain experts for side-by-side comparisons.


Response Generation Results

For response generation, the goals and all previous reference utterances are given to a model to generate the next utterance as a response. Tables 3 and 4 summarize the results. G-DuHA outperforms the others on quality and goal-focus measures and rivals LM-goal on diversity for both agent and user responses. For goal focus, LM-goal performs well on precision but falls short on recall. This could be because it generates much shorter user and agent responses on average.
Interestingly, since the previous contexts are given, LM-goal performs only slightly better than HRED. This implies that hierarchical structures capturing longer dependencies can make up for the lack of goal information in response generation. However, as illustrated in Table 12, HRED can still fail to predict a switch of domain context, e.g. from train to restaurant, which explains the performance gaps. Another intriguing observation is that incorporating the goal embedding module significantly boosts response diversity and goal focus.
Comparing the performance between agent and user response generation, we observe that models can achieve higher quality and diversity but lower goal focus when modeling agent's responses. These might result from the relatively consistent utterance patterns but diverse slot types used by an agent. More generated responses across different models are presented in the appendix.

Human Evaluation Results
For human evaluation, we conduct side-by-side comparisons between G-DuHA and HRED, the widely used baseline in the literature, on the dialogue generation task. We consider four criteria: goal focus, grammar, natural flow, and non-redundancy. Goal focus evaluates whether the dialogue is closely related to the preset goals; grammar evaluates whether the utterances are well-formed and understandable; natural flow evaluates whether the flow of the dialogue is logical and fluent; and non-redundancy evaluates whether the dialogue avoids unnecessary repetition of already-mentioned information. 100 pairs of generated dialogues from G-DuHA and HRED, along with their goals, are randomly placed against each other. For each goal and pair of dialogues, three domain experts were instructed to mark their preference on each of the four criteria as win / lose / tie between the dialogues.

Table 6 presents the results. G-DuHA shows a substantial advantage on goal focus, with 82.33% wins over HRED, confirming the benefit of our goal embedding module. G-DuHA also significantly outperforms HRED on natural flow and non-redundancy. These results likely stem from G-DuHA's ability to generate much more diverse utterances while concentrating on the current goals. An especially interesting observation is that when multiple goals are given, G-DuHA not only stays focused on each individual goal but also generates intuitive transitions between goals, so that the flow of a dialogue is natural and coherent. An example is shown in Table 2, where the G-DuHA-generated dialogue switches towards the taxi goal while maintaining reference to the previously mentioned attraction and restaurant goals: ". . . i also need a taxi to commute between the 2 places . . . ". We also observe that both G-DuHA and HRED perform well on grammaticality; the generated samples across all RNN-based models are almost free of grammar errors.

Ablation Studies
The ablation studies are reported in Table 7 for dialogue generation and in Table 8 for response generation to investigate the contribution of each module. Here we evaluate user and agent response generation together. Goal Embedding Module. First, we examine the impact of the goal embedding module. When unplugging it, we observe significant and consistent drops in quality, diversity, and goal focus for both the dialogue and response generation tasks. For the dialogue generation task, the drops are substantially larger, which resonates with our intuition, as the model then only has the first user utterance as input context to follow.
With no guideline on what to achieve or what conversational flow to follow, dialogues generated by HRED often have similar flows and low diversity. These results demonstrate that our goal embedding module is critical for generating higher-quality, goal-centric dialogues with much more diversity.
Dual Hierarchical Architecture. We also evaluate the impact of the dual hierarchical architecture. Comparisons on both the dialogue and response generation tasks show a consistent trend: applying the dual architecture for interlocutor-level modeling leads to a solid increase in utterance diversity as well as moderate improvements in quality and goal focus.
The results echo our motivation as two interlocutors in a goal-oriented dialogue scenario exhibit distinct conversational patterns and this interlocutor-level disparity should be modeled by separate hierarchical encoder-decoders.
For the dialogue-level attention module, there is no significant effect on diversity or goal focus in either task, but it marginally improves overall utterance quality, as the BLEU scores rise slightly.

Data Augmentation via Dialogue Generation
As an exemplar extrinsic evaluation, we leverage goal-oriented dialogue generation as data augmentation for task-oriented dialogue systems. Dialogue state tracking (DST) is used as our evaluation task; it is a critical component of task-oriented dialogue systems (Young et al., 2013) and has been studied extensively (Henderson et al., 2014; Mrksic et al., 2017; Zhong et al., 2018). In DST, given the current utterance and the dialogue history, a dialogue state tracker determines the state of the dialogue, which comprises a set of requests and joint goals. At each user turn, the user informs the system of a set of turn goals to fulfill, e.g. inform(area=south), or turn requests asking for more information, e.g. request(phone). The joint goal is the collection of all turn goals up to the current turn.
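The relationship between turn goals and the joint goal can be sketched as a simple accumulation; this is illustrative only, as real DST also tracks requests and richer value updates.

```python
def joint_goal_states(turn_goals):
    """Joint goal at each turn: all slot-value pairs informed so far,
    with later turns overwriting earlier values for the same slot."""
    state, states = {}, []
    for turn in turn_goals:
        state.update(turn)              # e.g. {"area": "south"}
        states.append(dict(state))      # snapshot per turn
    return states
```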
We use the state-of-the-art Global-Locally Self-Attentive Dialogue State Tracker (GLAD) (Zhong et al., 2018) as our benchmark model and the WoZ restaurant reservation dataset (Wen et al., 2017;Zhong et al., 2018) as our benchmark dataset, which is commonly used for the DST task.
The dataset consists of 600 training, 200 validation, and 400 test dialogues. We use the first utterances from 300 training dialogues and sample restaurant-domain goals to generate dialogues, whose states are annotated by a rule-based method. Table 9 summarizes the augmentation results. Augmentation with G-DuHA achieves an improvement over the vanilla dataset and outperforms HRED on turn requests while being comparable on the joint goal. For the goal-embedded LM, which struggles with role confusion, the augmentation actually hurts overall performance.

Conclusion
We introduced the goal-embedded dual hierarchical attentional encoder-decoder (G-DuHA) for goal-oriented dialogue generation. G-DuHA is able to generate higher-quality and goal-focused dialogues as well as responses with decent diversity and non-redundancy. Empirical results show that the goal embedding module plays a vital role in the performance improvement and the dual architecture can significantly enhance diversity.
We demonstrated one application of goal-oriented dialogue generation through a data augmentation experiment, though the proposed model is applicable to other conversational AI tasks, which remain to be investigated in the future.
As shown in the experiments, a language model coupled with goal embedding suffers from role-switching or role confusion. It would also be interesting to dig deeper with visualizations (Kessler, 2017) and quantify the impact on the quality, diversity, and goal focus metrics.

A Appendix
A.1 More Generated Dialogues for Qualitative Comparison

In Table 10 and Table 11, we list and compare more generated dialogues from the different models given a variable number of goals to focus on.
The domains, the slots to request such as phone or postcode, and whether the user should make a reservation are specified in the goals. We use bold to emphasize whether the assigned goals are achieved or missed, such as a request for a correct slot or a wrong domain. Underlined bold denotes role confusions.

A.2 More Generated Responses for Qualitative Comparison
Responses are generated by a model given the goals and all previous utterances. In Table 12, we present more generated responses for qualitative comparison. We observe that HRED is unable to correctly switch between goal contexts, as illustrated by the second example.