Dual Dynamic Memory Network for End-to-End Multi-turn Task-oriented Dialog Systems

Existing end-to-end task-oriented dialog systems struggle to dynamically model long dialog context for interactions and to effectively incorporate knowledge base (KB) information into dialog generation. To overcome these limitations, we propose a Dual Dynamic Memory Network (DDMN) for multi-turn dialog generation, which maintains two core components: a dialog memory manager and a KB memory manager. The dialog memory manager dynamically expands the dialog memory turn by turn and keeps track of the dialog history with an updating mechanism, which encourages the model to filter out irrelevant dialog history and memorize important newly arriving information. The KB memory manager shares the structural KB triples throughout the whole conversation, and dynamically extracts KB information with a memory pointer at each turn. Experimental results on three benchmark datasets demonstrate that DDMN significantly outperforms strong baselines in terms of both automatic and human evaluation. Our code is available at https://github.com/siat-nlp/DDMN.


Introduction
Task-oriented dialog systems are designed to help users achieve specific goals with natural language, such as weather inquiry or restaurant reservation. Compared with traditional pipeline methods (Williams and Young, 2007; Young et al., 2013), end-to-end approaches have recently gained much attention (Zhao et al., 2017; Eric and Manning, 2017), since they free task-oriented dialog systems from manually designed pipeline modules and can be automatically scaled to new domains. Recently, sequence-to-sequence (Seq2Seq) models have dominated the study of end-to-end task-oriented dialog systems (Bordes et al., 2017). Different from typical Seq2Seq models for open-domain dialog systems, successful conversations in task-oriented dialog systems rely heavily on both the dialog history and a domain-specific knowledge base (KB). To effectively incorporate KB information and perform knowledge-based reasoning, memory augmented models have been proposed (Wu et al., 2019), which model the dialog history and the KB knowledge as a bag of words in a flat memory.
Despite the remarkable progress of previous studies, current memory based models for multi-turn task-oriented dialog systems still suffer from the following limitations. First, existing methods concatenate the dialog utterances of the current turn and all previous turns as a whole, ignoring the reasoning performed in previous turns and being incapable of dynamically tracking long-term dialog states. These methods introduce much noise, since the concatenated context is lengthy and redundant (Zhang et al., 2018). Taking the dialog in Table 1 as an example, when answering the user question in the sixth turn, it is difficult for the model to infer from such a long concatenated dialog context that the name of the restaurant is "cocum". Therefore, previous models struggle in situations that require many rounds of interaction to complete a specific task. Second, previous studies tend to confound the dialog history with the KB knowledge and store them in a flat memory (Sukhbaatar et al., 2015; Eric and Manning, 2017; Madotto et al., 2018). The shared memory suffers from encoding the dialog context and KB information with a single strategy, which makes it hard to reason efficiently over the two different types of data, especially when the memory is large.
To alleviate the aforementioned limitations, we propose a Dual Dynamic Memory Network (DDMN), which keeps track of the long-term dialog history and KB knowledge with separate memories. Specifically, we leverage a dialog memory manager to effectively maintain history utterances with a dialog history memory and a dialog state memory. The dialog history memory keeps fixed to store the representation of dialog context throughout the whole conversation, and the dialog state memory keeps updated at each turn to track the flow of history information and capture proper information of current turn for generation. We leverage a KB memory manager containing a KB memory and a KB memory pointer to effectively track KB knowledge. The KB memory stores the KB triples using an end-to-end memory network (Sukhbaatar et al., 2015) and is shared across the entire conversation. The KB memory pointer softly attends to the KB memory at each turn, and guides the model to select appropriate KB entries in decoding.
Our main contributions can be summarized as follows.
• We propose a Dual Dynamic Memory Network (DDMN) for task-oriented dialog systems, which dynamically keeps track of long dialog context for multi-turn interactions and effectively incorporates KB knowledge into generation.
• We employ separate memories to model dialog context and KB triples. The iterative interactions between the two kinds of memories make the decoder focus on relevant dialog context and KB facts for generating coherent and human-like dialogs.
• The experimental results on three public datasets show that DDMN achieves impressive results compared to the existing methods. More importantly, our model is able to maintain more sustained conversations than the compared methods with the increase of dialog turns.

Model Description
Let {(u_1, s_1), . . . , (u_M, s_M)} denote a dialog, where M is the number of dialog turns, and u_i and s_i denote the user utterance and system response at the i-th turn, respectively. We are given a dialog context X consisting of the previous turns {d_j}_{j=1}^{m−1} and the current user utterance u_m, where d_j = (u_j, s_j) and m denotes the current dialog turn, together with a sequence of KB triples B = {b_1, b_2, . . . , b_l}, where l is the number of KB triples and each triple is of the form ⟨subject, relation, object⟩. The objective of task-oriented dialog generation is to generate a proper response Y = {y_1, y_2, . . . , y_n} word by word.
As shown in Figure 1, our proposed DDMN architecture consists of four components: a dialog encoder, a dialog memory manager, a KB memory manager, and a decoder. We elaborate on each component below.

Dialog Encoder
To overcome the challenge of modeling long dialog context in multi-turn conversations, the dialog encoder encodes the dialog history utterances turn by turn. Specifically, for the first turn, the input to the encoder is u_1. For the j-th (j > 1) turn, the input is {s_{j−1}, u_j}, i.e., the concatenation of the system response of the previous turn and the user utterance of the current turn. Concretely, the input of the dialog encoder at each turn is a sequence of tokens x = (x_1, x_2, . . . , x_n), where n is the number of tokens. We first convert each token into a word vector through a randomly initialized trainable embedding matrix, and then employ a bidirectional gated recurrent unit (BiGRU) (Chung et al., 2014) to encode the input into hidden states:

h_t = BiGRU(e(x_t), h_{t−1}), (1)

where e(x_t) is the embedding of the token x_t. We take the concatenation of the forward and backward hidden states as the output of the encoder, denoted as H = (h_1, . . . , h_n), which is then passed into the dialog memory manager for global management.
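To make the turn-by-turn input scheme concrete, here is a minimal sketch (illustrative only; the function and variable names are ours, not the authors'):

```python
# Illustrative sketch of the dialog encoder's input construction, assuming
# utterances are plain token lists. Turn 1 feeds u_1 alone; turn j > 1 feeds
# the concatenation of the previous system response s_{j-1} and the current
# user utterance u_j.
def encoder_input(turn_idx, user_utts, sys_resps):
    """Return the token sequence encoded by the BiGRU at a 1-indexed turn."""
    if turn_idx == 1:
        return list(user_utts[0])
    return list(sys_resps[turn_idx - 2]) + list(user_utts[turn_idx - 1])
```

The hidden states produced for this sequence are then appended to both dialog memories.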

Dialog Memory Manager
The dialog memory manager maintains a dialog history memory and a dialog state memory, which are both initialized with the encoder hidden states of the first turn. For j-th (j > 1) turn, both the dialog history memory and the dialog state memory are "expanded" by concatenating the hidden states of j-th turn. The two memories are maintained throughout the whole conversation, with the dialog history memory keeping fixed to store the representation of dialog context of all previous turns. The dialog state memory keeps updated at each turn, which aims to track the flow of history information and capture proper information of current turn for response generation.
Generally, the decoder applies a GRU network to generate the response word by word. At step t, the decoder state s_t is updated by:

s_t = GRU(e(y_{t−1}), s_{t−1}), (2)

where e(y_{t−1}) is the embedding of the previous word y_{t−1}. Here, s_t is regarded as a "query" vector q_t, which is used to attend to the dialog state memory and obtain the weighted context representation c_t by reading from the dialog history memory. The dialog state memory is then updated with q_t and c_t over R rounds. Formally, let K ∈ R^{N×d} and V ∈ R^{N×d} be the dialog state memory and the dialog history memory, respectively, where N is the number of memory slots and d is the dimension of the vector in each slot. The detailed memory updating operations at round r (r ∈ [1, R]) are introduced below.
Dialog State Memory Addressing The addressing operation specifies the normalized weights assigned to the memory slots in K^{(r−1)} (the dialog state memory at the (r−1)-th round), which forms an attention vector ã_t^{(r)}:

ã_{t,j}^{(r)} = Softmax((q_t)^T k_j^{(r−1)}),

where k_j^{(r−1)} is the j-th slot in K^{(r−1)} at time step t.

Dialog History Memory Reading
The reading operation reads from the dialog history memory V to obtain the context representation c̃_t^{(r)}, computed as the attention-weighted sum of the history memory slots:

c̃_t^{(r)} = Σ_j ã_{t,j}^{(r)} v_j,

where v_j is the j-th slot in V.

Dialog State Memory Updating Inspired by the read-write operations (Meng et al., 2016; Meng et al., 2018), we define two types of operations for updating the dialog state memory: FORGET and ADD. FORGET is analogous to the forget gate in GRU, which determines the information to be removed from memory slots. Similarly, the ADD operation decides how much current information should be written to the dialog state memory as the added content.
Specifically, we first deploy another GRU network to imitate the decoder at round r and obtain an "intermediate" hidden state s̃_t^{(r)}, which is used to update the dialog state memory. The "intermediate" dialog state memory after the FORGET operation is given by:

K̃^{(r)} = K^{(r−1)} ⊙ (1 − ã_t^{(r)} ⊗ F_t^{(r)}), with F_t^{(r)} = Sigmoid(W_F s̃_t^{(r)}),

where W_F ∈ R^{d×d} is a learnable parameter, ⊙ denotes element-wise multiplication, and ⊗ denotes the outer product. The dialog state memory after the ADD operation is given by:

K^{(r)} = K̃^{(r)} + ã_t^{(r)} ⊗ A_t^{(r)}, with A_t^{(r)} = tanh(W_A s̃_t^{(r)}),

where W_A ∈ R^{d×d} is a learnable parameter. After R rounds of updating, the dialog state memory has been modified by the FORGET and ADD operations. Due to the "expansion" of the dialog state/history memories along with the increase of dialog turns, the dialog memory manager is able to dynamically keep track of the long-term dialog state.
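One addressing-reading-updating round can be sketched in numpy as follows (a minimal illustration under our own naming, not the authors' implementation; the GRU that produces the intermediate state s̃ is omitted):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_round(K, V, q, W_F, W_A, s_tilde):
    """One round over the dialog memories.

    K, V: (N, d) dialog state / dialog history memories.
    q: (d,) query (decoder state); s_tilde: (d,) intermediate state.
    W_F, W_A: (d, d) learnable FORGET / ADD weights.
    """
    a = softmax(K @ q)                          # addressing: weights over state-memory slots
    c = a @ V                                   # reading: weighted context from history memory
    F = 1.0 / (1.0 + np.exp(-(W_F @ s_tilde)))  # FORGET gate (sigmoid)
    A = np.tanh(W_A @ s_tilde)                  # ADD content
    K = K * (1.0 - np.outer(a, F))              # erase, weighted by the attention
    K = K + np.outer(a, A)                      # write the new content
    return K, c, a
```

Each slot is erased and rewritten in proportion to how strongly the current query attends to it, so unattended history is left untouched.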

KB Memory Manager
To incorporate external knowledge effectively, the KB memory manager adopts end-to-end multi-hop memory networks (MemNN) (Sukhbaatar et al., 2015) for encoding structural KB information. Given the KB triples B = {b_1, b_2, . . . , b_l}, each entry b_i ∈ B is represented as a triple ⟨subject, relation, object⟩. The KB memory is then represented as a set of trainable embedding matrices C = (C^1, . . . , C^{K+1}) with C^k ∈ R^{V×d}, where K is the number of memory hops and V is the vocabulary size of the KB. It is noteworthy that our KB memory is shared across the entire conversation. Formally, an initial query vector q^1 is used as the reading head; it loops over K hops and computes the attention weights at each hop k as:

p_i^k = Softmax((q^k)^T c_i^k),

where c_i^k is the embedding of triple b_i under C^k. Note that p^k ∈ R^l is a soft memory attention that decides the memory relevance with respect to the query vector. The KB memory manager then reads the memory output o^k as the weighted sum over c^{k+1} and updates the query vector q^{k+1}:

o^k = Σ_i p_i^k c_i^{k+1}, q^{k+1} = q^k + o^k.

To further strengthen the ability to select correct KB entries, we introduce a KB memory pointer Ptr_kb, inspired by Wu et al. (2019). Note that the proposed pointer Ptr_kb is passed to the decoder turn by turn. Suppose Ptr_kb is denoted as a sequence of pointers (σ_1, σ_2, . . . , σ_l); each pointer is formulated by:

σ_i = Sigmoid((q^K)^T c_i^K),

where q^K and c_i^K are the query vector and the memory content at the last hop, respectively. We add an auxiliary classification task to train Ptr_kb. We first define the corresponding label Ptr_label = (g_1, g_2, . . . , g_l) by checking whether the object words in the KB memory exist in the expected system response Y, where g_i = 1 if object(b_i) ∈ Y and g_i = 0 otherwise. The KB memory pointer is then trained with a binary cross-entropy loss:

Loss_p = −Σ_{i=1}^{l} [g_i log σ_i + (1 − g_i) log(1 − σ_i)].
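A compact numpy sketch of the multi-hop KB read and the memory pointer (illustrative only; `C` holds the already-embedded triples per hop, and the names are ours):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kb_memory_read(C, q1):
    """Multi-hop read over the KB memory.

    C: list of K+1 embedded memory matrices, each (l, d), one row per triple.
    q1: (d,) initial query. Returns the final query and the KB memory pointer.
    """
    q = q1
    for k in range(len(C) - 1):
        p = softmax(C[k] @ q)                    # soft attention over the l triples at hop k
        o = p @ C[k + 1]                         # weighted read from the next hop's embeddings
        q = q + o                                # query update
    ptr = 1.0 / (1.0 + np.exp(-(C[-1] @ q)))     # per-triple pointer: sigmoid at the last hop
    return q, ptr

def pointer_loss(ptr, labels):
    """Binary cross-entropy against the 0/1 pointer labels g_i."""
    eps = 1e-9
    return -np.mean(labels * np.log(ptr + eps) + (1 - labels) * np.log(1 - ptr + eps))
```

Unlike the hop attention, the pointer is a per-triple sigmoid rather than a softmax, so several KB entries can be flagged as relevant at once.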

Decoder
The decoder generates a response word by word. In particular, a word at time step t is either generated from the vocabulary or copied from one of the two memories (dialog history memory or KB memory). First, the decoder employs a GRU network defined in Eq.
(2) for generation. The generation distribution over the vocabulary, P_g(y_t), is obtained by feeding the decoder state s_t and c_t (the reading output of the dialog history memory at the last round) into a linear layer followed by softmax: P_g(y_t) = Softmax(W_o [s_t; c_t]), where W_o is a learnable parameter. Second, following the copy mechanism (Gulcehre et al., 2016), in which attention scores are viewed as the probability of copying, we adopt the addressing result of the dialog state memory at the last round as the attention score a_{t,j}; the copy distribution over the dialog history memory is thus P_c(y_t = w) = Σ_{j: x_j = w} a_{t,j}, where x_j is the j-th token in the dialog history.
Third, we use the KB memory pointer P tr kb to dynamically access the KB memory, and then employ the decoder state s t defined in Eq.
(2) to attend over the KB memory:

β_{t,j} = σ_j · Softmax_j(v_b^T tanh(W_b s_t + U_b c_j^K)),

where v_b, W_b, U_b are parameters to be learned and c_j^K is the KB memory content in the j-th position at the last hop. Therefore, the copy distribution over the KB memory is given by P_kb(y_t = w) = Σ_{j: object(b_j) = w} β_{t,j}. Note that we copy the object word once a KB memory position (i.e., a KB triple ⟨subject, relation, object⟩) is pointed to.
We use a soft gate g_1 to control whether a word is generated from the vocabulary or copied from the dialog history memory, combining P_g(y_t) and P_c(y_t):

g_1 = Sigmoid(W_2 [s_t; c_t] + b_2), P_con(y_t) = g_1 P_g(y_t) + (1 − g_1) P_c(y_t). (13)

Moreover, we use another gate g_2, computed analogously to g_1 from the decoder state and the KB memory reading, to obtain the final output distribution P(y_t) by combining P_kb(y_t) and P_con(y_t):

P(y_t) = g_2 P_kb(y_t) + (1 − g_2) P_con(y_t). (14)

Therefore, the decoder loss is the cross-entropy between the output distribution P(y_t) and the reference distribution p_t, denoted as Loss_d = −Σ_t p_t log P(y_t).
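Equations (13)-(14) combine the three distributions with two scalar gates; a tiny sketch (with illustrative naming):

```python
# Combine the generation and copy distributions with two soft gates, as in
# Eqs. (13)-(14). Each P_* is a probability distribution over the same
# vocabulary; g1 and g2 are scalars in (0, 1).
def final_distribution(P_g, P_c, P_kb, g1, g2):
    P_con = [g1 * pg + (1 - g1) * pc for pg, pc in zip(P_g, P_c)]
    return [g2 * pkb + (1 - g2) * pcon for pkb, pcon in zip(P_kb, P_con)]
```

Because each component sums to one and the gates are convex weights, the result is still a valid distribution.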

Training
We train our model by minimizing the weighted sum of the two losses:

Loss = Loss_d + γ Loss_p,

where γ is a hyper-parameter controlling the impact of Loss_p. Since minimizing the cross-entropy loss does not always produce the best results due to exposure bias (Ranzato et al., 2015), we further adopt the self-critical sequence training (SCST) algorithm (Rennie et al., 2017), a reinforcement learning procedure whose reward is obtained with the current model.
Specifically, we produce two separate output sequences at each training iteration: (1) the sampled output y^s, obtained by sampling from the output distribution P(y_t) at each decoding time step, and (2) the baseline output ŷ, obtained by maximizing the output distribution with greedy search. We define r(y) as the reward function, computed by comparing an output sequence y with the ground-truth sequence using the evaluation metric of our choice. The SCST loss is given by:

Loss_rl = (r(ŷ) − r(y^s)) Σ_t log P(y_t^s).

Thus, minimizing Loss_rl is equivalent to maximizing the conditional likelihood of the sampled sequence y^s when it obtains a higher reward than the baseline ŷ, which improves the reward expectation of our model.
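The self-critical loss reduces to one line once the two decodes and their rewards are available (a sketch with our own names; r(·) would be, e.g., a BLEU/entity-F1 mix):

```python
# Self-critical sequence training loss: weight the sampled sequence's total
# log-likelihood by how much the greedy baseline out-rewards the sample.
# Minimizing this raises the likelihood of samples that beat the baseline.
def scst_loss(logprobs_sampled, reward_sampled, reward_baseline):
    return (reward_baseline - reward_sampled) * sum(logprobs_sampled)
```

When the sample's reward exceeds the baseline's, the coefficient is negative and gradient descent pushes the sampled tokens' log-probabilities up.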
Datasets

The In-Car Assistant dataset consists of 3,031 multi-turn dialogs in three distinct domains: schedule (Sch.), weather (Wea.), and navigation (Nav.). This dataset has an average of 2.6 turns per dialog, and its KB information is complex, with an average of 62.3 triples per dialog. We follow the data processing in Madotto et al. (2018).

Implementation Details
Our model is trained in an end-to-end manner using the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 5 × 10^−4. The shared dimensionality of the embeddings and the GRU hidden units is selected from {128, 256}. Both the number of rounds R and the number of hops K are set to 3. The dropout rate is set within [0.1, 0.4], and the hyper-parameter γ in the loss function is set to 1. The hyper-parameters are tuned by grid search over the validation set using the BLEU score as the metric. We select the model with the best BLEU score as the initialization for SCST training, and use the weighted sum of BLEU and entity F1 as the reward metric. During decoding, we use beam search with the beam size selected from {1, 2, 4}.

Baselines
We compare our model with several existing end-to-end task-oriented dialog systems: (1) Seq2Seq/+Attn, a standard seq2seq model with and without attention over the input context (Luong et al., 2015); (2) Ptr-Unk, a seq2seq model with a copy mechanism to copy unknown words during generation (Gulcehre et al., 2016); (3) Mem2Seq, a memory network based approach with multi-hop attention for attending over dialog history and KB triples; (4) BossNet, a bag-of-sequences memory network that disentangles the language model from KB incorporation in task-oriented dialogs; (5) MLM, a multi-level memory network that models dialog context and KB results separately (Reddy et al., 2019); (6) GLMP, a memory network with a global memory pointer and a local memory pointer to strengthen the copy ability (Wu et al., 2019).

Evaluation Metrics
Following previous work (Wu et al., 2019), we evaluate our model and the baselines on two automatic metrics: BLEU (Papineni et al., 2002) and entity F1. BLEU calculates the n-gram overlap between the generated response and the gold response. Entity F1 is computed by micro-averaging the precision and recall over KB entities across the entire set of system responses, which evaluates the model's ability to generate the relevant entities from the provided KBs to achieve specific tasks. It is noteworthy that entity F1 indicates the task-completion ability of the model, since KB entities are key to the dialog task.
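For concreteness, micro-averaged entity F1 can be computed as below (a sketch; we assume the entities in each response have already been extracted and canonicalized):

```python
# Micro-averaged entity F1: pool true positives, false positives, and false
# negatives over all responses, then compute precision, recall, and F1 once.
def entity_f1(predicted, gold):
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        pred_set, ref_set = set(pred), set(ref)
        tp += len(pred_set & ref_set)
        fp += len(pred_set - ref_set)
        fn += len(ref_set - pred_set)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Micro-averaging weights every entity occurrence equally, so responses with many entities influence the score more than entity-free chitchat turns.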

Quantitative Results
Automatic Evaluation Table 2 shows the evaluation results on the different datasets. We can observe that our framework achieves state-of-the-art performance in terms of BLEU and overall entity F1 on all datasets. On the In-Car Assistant dataset, BossNet obtains a better entity F1 than Mem2Seq but a much lower BLEU score. By analyzing the responses generated by BossNet, we find that it tends to copy the necessary entity words from the KB, but many of these entity words are out of order compared with the gold response. MLM achieves a much higher BLEU score than previous models, owing to its separate memories for modeling dialog context and KB results. GLMP achieves a strong improvement on BLEU and entity F1, mainly benefiting from its global and local memory pointers, which guide the KB attention and response generation. Our models perform even better than GLMP, which verifies their effectiveness in generating natural and appropriate responses. We observe a similar trend on the Multi-WOZ 2.1 dataset. In particular, our model achieves a significantly higher BLEU score than the other methods. This is because the conversations in Multi-WOZ 2.1 require long-turn interactions, which further shows the effectiveness of our framework in generating correct responses.
On the CamRest dataset, our models substantially and consistently outperform the baseline methods by a noticeable margin. It is noteworthy that DDMN achieves the highest BLEU of 19.3 with a promising entity F1 of 58.9%, while DDMN with SCST obtains the best entity F1 of 59.1% but a lower BLEU score. This may be because optimizing the combined discrete reward metrics in SCST does not guarantee an increase in output quality: BLEU measures n-gram overlap, while entity F1 captures entity words without considering word order. Furthermore, Figure 2 shows how the average BLEU scores of DDMN and several baselines change as the number of dialog turns increases on the CamRest dataset. The BLEU scores of the baseline models decrease sharply as the dialog turns increase, while DDMN achieves much more stable performance, suggesting that the dialog and KB modeling strategies devised in DDMN to keep track of the dialog context, the KB knowledge, and the previous inference process are effective.

Human Evaluation We randomly select 100 dialogs from the test data of the different datasets for human evaluation. Following Wu et al. (2019), we adopt appropriateness (Appr.) and human-likeness (Human-like.) to judge the quality of the generated responses at the turn level. We also adopt goal completion (Goal.) and coherence to judge the completion of the overall dialog task and the overall fluency of the whole dialog at the dialog level. Three annotators independently assign a score from 0 to 5 to each generated output. We report the average rating scores from all annotators in Table 3. The agreement ratio computed with Fleiss' kappa (Fleiss, 1971) is 0.57, indicating moderate agreement. As shown in Table 3, DDMN outperforms the baseline methods at both the turn level and the dialog level, which is consistent with the automatic evaluation.
In particular, DDMN obtains significantly higher goal completion and coherence scores than the compared methods, demonstrating its effectiveness in modeling multi-turn interactions in task-oriented dialog generation.

Case Study
As an intuitive way to show the performance of task-oriented dialog systems, Table 4 reports some responses generated by DDMN and the baseline models. We observe that Mem2Seq tends to generate repeated or inappropriate responses. For example, the responses in the first three turns generated by Mem2Seq are very similar in both content and sentence structure. GLMP performs much better than Mem2Seq, but its performance deteriorates as the dialog turns increase; e.g., GLMP fails to extract the correct key entities in the third and fourth turns. Compared with GLMP, DDMN is able to generate more appropriate and natural responses even in the last few turns of the conversation. This verifies that DDMN is capable of memorizing the key information from previous turns.

Model Ablation
To investigate the effectiveness of each module proposed in our framework, we conduct ablation tests from four aspects; the results are reported in Table 5. First, we remove the dialog state memory, so that only the dialog history memory is used for response generation. Second, we remove the two gates g_1 and g_2 in the decoder separately, where g_1 controls whether a word is generated from the vocabulary or copied from the dialog history memory, and g_2 controls whether a word should be copied from the KB memory. The results show that g_2 contributes more to the performance of DDMN than g_1, since entity words cannot be copied effectively without g_2. Finally, we remove the KB memory pointer (w/o Ptr_kb) during training; the performance drops slightly on both BLEU and entity F1.

Error Analysis
To better understand the limitations of the proposed model, we analyze the errors made by DDMN. Specifically, we randomly select 100 responses generated by DDMN that receive low human evaluation scores on the test set of In-Car Assistant. We identify several causes of the low scores, which can be divided into four categories.
(1) The KB entries in generated responses are incorrect (33%); this occurs especially when the given set of KB triples is large and it is difficult for the model to attend accurately over the KB memory. (2) Generated responses fail to achieve the user goals (30%), since our model sometimes cannot capture the user intent well. (3) The sentence structure of generated responses is not appropriate (21%); this occurs when the model directly returns KB entries even though more information should be asked for. (4) Miscellaneous errors (16%), e.g., generated responses that are grammatically incorrect or conflict with the user input.

Related Work
End-to-end methods have recently shown promising results and attracted increasing attention, since they are easily adapted to new domains. Some approaches model the dialog history as a sequence using recurrent neural networks (Eric and Manning, 2017; Gulcehre et al., 2016), which also force the KB triples into the same sequential pattern and make it hard to perform reasoning over them. To better handle KB triples in task-oriented dialogs, memory network based architectures (Bordes et al., 2017) and their variants (Wu et al., 2017) have been proposed and shown promising results. Mem2Seq (Madotto et al., 2018) and GLMP (Wu et al., 2019) further augmented memory based methods by incorporating a copy mechanism (Gulcehre et al., 2016), which enables the models to copy words from past dialog utterances or from the KB. These methods use a shared memory for the KB triples and the dialog utterances, making it difficult to reason over the memory and to distinguish between the two forms of data.
Recently, several works have employed separate memories for modeling the dialog context and the KB triples (Reddy et al., 2019; Chen et al., 2019). For example, BossNet implicitly disentangled the language model from knowledge incorporation and thus enhanced the ability to copy unseen KB entries. The multi-level memory model (Reddy et al., 2019) represented the KB results with a multi-level memory instead of the triple form. WMM2Seq (Chen et al., 2019) adopted a working memory to interact with a dialog context memory and a KB memory. Nevertheless, these methods still ignore the flow of history information during conversations, making them struggle in long-turn interactions. Different from the aforementioned methods, we propose a dialog memory manager and a KB memory manager to dynamically track the dialog context and the KB triples, respectively.

Conclusion
In this paper, we propose a novel Dual Dynamic Memory Network (DDMN) with a dialog memory manager and a KB memory manager for multi-turn end-to-end task-oriented dialog systems. DDMN encodes dialog context turn by turn and the dialog memory manager dynamically tracks the dialog history. The KB memory manager shares the KB information throughout the whole conversation with a KB memory pointer to softly distill relevant KB entries at each turn. In addition, we leverage self-critical sequence training to boost the performance of DDMN. Extensive experiments on three public dialog datasets demonstrate the superior performance of our model in both automatic and human evaluation.