Task-Oriented Conversation Generation Using Heterogeneous Memory Networks

Incorporating external knowledge into a neural dialogue model is critical for dialogue systems to behave like real humans. Memory networks are a promising way to handle this problem. However, existing memory networks do not perform well when leveraging heterogeneous information from different sources. In this paper, we propose a novel and versatile external memory network called Heterogeneous Memory Networks (HMNs), which simultaneously utilizes user utterances, dialogue history and background knowledge tuples. In our method, historical sequential dialogues are encoded and stored in the context-aware memory, which is enhanced by a gating mechanism, while grounding knowledge tuples are encoded and stored in the context-free memory. During decoding, the decoder augmented with HMNs recurrently selects each word of a response utterance from these two memories and a general vocabulary. Experimental results on multiple real-world datasets show that HMNs significantly outperform the state-of-the-art data-driven task-oriented dialogue models in most domains.


Introduction
Compared with chitchat systems, task-oriented dialogue systems aim at solving tasks in specific domains with grounding knowledge. Though far from handling conversation like a real human, existing task-oriented dialogue systems have shown promising prospects in specific domains; e.g., personal assistants such as Siri and Cortana help people in daily life and business work.
In general, a knowledge-grounded task-oriented dialogue system can be divided into three important components: understanding user utterances, fetching the right knowledge from external storage, and replying with the right answer. As shown in Figure 1, the agent is required to perform point-of-interest navigation. According to the dialogue history, the agent fetches related knowledge base information, in our case represented as tuples (e.g. [hotel keen, poi type, rest stop], which indicates that the point-of-interest type of hotel keen is rest stop), as external knowledge to answer correctly and complete the task.
Figure 1: A multi-turn dialogue example. The upper table shows several n-tuples sampled from the knowledge base. The lower table shows multi-turn dialogues. The agent needs to retrieve appropriate knowledge tuples to generate the proper response.
Traditional pipeline dialogue systems (Yan et al., 2017; Rojas-Barahona et al., 2017) and some end-to-end dialogue systems rely on predefined slot filling labels. Besides consuming human effort, these kinds of systems are difficult to adapt to new domains.
Many works, e.g. (Shang et al., 2015), show that training a fully data-driven end-to-end model is a promising way to build a domain-agnostic dialogue system. These models mostly use attention mechanisms, including memory network techniques, to fetch the most similar knowledge (Sukhbaatar et al., 2015), and then incorporate the grounding knowledge into a seq2seq neural model to generate a suitable response (Madotto et al., 2018).
However, existing memory networks treat information from multiple sources equally, e.g. sequential dialogue history and structured knowledge bases. Two weaknesses therefore arise in such methods: (1) it is difficult to model different types of structured information in only one memory network; (2) it is also difficult to model the effectiveness of knowledge from different sources in such a single memory network. To address these issues, we extend the architecture of memory networks used in a seq2seq neural model.
Our contributions are mainly three-fold:
• We propose a novel seq2seq neural conversation model augmented with Heterogeneous Memory Networks. We first model the sequential dialogue history and the grounding external knowledge with two different kinds of memory networks, and then feed the output of the context-aware memory to the context-free memory to search for the representations of similar knowledge.
• Our context-aware memory network is able to learn context-aware latent representations and store them in memory slots by employing a gating mechanism when encoding the dialogue history and user utterance.
• Experimental results demonstrate that our neural approach significantly outperforms the examined neural methods on automatic metrics, and that the context-aware memory network can learn and store more meaningful representations than the examined memory approaches.

Related Works
End-to-end models use a deep neural network in place of the separate components of pipeline models to generate responses. (Rojas-Barahona et al., 2017) propose a data-driven goal-oriented neural dialogue system that adds database operator and policy network modules to introduce database information and track state; these need an extra labeling step that breaks differentiability. A testbed has also been proposed to break down the strengths and shortcomings of end-to-end dialog systems in goal-oriented applications. These methods treat a dialogue system as the problem of learning a mapping policy from dialogue histories to agents' responses. The booming volume of internet dialogue data lays the foundation for building data-driven models. (Ritter et al., 2011) first applied phrase-based Statistical Machine Translation (Setiawan et al., 2005), treating conversation as a translation problem in which a user utterance is translated into an agent response. (Sutskever et al., 2014) propose the sequence-to-sequence (SEQ2SEQ) architecture and apply it to neural machine translation. SEQ2SEQ has become a general basis for natural language generation tasks, e.g. question answering (Tan et al., 2018) and question generation. By applying the RNN-based encoder-decoder framework to generate responses, models (Shang et al., 2015; Cho et al., 2014b; Luong et al., 2015b) are able to utilize neural networks to learn representations of dialogue histories and generate appropriate responses.
To deal with multi-turn information, (Sordoni et al., 2015) propose a model that represents the whole dialogue history (including the current message) with continuous representations or embeddings of words and phrases to address the challenge of the context-sensitive response generation.
By adding a knowledge base module, recent works (Ghazvininejad et al., 2018) have shown the possibility of training an end-to-end task-oriented dialogue system on the sequence-to-sequence architecture. Ghazvininejad et al. (2018) generalize the SEQ2SEQ approach by conditioning responses on both conversation history and external knowledge, aiming at producing more contextual responses without slot filling.
CopyNet (Gu et al., 2016) and Pointer Networks improve a model's accuracy and its ability to handle out-of-vocabulary words using neural attention. Pointer-Generator networks (See et al., 2017) apply the copy mechanism to a neural generation model, showing that the copy mechanism can improve the quality of text generation. (Dhingra et al., 2017) apply reinforcement learning to make it differentiable.
Recent works on external memory (Graves et al., 2014; Henaff et al., 2016) provide an efficient method of introducing and reasoning over different types of external information. (Sukhbaatar et al., 2015) propose end-to-end memory networks with a multiple-attention-hop model over a possibly large external memory. (Madotto et al., 2018) propose Mem2Seq, which combines the end-to-end memory networks with the idea of pointer networks. (Chen et al., 2018) add a hierarchical structure and a variational memory network to capture both the high-level abstract variations and long-term memories during dialogue tracking. To take care of information from different sources, (Fu and Feng, 2018) propose an attention mechanism that encourages the decoder to actively interact with the memory by taking its heterogeneity into account.
Figure 2: An example of Heterogeneous Memory Networks with two-hop attention. The context-aware memory encodes the dialog history into a context vector $oc^2$, while the context-free memory loads knowledge base information. The output of the context-aware memory is employed as the query vector to the context-free memory.

Proposed Framework
To generate responses using dialogue history and grounding knowledge, we introduce a novel encoder-decoder neural conversation model augmented with Heterogeneous Memory Networks (HMNs). The encoder module adopts a context-aware memory network to better understand the dialogue history and query. The decoder is enhanced with HMNs, which are able to incorporate external knowledge and dialog history when generating words.

Encoder
The encoder encodes the dialogue history into a fixed context vector. Here we adopt the context-aware memory as our encoder module. As shown in the left part of Figure 2, each word is extended into the following parts: 1) the token itself, 2) a turn tag, and 3) an identity tag. For example, if in the first turn a user says "hello" and the system responds "may I help you", the input is concatenated as [(hello, t1, user), (may, t1, sys), (I, t1, sys), (help, t1, sys), (you, t1, sys)]. Here sys means the word comes from the dialog system and user means it comes from the user; t1 indicates that the word is from the first turn. Each element is transformed into a vector by embedding lookup, and we sum the vectors in each tuple to form the input sequence of the context-aware memory. Using a fixed vector to query the memory, a context vector c is obtained.
Figure 3: Decoder module with two attention hops.
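The input construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' preprocessing: the whitespace tokenizer, the function names `flatten_history` and `embed_and_sum`, the embedding dimension, and the random (untrained) embedding table shared by words and tags are all assumptions.

```python
import random
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def flatten_history(turns: List[Tuple[str, str]]) -> List[Triple]:
    """Expand every word into (token, turn tag, identity tag)."""
    triples: List[Triple] = []
    for turn_index, (user_utt, sys_utt) in enumerate(turns, start=1):
        turn_tag = f"t{turn_index}"
        for speaker, utterance in (("user", user_utt), ("sys", sys_utt)):
            for token in utterance.split():  # whitespace tokenizer: an assumption
                triples.append((token, turn_tag, speaker))
    return triples


def embed_and_sum(triples: List[Triple], dim: int = 4, seed: int = 0) -> List[List[float]]:
    """Look up one vector per symbol and sum the three vectors of each triple."""
    rng = random.Random(seed)
    table: Dict[str, List[float]] = {}  # hypothetical, randomly initialized embeddings

    def lookup(symbol: str) -> List[float]:
        if symbol not in table:
            table[symbol] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        return table[symbol]

    return [
        [sum(parts) for parts in zip(*(lookup(s) for s in triple))]
        for triple in triples
    ]


history = flatten_history([("hello", "may I help you")])
memory_input = embed_and_sum(history)
print(history)
```

Each row of `memory_input` corresponds to one memory cell before gating; in a trained model the table would be the hop's learned embedding matrix.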

Context-Aware Memory
To efficiently model the context information of sequential data, we present the context-aware memory.
Memory slots $n^k$ are structured by concatenating all input vectors as $n^k = \mathrm{cat}[n^k_1, n^k_2, \ldots, n^k_l]$, where $k$ stands for the $k$-th hop in the memory and $l$ is the length of the input. Each $n^k_i$ is the sum of the word and tag embeddings, using the hop's own randomly initialized embedding matrix $C^k$. We adopt the adjacent weight sharing scheme, which means $C^k$ is not only the input embedding matrix in the $k$-th hop but also the output embedding matrix in the $(k-1)$-th hop. We add a gating mechanism between memory cells, adopted in our case from the bidirectional GRU (Cho et al., 2014a). The context-dependent representation of the inputs is thus built from $\overrightarrow{n^k}$ and $\overleftarrow{n^k}$, the forward and backward representations of the inputs, respectively. The forward process follows the GRU update equations, whose weights and biases are trainable. Given the query vector $qc^k$, attention weights $\alpha^k_i$ are calculated over the memory cells $n^k$. The readout vector is the sum of the output memory matrix $n^{k+1}$ weighted by the corresponding attention weights, $\sum_i \alpha^k_i n^{k+1}_i$. Summing the query and readout vectors gives the output of the $k$-th hop, $oc^k = qc^k + \sum_i \alpha^k_i n^{k+1}_i$.
Note that $oc^k$ is also the query vector of the $(k+1)$-th hop.
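For reference, the forward gating pass and the attention step can be written out as below. This is a sketch under stated assumptions: it uses the standard GRU recurrence of (Cho et al., 2014a) with bias terms added, the symbol names $W_1, \ldots, W_6$, $b_1, b_2, b_3$, $z_j$, $r_j$, $\tilde{h}_j$ and $\alpha^k_i$ are chosen here for illustration, and the dot-product softmax scoring is the one commonly used in end-to-end memory networks rather than a form confirmed by the text.

```latex
\begin{aligned}
z_j &= \sigma\left(W_1 n^k_j + W_2 h_{j-1} + b_1\right) \\
r_j &= \sigma\left(W_3 n^k_j + W_4 h_{j-1} + b_2\right) \\
\tilde{h}_j &= \tanh\left(W_5 n^k_j + W_6 \left(r_j \odot h_{j-1}\right) + b_3\right) \\
h_j &= z_j \odot h_{j-1} + \left(1 - z_j\right) \odot \tilde{h}_j \\
\alpha^k_i &= \frac{\exp\left((qc^k)^{\top} n^k_i\right)}{\sum_{i'} \exp\left((qc^k)^{\top} n^k_{i'}\right)}
\end{aligned}
```

The backward pass runs the same recurrence from the last cell to the first. GRU variants differ in where the reset gate $r_j$ is applied, so the third line should be read as one common choice.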

Decoder
The decoder contains HMNs and an RNN controller, as shown in Figure 3. The controller controls the process of querying HMNs. At each time step, HMNs generate: 1) a readout vector $oc^1$, which is the output of the first hop in the history memory, and 2) the attention weights of the last hop in the two memories, called the history word distribution $P_{his}$ and the knowledge tuple distribution $P_{kb}$. The readout vector is concatenated with $h_t$ to predict the vocabulary distribution $P_{vocab}$. Formally, $P_{vocab} = \mathrm{softmax}(W_7 [h_t; oc^1])$, where $W_7$ is a trainable weight matrix. We adopt a simple strategy (Section 3.5) to select a word from the three distributions $P_{vocab}$, $P_{his}$ and $P_{kb}$.

Heterogeneous Memory Networks
HMNs stack two types of memory: 1) context-aware memory and 2) context-free memory. The dialog history is loaded into the context-aware memory, and knowledge base triples are loaded into the context-free memory. HMNs first accept a query vector as input and then walk through the context-aware memory. The final output $u^k$ of the last hop is employed to query the context-free memory. The context-aware memory has been detailed in Section 3.1.1. The context-free memory itself is an end-to-end memory network (Sukhbaatar et al., 2015). Compared with our context-aware memory, it has no gating mechanism. The input to the memory is the summed vectors of each knowledge triple, as depicted in the right part of Figure 2. Each hop owns a randomly initialized embedding matrix $C^k$. We denote the memory slots as $m^k$. The memory accepts a query vector and then follows the same attention, readout and output steps (Equations 5 to 7) to obtain the output $u^k$.
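The stacking of the two memories can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the names `hop`, `read_memory` and `hmn_step` are hypothetical, dot-product softmax attention is assumed, and the history memory is taken to already hold the gated (context-aware) cell representations.

```python
import math
from typing import List, Tuple

Vector = List[float]
Memory = List[Vector]  # one vector per memory cell


def softmax(scores: Vector) -> Vector:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hop(query: Vector, mem_in: Memory, mem_out: Memory) -> Tuple[Vector, Vector]:
    """One hop: attend over mem_in, read from mem_out, add the query back."""
    # Dot-product scoring is an assumption borrowed from end-to-end memory networks.
    weights = softmax([sum(q * c for q, c in zip(query, cell)) for cell in mem_in])
    readout = [
        sum(w * cell[d] for w, cell in zip(weights, mem_out))
        for d in range(len(query))
    ]
    return [q + r for q, r in zip(query, readout)], weights


def read_memory(query: Vector, hops: List[Memory]) -> Tuple[List[Vector], Vector]:
    """Run len(hops) - 1 hops; with adjacent sharing, hops[k + 1] is hop k's output memory."""
    outputs: List[Vector] = []
    weights: Vector = []
    for k in range(len(hops) - 1):
        query, weights = hop(query, hops[k], hops[k + 1])
        outputs.append(query)
    return outputs, weights


def hmn_step(h_t: Vector, history_hops: List[Memory], kb_hops: List[Memory]):
    """Query the history memory first, then use its last output to query the KB memory."""
    history_outputs, p_his = read_memory(h_t, history_hops)
    _, p_kb = read_memory(history_outputs[-1], kb_hops)
    return history_outputs[0], p_his, p_kb  # first-hop readout and the two pointer distributions


history_hops = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.2, 0.1]]]
kb_hops = [[[1.0, 1.0], [0.0, 0.0], [-1.0, 0.0]], [[0.1, 0.0], [0.0, 0.1], [0.0, 0.0]]]
oc1, p_his, p_kb = hmn_step([1.0, 0.0], history_hops, kb_hops)
print(p_his, p_kb)
```

With a single hop per memory, as in this toy input, the first-hop output is also the query passed to the knowledge base memory.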

Controller
We adopt a GRU as our controller. It accepts the output c from the encoder as its initial hidden state $h_0$. At each time step, it takes the embedding $E(g_{t-1})$ of the previously generated word $g_{t-1}$ and the previous hidden state $h_{t-1}$ as inputs. Formally, $h_t = \mathrm{GRU}(E(g_{t-1}), h_{t-1})$; then $h_t$ is used to query the HMNs.

Copy Mechanism
We adopt the copy mechanism to copy words from the memories. The attention weights in the last hop of the two memories, $P_{kb}$ and $P_{his}$, serve as the probabilities of the target word in those memories. If the target word does not appear in the inputs, the position index is the last position in the memory, which is a sentinel added in the preprocessing stage.
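A minimal sketch of how the pointer target for one memory could be derived, assuming a hypothetical sentinel symbol `$$$$` appended as the last memory position. The function name and the choice of the first matching position when a word occurs several times are assumptions; the text does not say which occurrence is used.

```python
from typing import List

SENTINEL = "$$$$"  # hypothetical sentinel symbol appended during preprocessing


def pointer_target(target_word: str, memory_words: List[str]) -> int:
    """Return the memory position to point at, or the sentinel position if absent."""
    if not memory_words or memory_words[-1] != SENTINEL:
        raise ValueError("memory must end with the sentinel")
    for index, word in enumerate(memory_words[:-1]):
        if word == target_word:
            return index  # first occurrence: an assumption
    return len(memory_words) - 1


kb_memory = ["rest_stop", "hotel_keen", SENTINEL]
print(pointer_target("hotel_keen", kb_memory), pointer_target("hello", kb_memory))
```

The same function would be applied separately to the history memory and the knowledge base memory, giving one target index per memory at each decoding step.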

Joint Learning
To learn the three distributions $P_{vocab}$, $P_{kb}$ and $P_{his}$ at each time step, the loss at the $t$-th time step is the negative log-likelihood of the predicted probability of the target word for that time step. Formally, $Loss_t = -\sum_{i} \log p_{it}$, where $p_{it}$ is the $t$-th word's probability in $i \in \{P_{vocab}, P_{kb}, P_{his}\}$.
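As a small worked example of the per-step loss, the sketch below sums the negative log-probabilities that the three distributions assign to their respective targets. Summing the three terms with equal weight is an assumption, as is the function name `step_loss`.

```python
import math


def step_loss(p_vocab_target: float, p_kb_target: float, p_his_target: float) -> float:
    """Negative log-likelihood of the target under the three distributions, summed."""
    # Each argument is the probability a distribution gives to its own target:
    # a vocabulary word for P_vocab, a memory position (or sentinel) for P_kb and P_his.
    return -sum(math.log(p) for p in (p_vocab_target, p_kb_target, p_his_target))


loss = step_loss(0.5, 0.5, 0.5)
print(loss)
```

The loss is zero only when all three distributions put probability one on their targets, so the pointers are trained to select the sentinel whenever the word is not in a memory.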

Word Selection Strategy
In our case, if the words with the highest probability in the $P_{his}$ and $P_{kb}$ distributions are not at sentinel positions, we directly compare their probabilities and select the word with the higher one. If one of the two distributions points to the sentinel position, the model selects the word with the highest probability in the other distribution. Finally, if both distributions point to sentinel positions, the word from $P_{vocab}$ is selected.
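The selection rule above translates directly into code. In this sketch the sentinel is assumed to sit at the last position of each memory, ties between the two memories are broken in favor of the history (an arbitrary choice), and `kb_words` is assumed to hold the one copyable word associated with each knowledge cell.

```python
from typing import List, Sequence


def argmax(probs: Sequence[float]) -> int:
    return max(range(len(probs)), key=probs.__getitem__)


def select_word(
    p_vocab: Sequence[float], p_his: Sequence[float], p_kb: Sequence[float],
    vocab: List[str], history_words: List[str], kb_words: List[str],
) -> str:
    """Pick the output word from the three distributions for one decoding step."""
    his_idx, kb_idx = argmax(p_his), argmax(p_kb)
    his_is_sentinel = his_idx == len(p_his) - 1
    kb_is_sentinel = kb_idx == len(p_kb) - 1
    if not his_is_sentinel and not kb_is_sentinel:
        # Both memories propose a word: keep the more probable one.
        if p_his[his_idx] >= p_kb[kb_idx]:
            return history_words[his_idx]
        return kb_words[kb_idx]
    if not his_is_sentinel:
        return history_words[his_idx]
    if not kb_is_sentinel:
        return kb_words[kb_idx]
    return vocab[argmax(p_vocab)]  # both pointers chose the sentinel


vocab = ["the", "is", "ok"]
history_words = ["where", "hotel", "<sentinel>"]
kb_words = ["rest_stop", "5_miles", "<sentinel>"]
p_vocab = [0.1, 0.2, 0.7]
print(select_word(p_vocab, [0.1, 0.6, 0.3], [0.8, 0.1, 0.1], vocab, history_words, kb_words))
```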

Datasets
As the proposed approach is quite general, the model can be applied to any task-oriented dialogue dataset with conversation and knowledge base data. To evaluate and compare the results with the state-of-the-art methods in multiple dimensions, we choose three popular task conversation datasets: DSTC 2, the Key-Value Retrieval dataset and the (6) dialog bAbI tasks. Table 1 shows the statistics of the datasets.
• Key-Value Retrieval dataset. This dataset releases a corpus of 3,031 multi-turn dialogues. The dialogues cover three different domains: calendar scheduling, weather information retrieval, and point-of-interest navigation.
• The (6) dialog bAbI tasks (Bordes and Weston, 2016). The (6) dialog bAbI tasks are a set of five subtasks within the goal-oriented context of restaurant reservations. Conversations are grounded with an underlying knowledge base of restaurants and their properties (location, type of cuisine, etc.). As tasks 1 and 2 are already solved very well, we only test our model on tasks 3 to 5 and their OOV (out-of-vocabulary) variants, in which entities (e.g. restaurant names) in the test sets may not have been seen during training.
• The Dialog State Tracking Challenge 2 (DSTC 2). DSTC 2 is a research challenge focused on improving the state-of-the-art in tracking the state of spoken dialogue systems. DSTC 2's training dialogues were gathered using Amazon Mechanical Turk related to restaurant search.
For all datasets, we employ only the original conversation and knowledge base information and drop the other labels, e.g. slot filling labels. We use several metrics over all datasets to evaluate performance in multiple dimensions. To evaluate the context-aware memory networks, we also test HMNs with only context-free memory on the dialog bAbI tasks.

Evaluation Method
To compare with the original baselines of each dataset, we apply to each dataset the same evaluation methods as in the original papers of the datasets described in Section 4.1.
• Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002). BLEU has been widely employed in evaluating sequence generation including machine translation, text summarization, and dialogue systems. BLEU calculates the n-gram precision which is the fraction of n-grams in the candidate text which is present in any of the reference texts.
• F1 Score (F-measure): F1 evaluates the model's ability in terms of precision and recall, which is more comprehensive than just using precision or recall measure. We adopt F1 to evaluate if a model can extract information from a knowledge base precisely.
• Per-response accuracy and Per-dialog accuracy. Per-response and Per-dialog accuracy count the percentage of responses and of dialogs, respectively, that are correct. Any incorrect word makes a response, and hence the dialog containing it, incorrect. Accuracy shows whether the model is able to learn the distribution needed to reproduce factual details.
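The accuracy and F1 metrics listed above can be sketched as follows. The exact-match comparison on whitespace tokens and the set-based entity F1 are simplifying assumptions; the original evaluation scripts of each dataset may normalize text or aggregate counts differently.

```python
from typing import List, Set, Tuple

Dialog = List[Tuple[str, str]]  # (predicted response, gold response) pairs


def accuracies(dialogs: List[Dialog]) -> Tuple[float, float]:
    """Return (per-response accuracy, per-dialog accuracy)."""
    total_responses = sum(len(dialog) for dialog in dialogs)
    if not dialogs or total_responses == 0:
        raise ValueError("need at least one dialog with one response")
    correct_responses = 0
    correct_dialogs = 0
    for dialog in dialogs:
        matches = [pred.split() == gold.split() for pred, gold in dialog]
        correct_responses += sum(matches)
        correct_dialogs += all(matches)  # one wrong response fails the whole dialog
    return correct_responses / total_responses, correct_dialogs / len(dialogs)


def entity_f1(predicted: Set[str], gold: Set[str]) -> float:
    """Harmonic mean of precision and recall over entity sets."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


dialogs = [[("a b", "a b"), ("c", "c")], [("a", "a"), ("x", "y")]]
per_response, per_dialog = accuracies(dialogs)
print(per_response, per_dialog, entity_f1({"a", "b"}, {"b", "c"}))
```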

Baselines and Training Setup
The hyper-parameter settings are adopted as the best-practice settings for each training set, following the best experimental results of (Madotto et al., 2018) on the baselines SEQ2SEQ and Mem2Seq. Detailed models and their settings are as follows:
• Sequence to sequence. For SEQ2SEQ, we adopt one-layer LSTMs as encoder and decoder. For the Key-Value Retrieval dataset, the hidden size is set to 512 and the dropout rate is 0.3. On the bAbI dataset, the hidden size and dropout rate are 128 and 0.1 for task 3, and 256 and 0.1 for tasks 4 and 5. Learning rates are set to 0.001 for bAbI and 0.0001 for DSTC 2 and the Key-Value Retrieval dataset.
• SEQ2SEQ + Attention. We adopt the attention mechanism (Luong et al., 2015a) commonly used in neural machine translation. On dataset bAbI, hidden size and the dropout rate are 256 and 0.1 for task 3 and 4, 128 and 0.1 for task 5. For Key-Value Retrieval dataset, hidden size and dropout rate are 512 and 0.3. On the DSTC 2 task, hidden size is set to 353 and word embedding size is 300 (same with original work).
• Mem2Seq. Except 128 in task 3, hidden size in other tasks is 256. The dropout rate is set to 0.2 in task 3, 4 and Key-Value Retrieval dataset, 0.1 in task 5 and DSTC 2 dataset. We adopt three hops in DSTC 2 and Key-Value Retrieval dataset.
To test the performance of the context-aware memory, we also build a variant with only context-free memories (HMNs-CFO), in which another context-free memory encodes the dialogue history instead of the context-aware memory in HMNs. All the other structure and parameter settings of this model are the same as in HMNs.
All models are tested with various hyper-parameter settings to get their best performance, e.g. hidden size selected from [64, 128, 256, 512]. Note that settings taken from the datasets' original work are also tested, e.g. a hidden size of 353 for SEQ2SEQ + Attention on the Key-Value Retrieval dataset.
During training, all experiments employ the teacher-forcing scheme: with probability 50%, the decoder is fed the gold target of the previous time step, and otherwise the highest-probability word it generated. We also randomly mask input tokens with UNK according to the dropout rate.
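The two training-time tricks in the last paragraph can be sketched as below. The function names are hypothetical, and drawing the teacher-forcing decision independently at every decoding step is an assumption; some implementations draw it once per sequence or batch.

```python
import random
from typing import List


def next_decoder_input(gold_prev: str, predicted_prev: str,
                       rng: random.Random, teacher_forcing_ratio: float = 0.5) -> str:
    """Feed the gold previous word with the given probability, else the model's own guess."""
    return gold_prev if rng.random() < teacher_forcing_ratio else predicted_prev


def mask_input(tokens: List[str], dropout_rate: float,
               rng: random.Random, unk: str = "UNK") -> List[str]:
    """Independently replace each input token with UNK at the dropout rate."""
    return [unk if rng.random() < dropout_rate else token for token in tokens]


rng = random.Random(0)
print(next_decoder_input("hotel", "motel", rng))
print(mask_input(["where", "is", "hotel", "keen"], 0.3, rng))
```

Masking inputs with UNK pushes the model to rely on the copy pointers rather than on memorized entity embeddings, which is one plausible reason for using it with out-of-vocabulary test sets.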

Results and Analysis
The best results of the baselines and HMNs are gathered into tables and figures. Table 2 shows the results of the models on the Key-Value Retrieval dataset. Except for F1 scores on Calendar Scheduling, HMNs get significantly better results on all benchmarks compared with the state-of-the-art models. HMNs' BLEU score is even higher than the reported human results. The results show our model's outstanding performance in generating fluent and accurate responses in most tasks. Examples generated by our approach and the baselines are given in Table 5. These two examples are randomly selected from all generated sentences. Compared with the human-written sentences, although the entities and sentences differ from the gold answer in example one, our approach is able to produce more fluent and accurate sentences. However, on the weather forecasting task, neither HMNs nor Mem2Seq outperforms SEQ2SEQ. We will discuss this in the next section.
Table 4 shows that our model gets the best F1 score on the DSTC 2 dataset, while SEQ2SEQ with attention gets the best BLEU result.
Table 3 shows the results of the models on the bAbI tasks. HMNs and Mem2Seq adopt one-hop attention only, and all results are the best performance of each model within 100 epochs. HMNs achieve the best results on most tasks except T5. HMNs-CFO also outperforms the other models. This demonstrates that both training multiple distributions over heterogeneous information and employing the context-aware memory benefit the end-to-end dialogue system. The improvements in per-dialogue accuracy on the out-of-vocabulary tests are even more significant. Figure 4 shows how the total loss of HMNs and HMNs-CFO changes over time. HMNs learn significantly faster.
Although automatic metrics cannot fully capture the diversity of human expression, existing dialogue systems aim at generating sentences by learning the patterns of the training data, so we believe BLEU is still an important metric for comparing similar models' ability to learn sentence patterns. Human results nevertheless show that end-to-end systems still have a long way to go (60.7 to 43.1). Compared to other models, HMNs sig-

Context-Aware Memory
To show whether the context-aware memory benefits conversation learning, we also tested HMNs-CFO, which uses context-free memory only, on the bAbI tasks. From Table 3, we observe that HMNs-CFO is significantly better than the original Mem2Seq as well as SEQ2SEQ + attention in several results and only loses slightly on task 4 (89.3 to 90.5). One reason is that it is difficult for a single memory to learn the best distribution over different sources. Encoding the sequential dialogue history and the grounding knowledge separately can learn two better distributions than one general but suboptimal distribution. This also indicates that using the query vector generated by the history memory to retrieve information from the knowledge base memory is reasonable. As HMNs get the best results in all tasks except one, and given the training speed of HMNs and HMNs-CFO (Figure 4), the context-aware memory clearly learns the representation of the dialogue history much better and faster, which also demonstrates the importance of incorporating context information in dialogue systems. HMNs outperform HMNs-CFO not only on BLEU but also on entity F1 on most tasks, showing that building a good representation of the dialogue history benefits knowledge reasoning and helps to improve the context-free memory by issuing a good query vector.
From the above, we can conclude that both the stacked memory network architecture and the use of the context-aware memory to load sequential information can improve the performance of retrieving knowledge and generating sentences.

Shortcomings
From the results in Table 2, we note that HMNs and Mem2Seq failed on the weather forecasting task. We analysed the average number of knowledge pairs in the weather forecasting task and found it to be nearly three times that of the other two tasks. We then carried out another experiment that first narrows the KB candidates by performing a matching preprocessing operation, after which the Weather Ent. F1 result of our method climbs to more than 48, which is the best. This may indicate that this kind of memory network has difficulties in handling a large-scale knowledge base. Performing a matching operation to narrow the candidate knowledge space is therefore critical for a real-world large-scale knowledge base.
In this paper, we only consider sequential data and knowledge triple data. To integrate more types of information, the model needs to add other memory networks, e.g. graph neural network augmented memory networks for graph-structured data.

Conclusion
In this paper, we propose a model that is able to incorporate heterogeneous information in an end-to-end dialogue system. The model applies Heterogeneous Memory Networks (HMNs) to model sequential history and a structured database. Results on several datasets show that the model can significantly improve the performance of response generation. Our proposed context-aware memory networks show outstanding performance in learning the distribution over dialogue history and retrieving knowledge. We present the possibility of efficiently using various structured data in end-to-end task-oriented dialogue without any extra labeling or module training.