Mem2Seq: Effectively Incorporating Knowledge Bases into End-to-End Task-Oriented Dialog Systems

End-to-end task-oriented dialog systems usually suffer from the challenge of incorporating knowledge bases. In this paper, we propose a novel yet simple end-to-end differentiable model called memory-to-sequence (Mem2Seq) to address this issue. Mem2Seq is the first neural generative model that combines multi-hop attention over memories with the idea of pointer networks. We empirically show how Mem2Seq controls each generation step, and how its multi-hop attention mechanism helps in learning correlations between memories. In addition, our model is quite general and requires no complicated task-specific designs. As a result, we show that Mem2Seq can be trained faster and attain state-of-the-art performance on three different task-oriented dialog datasets.


Introduction
Task-oriented dialog systems help users to achieve specific goals with natural language, such as restaurant reservation and schedule arrangement. Traditionally, they have been built with several pipelined modules: language understanding, dialog management, knowledge query, and language generation (Williams and Young, 2007; Hori et al., 2009; Lee et al., 2009; Levin et al., 2000; Young et al., 2013). Moreover, the ability to query external Knowledge Bases (KBs) is essential in task-oriented dialog systems, since the responses are guided not only by the dialog history but also by the query results (e.g. Table 1). However, despite the stability of such pipelined systems via combining domain-specific knowledge and slot-filling techniques, modeling the dependencies between modules is complex and the KB interpretation requires human effort.

* These two authors contributed equally.

Table 1: Responses generated by different models for a navigation-domain example from the In-Car Assistant dataset.
1st turn
  Seq2Seq: I have a away from away would you like the address
  +Attn: I have a listing for a place that serves tea that is 5 miles away
  Ptr-Unk: There is a away you like would you like more info
  Mem2Seq: The nearest tea shop is Palo Alto Cafe located 4 miles away, would you like directions there?
  GOLD: Palo Alto Cafe is 4 miles away and serves coffee and tea. Do you want the address?
2nd turn
  Seq2Seq: The is at at
  +Attn: The address is 329 El Camino Real and it's 3 miles away and there is no traffic
  Ptr-Unk: Palo Alto Cafe is 4 miles away and PAD is no traffic
  Mem2Seq: Palo Alto Cafe is 4 miles away at 436 Alger Drive
  GOLD: Palo Alto is located at 436 Alger Dr.

Recently, end-to-end approaches for dialog modeling, which use recurrent neural network (RNN) encoder-decoder models, have shown promising results (Serban et al., 2016; Wen et al., 2017; Zhao et al., 2017). Since they can directly map plain text dialog history to the output responses, and the dialog states are latent, there is no need for hand-crafted state labels. Moreover, attention-based copy mechanisms (Gulcehre et al., 2016; Eric and Manning, 2017) have recently been introduced to copy words directly from the input sources to the output responses. With such mechanisms, even when unknown tokens appear in the dialog history, the models are still able to produce correct and relevant entities.
However, although the above mentioned approaches were successful, they still suffer from two main problems: 1) They struggle to effectively incorporate external KB information into the RNN hidden states (Sukhbaatar et al., 2015), since RNNs are known to be unstable over long sequences. 2) Processing long sequences is very time-consuming, especially when using attention mechanisms.
On the other hand, end-to-end memory networks (MemNNs) are recurrent attention models over a possibly large external memory (Sukhbaatar et al., 2015). They write external memories into several embedding matrices, and use query vectors to read memories repeatedly. This approach can memorize external KB information and rapidly encode long dialog history. Moreover, the multi-hop mechanism of MemNN has empirically been shown to be essential in achieving high performance on reasoning tasks (Bordes and Weston, 2017). Nevertheless, MemNN simply chooses its responses from a predefined candidate pool rather than generating word-by-word. In addition, the memory queries need explicit design rather than being learned, and the copy mechanism is absent.
To address these problems, we present a novel architecture that we call Memory-to-Sequence (Mem2Seq) to learn task-oriented dialogs in an end-to-end manner. In short, our model augments the existing MemNN framework with a sequential generative architecture, using global multi-hop attention mechanisms to copy words directly from dialog history or KBs. We summarize our main contributions as follows: 1) Mem2Seq is the first model to combine multi-hop attention mechanisms with the idea of pointer networks, which allows us to effectively incorporate KB information. 2) Mem2Seq learns how to generate dynamic queries to control the memory access. In addition, we visualize and interpret the model dynamics among hops for both the memory controller and the attention. 3) Mem2Seq can be trained faster and achieves state-of-the-art results on several task-oriented dialog datasets.

Model Description
Mem2Seq is composed of two components: the MemNN encoder and the memory decoder, as shown in Figure 1. The MemNN encoder creates a vector representation of the dialog history. Then the memory decoder reads and copies the memory to generate a response. We define all the words in the dialog history as a sequence of tokens $X = \{x_1, \dots, x_n, \$\}$, where $ is a special character used as a sentinel, and the KB tuples as $B = \{b_1, \dots, b_l\}$. We further define $U = [B; X]$ as the concatenation of the two sets X and B, $Y = \{y_1, \dots, y_m\}$ as the set of words in the expected system response, and $PTR = \{ptr_1, \dots, ptr_m\}$ as the pointer index set:

$$ptr_i = \begin{cases} \max(z) & \text{if } \exists z \text{ s.t. } y_i = u_z \\ n + l + 1 & \text{otherwise} \end{cases} \tag{1}$$

where $u_z \in U$ is the input sequence and $n + l + 1$ is the sentinel position index.
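The construction of the pointer index set can be illustrated with a minimal sketch. The function name `pointer_indices`, the 0-based positions (so the sentinel sits at the last index rather than at n + l + 1), and the choice of the last matching position when a word occurs several times in the memory are assumptions made for illustration, not the authors' implementation.

```python
def pointer_indices(memory_words, target_words):
    """Return one pointer index per target word (0-based positions).

    memory_words: the word copied from each memory slot (KB entries first,
    then dialog history), with the sentinel "$" as the last element.
    """
    sentinel_pos = len(memory_words) - 1
    # Later occurrences overwrite earlier ones, so each word maps to its
    # last position in the memory (assumption: the largest index is used).
    last_pos = {word: i for i, word in enumerate(memory_words[:-1])}
    # Words that are absent from the memory point to the sentinel.
    return [last_pos.get(y, sentinel_pos) for y in target_words]


# Hypothetical toy memory: one KB object followed by a short dialog history.
memory = ["5_miles", "the_westin", "where", "is", "the_westin", "$"]
target = ["the_westin", "is", "5_miles", "away"]
ptr = pointer_indices(memory, target)
```

Here "away" does not occur in the memory, so its pointer is the sentinel position, which signals that the word has to come from the vocabulary distribution.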

Memory Encoder
Mem2Seq uses a standard MemNN with adjacent weight tying (Sukhbaatar et al., 2015) as an encoder. The input of the encoder is the word-level information in $U$. The memories of the MemNN are represented by a set of trainable embedding matrices $C = \{C^1, \dots, C^{K+1}\}$, where each $C^k$ maps tokens to vectors, and a query vector $q^k$ is used as a reading head. The model loops over $K$ hops and computes the attention weights at hop $k$ for each memory $i$ using:

$$p^k_i = \mathrm{Softmax}\big((q^k)^T C^k_i\big) \tag{2}$$

where $C^k_i = C^k(x_i)$ is the memory content in position $i$, and $\mathrm{Softmax}(z_i) = e^{z_i} / \sum_j e^{z_j}$. Here, $p^k$ is a soft memory selector that decides the memory relevance with respect to the query vector $q^k$. Then, the model reads out the memory $o^k$ by the weighted sum over $C^{k+1}$:

$$o^k = \sum_i p^k_i C^{k+1}_i \tag{3}$$

Then, the query vector is updated for the next hop by using $q^{k+1} = q^k + o^k$. The result from the encoding step is the memory vector $o^K$, which will become the input for the decoding step.
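The multi-hop read described above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the helper names (`softmax`, `dot`, `multi_hop_read`), the list-of-lists layout of the embedded memories, and the toy numbers are assumptions, and a real model would use a tensor library with trainable embedding matrices.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def multi_hop_read(query, memories):
    """Run K attention hops over already-embedded memories.

    memories holds K + 1 lists; memories[k][i] is the embedding of memory
    slot i under the (k + 1)-th embedding matrix. Hop k attends with
    memories[k] and reads out from memories[k + 1] (adjacent weight tying).
    Returns the last read-out vector and the attention weights of each hop.
    """
    q = list(query)
    attentions = []
    o = None
    for k in range(len(memories) - 1):
        # Attention weights: softmax of query/memory dot products.
        p = softmax([dot(q, c) for c in memories[k]])
        # Read-out: attention-weighted sum over the next embedding matrix.
        o = [sum(p_i * c[d] for p_i, c in zip(p, memories[k + 1]))
             for d in range(len(q))]
        # Query update for the next hop.
        q = [q_d + o_d for q_d, o_d in zip(q, o)]
        attentions.append(p)
    return o, attentions


# Toy example: 3 memory slots, embedding size 2, K = 2 hops.
memories = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.0], [0.0, 0.5], [0.5, 0.5]],
    [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
]
o_K, attn = multi_hop_read([0.0, 0.0], memories)
```

With an all-zero query, the first hop attends uniformly over the memory, and later hops sharpen as the query accumulates read-out information.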

Memory Decoder
The decoder uses an RNN and a MemNN. The MemNN is loaded with both X and B, since we use both dialog history and KB information to generate a proper system response. A Gated Recurrent Unit (GRU) (Chung et al., 2014) is used as a dynamic query generator for the MemNN. At each decoding step $t$, the GRU gets the previously generated word and the previous query as input, and it generates the new query vector. Formally:

$$h_t = \mathrm{GRU}\big(C^1(\hat{y}_{t-1}), h_{t-1}\big) \tag{4}$$

Then the query $h_t$ is passed to the MemNN, which will produce the token, where $h_0$ is the encoder vector $o^K$. At each time step, two distributions are generated: one over all the words in the vocabulary ($P_{vocab}$), and one over the memory contents ($P_{ptr}$), which are the dialog history and KB information. The first, $P_{vocab}$, is generated by concatenating the first hop attention read out and the current query vector:

$$P_{vocab}(\hat{y}_t) = \mathrm{Softmax}\big(W_1 [h_t; o^1]\big) \tag{5}$$

where $W_1$ is a trainable parameter. On the other hand, $P_{ptr}$ is generated using the attention weights at the last MemNN hop of the decoder: $P_{ptr} = p^K_t$. Our decoder generates tokens by pointing to the input words in the memory, which is a similar mechanism to the attention used in pointer networks (Vinyals et al., 2015).
We designed our architecture in this way because we expect the attention weights in the first and the last hop to show a "looser" and "sharper" distribution, respectively. To elaborate, the first hop focuses more on retrieving memory information and the last one tends to choose the exact token leveraging the pointer supervision. Hence, during training all the parameters are jointly learned by minimizing the sum of two standard cross-entropy losses: one between $P_{vocab}(\hat{y}_t)$ and $y_t \in Y$ for the vocabulary distribution, and one between $P_{ptr}(\hat{y}_t)$ and $ptr_t \in PTR$ for the memory distribution.
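As a small illustration of this joint objective, the sketch below sums the two cross-entropy terms for one decoding step and over a sequence. The function names and the representation of each distribution as a plain list of probabilities are assumptions for illustration; it is not the authors' training code.

```python
import math


def cross_entropy(dist, target_index):
    """Negative log-probability assigned to the target index."""
    return -math.log(dist[target_index])


def step_loss(p_vocab, p_ptr, y_t, ptr_t):
    """Sum of the vocabulary loss and the pointer loss for one step.

    y_t is the index of the gold word in the vocabulary and ptr_t is the
    gold pointer index (the sentinel position if the word is not in memory).
    """
    return cross_entropy(p_vocab, y_t) + cross_entropy(p_ptr, ptr_t)


def sequence_loss(steps):
    """steps: iterable of (p_vocab, p_ptr, y_t, ptr_t) tuples."""
    return sum(step_loss(*step) for step in steps)


# Toy step: 2-word vocabulary, 2 memory slots.
example = step_loss([0.5, 0.5], [0.25, 0.75], y_t=0, ptr_t=1)
```

Both terms are always computed, so the pointer distribution receives supervision at every step, including steps whose gold pointer is the sentinel.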

Sentinel
If the expected word does not appear in the memories, then $P_{ptr}$ is trained to produce the sentinel token $, as shown in Equation 1. Once the sentinel is chosen, our model generates the token from $P_{vocab}$; otherwise, it takes the memory content using the $P_{ptr}$ distribution. Basically, the sentinel token is used as a hard gate to control which distribution to use at each time step. A similar approach has been used by Merity et al. (2017) to control a soft gate in a language modeling task. With this method, the model does not need to learn a gating function separately as in Gulcehre et al. (2016), and is not constrained by a soft gate function as in See et al. (2017).
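The hard gate can be sketched as a single greedy decoding step. The function name `decode_token`, the convention that the sentinel is the last memory slot, and the toy vocabulary are assumptions for illustration only.

```python
def decode_token(p_vocab, p_ptr, vocab, memory_words):
    """Choose the output word for one step, using the sentinel as a hard gate.

    memory_words[i] is the word copied when slot i is pointed to; the last
    slot is assumed to hold the sentinel "$".
    """
    sentinel_pos = len(memory_words) - 1
    # Greedy choice over the memory distribution.
    ptr = max(range(len(p_ptr)), key=p_ptr.__getitem__)
    if ptr == sentinel_pos:
        # Sentinel selected: fall back to the vocabulary distribution.
        best = max(range(len(p_vocab)), key=p_vocab.__getitem__)
        return vocab[best]
    # Otherwise copy the pointed memory content.
    return memory_words[ptr]


vocab = ["is", "the", "away"]
memory_words = ["5_miles", "the_westin", "$"]
generated = decode_token([0.7, 0.2, 0.1], [0.1, 0.1, 0.8], vocab, memory_words)
copied = decode_token([0.7, 0.2, 0.1], [0.6, 0.3, 0.1], vocab, memory_words)
```

No separate gating network appears in this step: the choice between generating and copying falls out of the argmax over the pointer distribution.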

Memory Content
We store word-level content X in the memory module. Similar to Bordes and Weston (2017), we add temporal information and speaker information in each token of X to capture the sequential dependencies. For example, "hello t1 $u" means "hello" at time step 1 spoken by a user.
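A minimal sketch of how the word-level memory for the dialog history could be assembled is given below. The function name, the use of one time index per utterance, the "$s" tag for system turns, and appending the sentinel as the last slot are assumptions for illustration; only the "hello t1 $u" pattern comes from the text.

```python
def build_dialog_memory(turns):
    """Turn a dialog history into word-level memory slots.

    turns: list of (speaker, utterance) pairs, with speaker "u" for the user
    and "s" for the system (the "$s" tag is an assumed analogue of "$u").
    Each slot holds the word plus its temporal and speaker tags.
    """
    memory = []
    for t, (speaker, utterance) in enumerate(turns, start=1):
        for word in utterance.split():
            memory.append([word, f"t{t}", f"${speaker}"])
    # Sentinel slot at the end of the dialog history.
    memory.append(["$"])
    return memory


slots = build_dialog_memory([("u", "hello"), ("s", "hi there")])
```

In this sketch, the tokens in a slot would be embedded and combined into a single memory vector, so the same word spoken at different turns or by different speakers gets a different representation.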
On the other hand, to store B, the KB information, we follow the works of Miller et al. and use a (subject, relation, object) representation for each KB entry, for example the information of The Westin in Table 1: (The Westin, Distance, 5 miles). Thus, we sum word embeddings of the subject, relation, and object to obtain each KB memory representation. During the decoding stage, the object part is used as the generated word for $P_{ptr}$. For instance, when the KB tuple (The Westin, Distance, 5 miles) is pointed to, our model copies "5 miles" as an output word. Notice that only a specific section of the KB, relevant to a specific dialog, is loaded into the memory.

Table 2: Dataset statistics.
                 Task 1  Task 2  Task 3  Task 4  Task 5  DSTC2  In-Car
Avg. User turns  4       6.5     6.4     3.5     12.9    6.7    2.6
Avg. Sys turns   6       9.5     9.9     3.5     18.

The bAbI dialog includes five end-to-end dialog learning tasks in the restaurant domain, which are simulated dialog data. Tasks 1 to 4 are about API calls, refining API calls, recommending options, and providing additional information, respectively. Task 5 is the union of tasks 1-4. There are two test sets for each task: one follows the same distribution as the training set and the other has out-of-vocabulary (OOV) entity values that do not exist in the training set.
We also used dialogs extracted from the Dialog State Tracking Challenge 2 (DSTC2) with the refined version from Bordes and Weston (2017), which ignores the dialog state annotations. The main difference with bAbI dialog is that this dataset is extracted from real human-bot dialogs, which is noisier and harder since the bots made mistakes due to speech recognition errors or misinterpretations.
Recently, the In-Car Assistant dataset has been released, which is a human-human, multi-domain dialog dataset collected from Amazon Mechanical Turk. It has three distinct domains: calendar scheduling, weather information retrieval, and point-of-interest navigation. This dataset has shorter conversation turns, but the user and system behaviors are more diverse. In addition, the system responses vary more and the KB information is much more complicated. Hence, this dataset requires a stronger ability to interact with KBs, rather than dialog state tracking.

Training
We trained our model end-to-end using the Adam optimizer (Kingma and Ba, 2015), and chose the learning rate in the range [1e-4, 1e-3]. The MemNNs, both encoder and decoder, have hops K = 1, 3, 6 to show the performance difference. We use simple greedy search without any re-scoring techniques. The embedding size, which is also equivalent to the memory size and the RNN hidden size (i.e., including the baselines), was selected in the range [64, 512]. The dropout rate is set in the range [0.1, 0.4], and we also randomly mask some input words into unknown tokens to simulate the OOV situation, with the same ratio as the dropout rate. In all the datasets, we tuned the hyper-parameters with grid search over the validation set, using per-response accuracy as the measure for bAbI dialog and DSTC2, and BLEU score for the In-Car Assistant.
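The random input masking used to simulate OOV words can be sketched as follows. The function name `mask_words`, the "UNK" token string, and masking each word independently are assumptions for illustration.

```python
import random


def mask_words(words, ratio, unk="UNK", rng=random):
    """Replace each input word by the unknown token with probability `ratio`.

    Assumption: words are masked independently, using the same ratio as the
    dropout rate.
    """
    return [unk if rng.random() < ratio else w for w in words]


# Seeded generator so the toy example is repeatable.
masked = mask_words(["where", "is", "the_westin"], ratio=0.2, rng=random.Random(0))
```

Masking at training time forces the model to produce entities it cannot read from their embeddings, which encourages copying by position from the memory rather than relying on the vocabulary.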

Evaluation Metrics
Per-response/dialog Accuracy: A generated response is correct only if it is exactly the same as the gold response. A dialog is correct only if every generated response in the dialog is correct, which can be considered as the task-completion rate. Note that Bordes and Weston (2017) test their model by selecting the system response from predefined response candidates, that is, their system solves a multi-class classification task. Since Mem2Seq generates each token individually, evaluating with this metric is much more challenging for our model.
BLEU: It is a measure commonly used for machine translation systems (Papineni et al., 2002), but it has also been used in evaluating dialog systems (Eric and Manning, 2017; Zhao et al., 2017) and chat-bots (Ritter et al., 2011; Li et al., 2016). Moreover, BLEU score is a relevant measure in task-oriented dialog as there is not a large variance between the generated answers, unlike open domain generation (Liu et al., 2016). Hence, we include BLEU score in our evaluation (i.e. using the Moses multi-bleu.perl script).
Entity F1: We micro-average over the entire set of system responses and compare the entities in plain text. The entities in each gold system response are selected by a predefined entity list. This metric evaluates the ability to generate relevant entities from the provided KBs and to capture the semantics of the dialog (Eric and Manning, 2017; Eric et al., 2017). Note that the original In-Car Assis-
Table 3: Per-response and per-dialog (in parentheses) accuracy on bAbI dialogs. Mem2Seq achieves the highest average per-response accuracy and has the least out-of-vocabulary performance drop.
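The per-response and per-dialog accuracy and the micro-averaged entity F1 described in this section can be sketched as follows. The function names, the data layout, and the set-based matching of entities within each response are assumptions for illustration; the exact counting in the evaluation scripts may differ.

```python
def accuracies(dialogs):
    """dialogs: list of dialogs, each a list of (generated, gold) strings.

    Returns (per-response accuracy, per-dialog accuracy) using exact match.
    """
    total = correct = dialogs_correct = 0
    for dialog in dialogs:
        hits = [generated == gold for generated, gold in dialog]
        total += len(hits)
        correct += sum(hits)
        # A dialog counts only if every response in it is correct.
        dialogs_correct += all(hits)
    return correct / total, dialogs_correct / len(dialogs)


def micro_entity_f1(responses, entity_list):
    """responses: list of (generated_tokens, gold_tokens) pairs.

    Counts are pooled over all responses before computing F1 (micro-average).
    """
    tp = fp = fn = 0
    for generated, gold in responses:
        gold_entities = {w for w in gold if w in entity_list}
        generated_entities = {w for w in generated if w in entity_list}
        tp += len(gold_entities & generated_entities)
        fp += len(generated_entities - gold_entities)
        fn += len(gold_entities - generated_entities)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


toy_dialogs = [[("a", "a"), ("b", "b")], [("a", "a"), ("c", "d")]]
per_response, per_dialog = accuracies(toy_dialogs)

entities = {"palo_alto_cafe", "4_miles", "5_miles"}
toy_responses = [(["palo_alto_cafe", "is", "5_miles", "away"],
                  ["palo_alto_cafe", "is", "4_miles", "away"])]
f1 = micro_entity_f1(toy_responses, entities)
```

In the toy data, one wrong response out of four lowers per-response accuracy only slightly but makes its whole dialog incorrect, which is why per-dialog accuracy is the stricter measure.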

Experimental Results
We mainly compare Mem2Seq with 1, 3, and 6 hops against several existing models: query-reduction networks (QRN, Seo et al. (2017)), end-to-end memory networks (MemNN, Sukhbaatar et al. (2015)), and gated end-to-end memory networks (GMemNN, Liu and Perez (2017)). We also implemented the following baseline models: standard sequence-to-sequence (Seq2Seq) models with and without attention (Luong et al., 2015), and pointer to unknown (Ptr-Unk, Gulcehre et al. (2016)). Note that the results we list in Table 3 and Table 4 for QRN are different from the original paper because, based on their released code (https://github.com/uwnlp/qrn), we discovered that the per-response accuracy was not correctly computed. We simply modified the evaluation part and reported the results.
bAbI Dialog: In Table 3, we follow Bordes and Weston (2017) to compare the performance based on per-response and per-dialog accuracy. Mem2Seq with 6 hops achieves 97.9% per-response and 69.6% per-dialog accuracy in T5, and 84.5% and 2.3% for T5-OOV, which surpass existing methods by far. In T3 especially, which is the task of recommending restaurants based on their ranks, our model achieves promising results due to the memory pointer. In terms of per-response accuracy, our model generalizes well, with little performance loss on the OOV test data, while the others drop by around 15-20%. The performance gain on OOV data is also mainly attributed to the use of the copy mechanism. In addition, the effectiveness of hops is demonstrated in tasks 3-5, since they require reasoning ability over the KB information. Note that QRN, MemNN and GMemNN viewed bAbI dialog tasks as classification problems. Although their tasks are easier compared to our generative methods, Mem2Seq models can still surpass their performance.
Finally, one can find that Seq2Seq and Ptr-Unk models are also strong baselines, which further confirms that generative methods can also achieve good performance in task-oriented dialog systems (Eric and Manning, 2017).
DSTC2: In Table 4, the Seq2Seq models from Eric and Manning (2017) and the rule-based model from Bordes and Weston (2017) are reported. Mem2Seq has the highest entity F1 score, 75.3%, and a high BLEU score of 55.3. This further confirms that Mem2Seq can perform well in retrieving the correct entity, using the multiple hop mechanism without losing language modeling. Here, we do not report the results using match type (Bordes and Weston, 2017) or entity type (Eric and Manning, 2017) features, since this meta-information is not commonly available and we want to have an evaluation on plain input-output pairs. One can also find that Mem2Seq achieves per-response accuracy comparable (i.e. within a 2% margin) to that of other existing solutions. Note that the per-response accuracy for every model is less than 50% since the dataset is quite noisy and it is hard to generate a response that is exactly the same as the gold one.
In-Car Assistant: In Table 5, our model achieves the highest BLEU score, 12.6. In addition, Mem2Seq has shown promising results in terms of entity F1 score (33.4%), which is, in general, much higher than those of other baselines. Note that the numbers reported from Eric et al. (2017) are not directly comparable to ours, as we mention below. The other baselines such as Seq2Seq or Ptr-Unk have especially poor performance on this dataset, since it is very inefficient for RNN methods to encode longer KB information, which is the advantage of Mem2Seq.
Furthermore, we observe an interesting phenomenon that humans can easily achieve a high entity F1 score with a low BLEU score. This implies that stronger reasoning ability over entities (hops) is crucial, but the results may not be similar to the golden answer. We believe humans can produce good answers even with a low BLEU score, since there could be different ways to express the same concepts. Therefore, Mem2Seq shows the potential to successfully choose the correct entities.
Note that the results of the KV Retrieval Net baseline reported in Table 5 come from the original In-Car Assistant paper (Eric et al., 2017), where they simplified the task by mapping the expression of entities to a canonical form using named entity recognition (NER) and linking. Hence the evaluation is not directly comparable to our system. For example, their model learned to generate responses such as "You have a football game at football time with football party," instead of generating a sentence such as "You have a football game at 7 pm with John." Since there could be more than one football party or football time, their model does not learn how to access the KBs, but rather learns the canonicalized language model.
Time Per-Epoch: We also compare the training time in Figure 2. The experiments are set with batch size 16, and we report each model with the hyper-parameters that achieved the highest performance. One can observe that the training time is not that different for short input lengths (bAbI dialog tasks 1-4) and the gap becomes larger as the maximal input length increases. Mem2Seq is around 5 times faster in In-Car Assistant and DSTC2 compared to Seq2Seq with attention. This difference in training efficiency is mainly attributed to the fact that Seq2Seq models have input sequential dependencies which limit any parallelization. Moreover, Seq2Seq models cannot avoid encoding the KBs, whereas Mem2Seq only encodes the dialog history.

Analysis and Discussion
Memory Attention: Analyzing the attention weights has been frequently used to show the memory read-out, since it is an intuitive way to understand the model dynamics. Figure 8 shows the attention vector at the last hop for each generated token. Each column represents the $P_{ptr}$ vector at the corresponding generation step. Our model has a sharp distribution over the memory, which implies that it is able to select the right token from the memory. For example, the KB information "270 altarie walk" was retrieved at the sixth step, which is an address for "civic center garage". On the other hand, if the sentinel is triggered, then the generated word comes from the vocabulary distribution $P_{vocab}$. For instance, the third generation step triggered the sentinel, and "is" is generated from the vocabulary as the word is not present in the dialog history.
Multiple Hops: Mem2Seq shows how multiple hops improve the model performance in several datasets. Task 3 in the bAbI dialog dataset serves as an example, in which the systems need to recommend restaurants to users based on restaurant ranking from highest to lowest. Users can reject the recommendation and the system has to reason over the next highest restaurant. We found that there are two common patterns between hops among different samples: 1) the first hop is usually used to score all the relevant memories and retrieve information; 2) the last hop tends to focus on a specific token and makes mistakes when the attention is not sharp. Such mistakes can be attributed to a lack of hops, for some samples. For more information, we report two figures in the supplementary material.
Query Vectors: In Figure 4, the principal component analysis of Mem2Seq query vectors is shown for different hops. Each dot is a query vector $h_t$ at a decoding time step, and it has its corresponding generated word $y_t$. The blue dots are the words generated from $P_{vocab}$, which triggered the sentinel, and the orange ones are from $P_{ptr}$. One can find that in (a) hop 1, there is no clear separation between the two colors, but the dots of each color tend to group together. On the other hand, the separation becomes clearer in (b) hop 6 as each color clusters into several groups such as location, cuisine, and number. Our model tends to retrieve more information in the first hop, and points into the memories in the last hop.
Examples: Tables 1 and 6 show the generated responses of different models for two test set samples from the In-Car Assistant dataset. We report examples from this dataset since their answers are more human-like and not as structured and repetitive as others. Seq2Seq generally cannot produce related information, and sometimes fails in language modeling. Using attention helps with this issue, but it still rarely produces the correct entities. For example, Seq2Seq with attention generated 5 miles in Table 1 but the correct one is 4 miles. In addition, Ptr-Unk often cannot copy the correct token from the input, as shown by "PAD" in Table 1. On the other hand, Mem2Seq is able to produce the correct responses in these two examples. In particular in the navigation domain, shown in Table 1, Mem2Seq produces a different but still correct utterance. We report further examples from all the domains in the supplementary material.
Discussions: Conventional task-oriented dialog systems (Williams and Young, 2007), which are still widely used in commercial systems, require considerable human effort in system design and data collection. On the other hand, although end-to-end dialog systems are not perfect yet, they require much less human intervention, especially in the dataset construction, as raw conversational text and KB information can be used directly without the need for heavy preprocessing (e.g. NER, dependency parsing). To this extent, Mem2Seq is a simple generative model that is able to incorporate KB information with promising generalization ability. We also discovered that the entity F1 score may be a more comprehensive evaluation metric than per-response accuracy or BLEU score, as humans can normally choose the right entities but have very diversified responses. Indeed, we want to highlight that humans may have a low BLEU score despite their correctness because there may not be a large n-gram overlap between the given response and the expected one. However, this does not imply that there is no correlation between BLEU score and human evaluation.
, 2017, 2018) have also shown good results in such tasks. In each of these architectures, the output is produced by generating a sequence of tokens, or by selecting a set of predefined utterances. Sequence-to-sequence (Seq2Seq) models have also been used in task-oriented dialog systems (Zhao et al., 2017). These architectures have better language modeling ability, but they do not work well in KB retrieval. Even with sophisticated attention models (Luong et al., 2015; Bahdanau et al., 2015), Seq2Seq fails to map the correct entities to the generated input. To alleviate this problem, copy-augmented Seq2Seq models (Eric and Manning, 2017) were used. These models outperform utterance selection methods by copying relevant information directly from the KBs.
Copy mechanisms have also been used in question answering tasks. Less related to dialog systems, but related to our work, are the memory-based decoders and the non-recurrent generative models: 1) The Mem2Seq query generation phase used to access our memories can be seen as the memory controller used in Memory Augmented Neural Networks (MANN) (Graves et al., 2014, 2016). Similarly, memory encoders have been used in neural machine translation (Wang et al., 2016) and meta-learning applications (Kaiser et al., 2017). However, Mem2Seq differs from these models as follows: it uses multi-hop attention in combination with a copy mechanism, whereas other models use a single matrix representation. 2) Non-recurrent generative models (Vaswani et al., 2017), which rely only on self-attention mechanisms, are related to the multi-hop attention mechanism used in MemNN.

Conclusion
In this work, we present an end-to-end trainable Memory-to-Sequence model for task-oriented dialog systems. Mem2Seq combines the multi-hop attention mechanism in end-to-end memory networks with the idea of pointer networks to incorporate external information. We empirically show our model's ability to produce relevant answers using both the external KB information and the predefined vocabulary, and visualize how the multi-hop attention mechanisms help in learning correlations between memories. Mem2Seq is fast, general, and able to achieve state-of-the-art results in three different datasets.

Table 8: Selected hyper-parameters in each dataset for different hops. The values are the embedding dimension and the GRU hidden size, and the values in parentheses are the dropout rates. For all the models we used a learning rate of 0.001, with a decay rate of 0.5.

[Figure: memory axis labels list bAbI KB entries for the restaurants "resto paris expensive british" with 2stars, 3stars, 4stars, and 5stars, under the relations R rating, R price, R number, R location, R address, R cuisine, and R phone.]

Seq2seq: there is no snow in the forecast for this week
+Att: there is no snow in the forecast for durham for durham
Ptr-Unk: yes there is no snow in durham on week
Mem2Seq: no snow, the forecast does not predict any snow on Durham
GOLD: Durham does not have any snow predicted this week

GOLD: Today is gonna be warm in Oakland
Seq2seq: It is not snow in today in
+att: No there will be snow in Oakland today
Ptr-Unk: Today is gonna be PAD in
Mem2Seq: it will not snow in oakland today