A Working Memory Model for Task-oriented Dialog Response Generation

Recently, to incorporate external Knowledge Base (KB) information, one form of world knowledge, several end-to-end task-oriented dialog systems have been proposed. These models, however, tend to confound the dialog history with KB tuples and simply store them into one memory. Inspired by the psychological studies on working memory, we propose a working memory model (WMM2Seq) for dialog response generation. Our WMM2Seq adopts a working memory to interact with two separated long-term memories, which are the episodic memory for memorizing dialog history and the semantic memory for storing KB tuples. The working memory consists of a central executive to attend to the aforementioned memories, and a short-term storage system to store the “activated” contents from the long-term memories. Furthermore, we introduce a context-sensitive perceptual process for the token representations of dialog history, and then feed them into the episodic memory. Extensive experiments on two task-oriented dialog datasets demonstrate that our WMM2Seq significantly outperforms the state-of-the-art results in several evaluation metrics.


Introduction
Task-oriented dialog systems, such as hotel booking or technical support services, help users achieve specific goals through natural language. Compared with traditional pipeline solutions (Williams and Young, 2007; Young et al., 2013; Wen et al., 2017), end-to-end approaches have recently gained much attention (Zhao et al., 2017; Eric and Manning, 2017a; Lei et al., 2018), because they directly map dialog history to the output responses and consequently reduce the human effort needed for modular designs and hand-crafted state labels. To effectively incorporate KB information and perform knowledge-based reasoning, memory-augmented models have been proposed (Bordes et al., 2017; Seo et al., 2017; Eric and Manning, 2017b; Madotto et al., 2018; Raghu et al., 2018; Reddy et al., 2019; Wu et al., 2019). Bordes et al. (2017) and Seo et al. (2017) focused on retrieval models, which lack the ability to generate responses, while others incorporated memory (i.e. end-to-end memory networks, abbreviated as MemNNs; Sukhbaatar et al., 2015) and a copy mechanism (Gu et al., 2016) into a sequential generative architecture. However, most models tend to confound the dialog history with KB tuples and simply store them in one memory. A shared memory forces the memory reader to reason over two different types of data, which makes the task harder, especially when the memory is large. To address this problem, Reddy et al. (2019) very recently proposed separating the memories for modeling dialog context and KB results. In this paper, we adopt a working memory to interact with two long-term memories. Furthermore, unlike Reddy et al. (2019), we leverage the reasoning ability of MemNNs to instantiate the external memories.
Our intuition comes from two aspects. First, psychologists tend to break down long-term memory into episodic memory for events (e.g. visual and textual perceptual inputs) and semantic memory for facts (world knowledge, such as KB information), as not all memory of experiences is the same (Gazzaniga and Ivry, 2013). Second, a successful task-oriented dialog system needs more intelligence, and recent works suggest that a critical component of intelligence may be working memory (Sternberg and Sternberg, 2016). Hence, leveraging the knowledge from psychological studies (Baddeley and Hitch, 1974; Baddeley, 2000; Dosher, 2003), we explore working memory for dialog response generation. Our contributions are summarized as follows:
Firstly, inspired by the psychological studies on working memory, we propose WMM2Seq for dialog generation, which separates the storage of dialog history and KB information by using the episodic and semantic memories and then leverages the working memory to interact with them.
Secondly, we leverage two kinds of transformations (CNN and biGRU) to incorporate the context information for better token representations. This procedure can be seen as a part of perceptual processes before the episodic memory storage, and can alleviate the Out-Of-Vocabulary (OOV) problem.
Finally, our WMM2Seq outperforms the existing methods on several evaluation metrics on two task-oriented dialog datasets and shows a better reasoning ability in the OOV situation.
Figure 1 illustrates the flow of our WMM2Seq for dialog response generation. WMM2Seq can be seen as an encoder-decoder model, where the decoder is the Working Memory (WM), which interacts with two long-term memories (the episodic memory memorizing dialog history and the semantic memory storing KB information). As MemNN is well known for its multi-hop reasoning ability, we instantiate the encoder and the two memories with three different MemNNs (MemNN Encoder, E-MemNN and S-MemNN). Furthermore, we augment E-MemNN and S-MemNN with a copy mechanism, since these are the memories from which we need to copy tokens or entities. The encoder encodes the dialog history to obtain the high-level signal, a distributed intent vector. The WM consists of a Short-Term Storage system (STS) and a Central-EXE including an Attention Controller (Attn-Ctrl) and a rule-based word selection strategy. The Attn-Ctrl dynamically generates the attention control vector to query and reason over the two long-term memories and then stores three "activated" distributions into the STS. Finally, a generated token is selected from the STS under the word selection strategy at each decoder step.

Model Description
The symbols are defined in Table 1, and more details can be found in the supplementary material.

Table 1: Notation Table.
Symbol        Definition
x_i or y_i    a token in the dialog history or system response
$             a special token used as a sentinel (Madotto et al., 2018)
X             X = {x_1, ..., x_n, $}, the dialog history
Y             Y = {y_1, ..., y_m}, the expected response
b_i           one KB tuple, actually the corresponding entity
B             B = {b_1, ..., b_l, $}, the KB tuples
PTR_E         PTR_E = {ptr_{E,1}, ..., ptr_{E,m}}, dialog pointer index set; supervised information for copying words in dialog history
PTR_S         PTR_S = {ptr_{S,1}, ..., ptr_{S,m}}, KB pointer index set; supervised information for copying entities in KB tuples

We omit the subscript E or S and, following Madotto et al. (2018), define each pointer index set as:

\[
ptr_i =
\begin{cases}
\max(z) & \text{if } \exists\, z \text{ s.t. } y_i = xb_z \\
n_{xb} + 1 & \text{otherwise}
\end{cases}
\tag{1}
\]

where xb_z ∈ X or B is a token of the dialog history or a KB tuple according to the subscript (E or S), and n_{xb} + 1 is the sentinel position index, as n_{xb} is equal to the dialog history length n or the number of KB tuples l. The idea behind Eq. 1 is that we can obtain the positions to copy from by matching the target text with the dialog history or KB information. Furthermore, we hope this provides the model with accurate guidance on how to activate the two long-term memories.
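The pointer supervision described by Eq. 1 can be illustrated with a short sketch. The function name `pointer_indices` is hypothetical, positions are 0-indexed (so the sentinel sits at index `len(memory)` rather than n_{xb} + 1), and picking the last matching position is an assumption taken from the Mem2Seq convention that the text says it follows.

```python
def pointer_indices(target, memory):
    """Return one pointer index per target token.

    `memory` is the dialog history X or the KB entity list B WITHOUT the
    trailing sentinel; the sentinel is implicitly at index len(memory).
    Assumption: when a token occurs several times, the last position wins.
    """
    sentinel = len(memory)
    # Later positions overwrite earlier ones, so each token maps to its last index.
    last_pos = {tok: z for z, tok in enumerate(memory)}
    return [last_pos.get(tok, sentinel) for tok in target]


history = ["a", "restaurant", "in", "the", "west", "please"]
response = ["there", "is", "a", "restaurant", "in", "the", "west"]
print(pointer_indices(response, history))  # [6, 6, 0, 1, 2, 3, 4]
```

Tokens that cannot be copied ("there", "is") point to the sentinel position, which is what later lets the decoder fall back to the vocabulary distribution.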

MemNN Encoder
Here, in the context of our task, we give a brief description of the K-hop MemNN with adjacent weight tying; more details can be found in Sukhbaatar et al. (2015). The memory of the MemNN is represented by a set of trainable embedding matrices C = {C^1, ..., C^{K+1}}. Given the input tokens in the dialog history X, the MemNN first writes them into memories by Eq. 2 and then uses a query to iteratively read from them over multiple hops to reason about the required response by Eq. 3 and Eq. 4. For each hop k, we update the query by Eq. 5; the initial query is a learnable vector, as in Yang et al. (2016). The MemNN encoder finally outputs a user intent vector o^K.
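The write/read/update cycle described above can be sketched as follows. This is the standard end-to-end memory network formulation of Sukhbaatar et al. (2015) with adjacent weight tying, which we assume corresponds to Eqs. 2-5; the names `memnn_read` and `softmax` are illustrative, and the initial query is passed in rather than learned.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


def memnn_read(token_ids, C, q0):
    """K-hop read over one memory.

    C  : list of K+1 embedding matrices, each of shape (vocab, d).
    q0 : initial query of shape (d,).
    Returns the last-hop output o^K and the last-hop attention p^K.
    """
    q = q0
    K = len(C) - 1
    for k in range(K):
        A_in = C[k][token_ids]       # write: memory used for addressing at hop k
        A_out = C[k + 1][token_ids]  # adjacent tying: next matrix is read from
        p = softmax(A_in @ q)        # attention over memory cells
        o = p @ A_out                # weighted sum of memory contents
        q = q + o                    # query update for the next hop
    return o, p
```

With adjacent weight tying, the matrix read from at hop k is the one used for addressing at hop k+1, which is why K hops need K+1 matrices.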
To incorporate the context information, we explore two context-aware transformations TRANS(·) by replacing Eq. 2 with A^k_i = TRANS(C^k(x_i)), where TRANS(·) is instantiated as either a CNN or a biGRU, h_i is the resulting context-aware representation, and φ_e is a trainable embedding function. We combine MemNNs with TRANS(·) to alleviate the OOV problem when reasoning about memory contents.
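To make the effect of a context-aware transformation concrete, the sketch below applies a single convolutional filter bank over a window of neighboring token embeddings. It is only an illustration under assumed choices (odd window width, zero padding, tanh nonlinearity, the name `cnn_trans`); the exact CNN and biGRU definitions are not reproduced here.

```python
import numpy as np


def cnn_trans(E, W, b):
    """Context-aware token representations from a 1-D convolution.

    E : (n, d) token embeddings, e.g. phi_e(x_1), ..., phi_e(x_n).
    W : (width * d, d) filter weights, with an odd window width (assumed).
    b : (d,) bias.
    Returns h of shape (n, d), one vector per token.
    """
    n, d = E.shape
    width = W.shape[0] // d
    pad = width // 2
    padded = np.vstack([np.zeros((pad, d)), E, np.zeros((pad, d))])
    # Each row concatenates the embeddings in the window centered on token i.
    windows = np.stack([padded[i:i + width].reshape(-1) for i in range(n)])
    return np.tanh(windows @ W + b)
```

Because h_i depends on the neighbors of x_i, a token whose own embedding is uninformative (for instance an unknown word) still receives a representation shaped by its context, which is the intuition behind using TRANS(·) against the OOV problem.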

Working Memory Decoder
Inspired by the studies on working memory, we design our decoder as an attentional control system for dialog generation, which consists of the working memory and two long-term memories. As shown in Figure 1, we adopt the E-MemNN to memorize the dialog history X as described in Section 2.1, and store KB tuples into the S-MemNN without TRANS(·). We also incorporate additional temporal information and speaker information into dialog utterances as in Madotto et al. (2018) and adopt a (subject, relation, object) representation of KB information as in Eric and Manning (2017b). More details can be found in the supplementary material.
Having written the dialog history and KB tuples into E-MemNN and S-MemNN, we then use the WM to interact with them (to query and reason over them) to generate the response. At each decoder step, the Attn-Ctrl, instantiated as a GRU, dynamically generates the query vector q_t. The query q_t is used to access E-MemNN, activating the final query q_E = o^K_E, the vocabulary distribution P_vocab by Eq. 9 and the copy distribution for the dialog history P_{E·ptr}. When querying S-MemNN, we take the dialog history into account by using the query q'_t = q_E + q_t, and then obtain the copy distribution for KB entities P_{S·ptr}. The two copy distributions are obtained by augmenting the MemNNs with a copy mechanism, that is, P_{E·ptr} = p^K_{E,t} and P_{S·ptr} = p^K_{S,t}.
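The order in which the two memories are queried can be sketched as below. For brevity this uses a single attention hop and one matrix per memory instead of K-hop MemNNs, and it omits the GRU and the vocabulary distribution; `attend` and `decoder_step` are hypothetical names, and the memories are assumed to include a sentinel row.

```python
import numpy as np


def attend(memory, q):
    """One attention hop: returns the distribution over cells and the read vector."""
    scores = memory @ q
    w = np.exp(scores - scores.max())
    p = w / w.sum()
    return p, p @ memory


def decoder_step(q_t, episodic, semantic):
    """Query the episodic memory first, then the semantic memory.

    episodic : (n + 1, d) dialog-history cells plus sentinel.
    semantic : (l + 1, d) KB cells plus sentinel.
    """
    p_e, o_e = attend(episodic, q_t)      # copy distribution over dialog history
    q_e = o_e                             # plays the role of q_E = o^K_E
    p_s, _ = attend(semantic, q_e + q_t)  # KB query conditioned on the dialog history
    return p_e, p_s
```

Adding q_E to q_t before querying the KB is what stacks the semantic memory on top of the episodic one: the KB lookup sees both the decoder state and what was just retrieved from the dialog history.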
Now, three distributions, P_vocab, P_{E·ptr} and P_{S·ptr}, are activated and moved into the STS, and then a proper word is generated from the activated distributions. We use a rule-based word selection strategy that extends the sentinel idea in Madotto et al. (2018), as shown in Figure 1. If the expected word does not appear in either the episodic memory or the semantic memory, the two copy pointers are trained to produce the sentinel token and our WMM2Seq generates the token from P_vocab; otherwise, the token is generated by copying from either the dialog history or the KB tuples, which is decided by comparing the two copy distributions. If one of the two distributions points to the sentinel, we always select the other distribution; if neither does, we copy the token with the highest probability across the two distributions. During the training stage, all the parameters are jointly learned by minimizing the sum of three standard cross-entropy losses with the corresponding targets (Y, PTR_E and PTR_S).
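The rule-based word selection strategy described above can be written as a small function. This is a sketch: the name `select_word`, the 0-indexed sentinel at the last position of each copy distribution, and the tie-break in favor of the dialog history are assumptions made for illustration.

```python
def select_word(p_vocab, p_e, p_s, history, kb, vocab):
    """Pick the next token from the three activated distributions.

    p_e has len(history) + 1 entries and p_s has len(kb) + 1 entries;
    the last entry of each is the sentinel.
    """
    def argmax(p):
        return max(range(len(p)), key=p.__getitem__)

    e_idx, s_idx = argmax(p_e), argmax(p_s)
    e_sentinel = e_idx == len(history)
    s_sentinel = s_idx == len(kb)

    if e_sentinel and s_sentinel:
        return vocab[argmax(p_vocab)]  # nothing to copy: generate from vocabulary
    if e_sentinel:
        return kb[s_idx]               # only the KB pointer is active
    if s_sentinel:
        return history[e_idx]          # only the dialog pointer is active
    # Both pointers are active: copy from the more confident one.
    # Assumption: ties go to the dialog history.
    return history[e_idx] if p_e[e_idx] >= p_s[s_idx] else kb[s_idx]
```

For example, with `history = ["west", "food"]`, `kb = ["british"]` and `vocab = ["is", "a"]`, the call `select_word([0.3, 0.7], [0.1, 0.1, 0.8], [0.2, 0.8], history, kb, vocab)` falls back to the vocabulary because both pointers select the sentinel.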

Experiments
We conduct experiments on the simulated bAbI Dialogue dataset (Bordes et al., 2017) and the Dialog State Tracking Challenge 2 (DSTC2) (Henderson et al., 2014). For DSTC2, we adopt the refined version from Bordes et al. (2017); the statistics of both datasets are given in the supplementary material. Our model is trained end-to-end using the Adam optimizer (Kingma and Ba, 2014), and the responses are generated using greedy search without any rescoring techniques. The shared size of the embedding and hidden units is selected from [64, 512], and the default hop K = 3 is used for all MemNNs. The learning rate is fixed to 0.001 and the dropout ratio is sampled from [0.1, 0.4]. Furthermore, we randomly mask some memory cells with the same dropout ratio to simulate the OOV situation for both the episodic and semantic memories. The hyper-parameters for the best models are given in the supplementary material.
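The random masking of memory cells can be sketched as follows. How a cell is masked is not specified above, so replacing the token with an unknown-word placeholder is an assumption, as are the names `mask_memory` and `<unk>`.

```python
import random


def mask_memory(tokens, ratio, unk="<unk>", rng=None):
    """Replace each memory cell by `unk` with probability `ratio`.

    Assumption: a masked cell is represented by an unknown-word token;
    the length and order of the memory are left unchanged.
    """
    rng = rng or random.Random()
    return [unk if rng.random() < ratio else tok for tok in tokens]


# Example: mask roughly 20% of the cells of a small dialog-history memory.
masked = mask_memory(["i", "want", "cheap", "food", "in", "the", "west"], 0.2)
```

Applied during training only, this forces the pointers to rely on context rather than on the identity of individual tokens or entities.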

Results and Analysis
We use Per-response/dialog Accuracy (Bordes et al., 2017), BLEU (Papineni et al., 2002) and Entity F1 (Madotto et al., 2018) to compare the performance of different models. The baseline models are Seq2Seq+Attn (Luong et al., 2015), Pointer to Unknown (Ptr-Unk; Gulcehre et al., 2016), Mem2Seq (Madotto et al., 2018), Hierarchical Pointer Generator Memory Network (HyP-MN; Raghu et al., 2018) and Global-to-Local Memory Pointer (GLMP; Wu et al., 2019).
Automatic Evaluation: The results on the bAbI dialog dataset are given in Table 2. Our model performs much better in the OOV situation and is on par with the best results on T5. Moreover, our model can perfectly issue API calls (task 1), update API calls (task 2) and provide extra information (task 4). As task 5 is a combination of tasks 1-4, our best performance on T5-OOV exhibits a powerful ability to reason over unseen dialog history and KB tuples. This reasoning ability is further supported by the performance improvements on the DSTC2 dataset according to several metrics in Table 3. In particular, a significant improvement in entity F1 scores indicates that our model can choose the right entities and incorporate them into responses more naturally (with the highest BLEU scores). Furthermore, there is no significant difference between the two kinds of the transformation TRANS(·).
Ablation Study: To better understand the components used in our model, we report our ablation studies from three aspects. First, we remove the context-sensitive transformation TRANS(·) and find significant performance degradation. This suggests that perceptual processes are a necessary step before storing perceptual information (the dialog history) into the episodic memory and that they are important for the performance of working memory. Second, we find that WMM2Seq outperforms Mem2Seq, which uses a unified memory to store dialog history and KB information. We conclude that the separation of context memory and KB memory benefits the performance, as WMM2Seq performs well with fewer parameters than Mem2Seq on task 5. Finally, we additionally analyze how the multi-hop attention mechanism helps by showing the performance differences between the hop K = 1 and the default hop K = 3. Though multi-hop attention strengthens the reasoning ability and improves the results, we find that the performance difference between the hops K = 1 and K = 3 is not as pronounced as that reported in Madotto et al. (2018) and Wu et al. (2019). Furthermore, our model performs well even with one hop, which we mainly attribute to the reasoning ability of working memory. The separation of memories and the stacking of S-MemNN on E-MemNN also help considerably, because the whole external memory, consisting of the episodic and semantic memories, can be seen as a multi-hop (two-level) structure (the first level is the episodic memory and the second level is the semantic memory).
Attention Visualization: As an intuitive way to show the model's dynamics, attention weight visualization is also used to understand how the Central-EXE controls the access to the two long-term memories (E-MemNN and S-MemNN). Figure 2 shows the episodic and semantic memory attention vectors at the last hop for each generated token. Firstly, our model generates a different but still correct response, as the customer wants a moderately priced restaurant in the west and does not care about the type of food. Secondly, the generated response has tokens from the vocabulary (e.g. "is" and "a"), the dialog history (e.g. "west" and "food") and KB information (e.g. "saint johns chop house" and "british"), indicating that our model learns to interact well with the two long-term memories through the two sentinels.
Human Evaluation: Following the methods in Eric and Manning (2017b) and Wu et al. (2019), we report human evaluation of the generated responses in Table 4. We adopt Mem2Seq as the baseline for human evaluation considering its good performance and code release. First, we randomly select 100 samples from the DSTC2 test set, then generate the corresponding responses using WMM2Seq and Mem2Seq, and finally ask two human subjects to judge the quality of the generated responses according to appropriateness and human-likeness on a scale from 1 to 5. As shown in Table 4, WMM2Seq outperforms Mem2Seq in both measures, which is consistent with the automatic evaluation. More details about human evaluation are reported in the supplementary material.

Conclusion
We leverage the knowledge from the psychological studies and propose our WMM2Seq for dialog response generation. First, the storage separation of the dialog history and KB information is very important and we explore two context-sensitive perceptual processes for the word-level representations of the dialog history. Second, working memory is adopted to interact with the long-term memories and then generate the responses. Finally, the improved performance on two task-oriented datasets demonstrates the contributions from the separated storage and the reasoning ability of working memory. Our future work will focus on how to transfer the long-term memory across different tasks.