Memory Graph Networks for Explainable Memory-grounded Question Answering

We introduce Episodic Memory QA, the task of answering personal user questions grounded on a memory graph (MG), in which episodic memories and related entity nodes are connected via relational edges. We create a new benchmark dataset by first generating synthetic memory graphs with simulated attributes, and then composing 100K QA pairs for the generated MGs with bootstrapped scripts. To address the unique challenges of the proposed task, we propose Memory Graph Networks (MGN), a novel extension of memory networks that enables dynamic expansion of memory slots through graph traversals, and is thus able to answer queries that require contexts from multiple linked episodes and external knowledge. We then propose the Episodic Memory QA Net with multiple module networks to effectively handle various question types. Empirical results show improvement over QA baselines in top-k answer prediction accuracy on the proposed task. The proposed model also generates a graph walk path and attention vectors for each predicted answer, providing a natural way to explain its QA reasoning.


Introduction
The task of question answering (QA) has been extensively studied, where many of the existing applications and datasets have focused on fact retrieval from a large-scale knowledge graph (KG) (Bordes et al., 2015), or on machine reading comprehension (MRC) approaches given unstructured text (Rajpurkar et al., 2018). In this work, we introduce a new task and dataset for Episodic Memory QA, in which the model answers personal and retrospective questions based on memory graphs (MG), where each episodic memory and its related entities (e.g. knowledge graph (KG) entities, participants, ...) are represented as nodes connected via corresponding edges (Figure 1). Examples of such queries include "Where did we go after we had brunch with Jon?", "How many times did I go to jazz concerts last year?", etc. For Episodic Memory QA, a machine has to understand the contexts of a question and navigate multiple MG episode nodes as well as KG nodes to gather comprehensive information matching the query requirements.

Figure 1: Illustration of Episodic Memory QA with user queries and memory graphs (MG) with knowledge graph (KG) entities. Relevant memory nodes are provided as initial memory slots via graph search lookup. The Memory Graph Network walks from the initial nodes to attend to relevant contexts and expands the memory slots when necessary. The main QA model takes these graph traversal paths and expanded memory slots as input, and predicts correct answers via multiple module networks (e.g. COUNT, CHOOSE, etc.).
While the ability to query a personal database could lead to many potential applications, previous work in this domain (Jiang et al., 2018) is limited due to the lack of a large-scale dataset and a unique set of challenges unseen in other tasks. For example, we observe that 1) Memory QA queries often include ambiguous and incomplete descriptions of the reference memory (as opposed to many conventional fact QAs with unambiguous mentions, e.g. "Who painted the Mona Lisa?"), hence requiring extensive candidate memory generation. Another challenge we observe is the case where 2) the target memory is only indirectly linked to the reference memory or entities (e.g. "Where did we go after brunch?"), which makes conventional information retrieval (IR) approaches for generating answer candidates ineffective. In addition, 3) queries are not confined to retrieval tasks, but include various types of questions such as counting, set comparison, etc., many of which remain unsolved or unconsidered in many QA tasks.

Figure 2: Overall architecture of the Episodic Memory QA Network. For an input query q, candidate memory nodes m = {m^(k)} are provided as input memory slots for the Memory QA Network. The Memory Graph Network then traverses the memory graph to expand the initial memory slots and activate other relevant entity and memory nodes based on the input queries. The Answer Module Networks execute the predicted neural programs to decode answers given the memory graph network outputs.
To this end, we propose a new model called the Memory Graph Network (MGN) to address the specific challenges stated above that come with Memory QA. While Memory Networks have successfully been used in QA applications, typical limitations are that memory slots are limited to the fixed number of slots, often in sentence or bag-of-symbols forms. MGN extends the popular memory networks by storing graph nodes as memory slots and by allowing the network to dynamically expand memory slots through graph traversals. We then implement the main Episodic Memory QA Network with multiple module networks such as CHOOSE, COUNT, etc., to effectively handle various question types not easily handled via graph networks.
To bootstrap a large-scale dataset collection for Episodic Memory QA, we first build a synthetic memory graph generator, which creates multiple episodic memory graph nodes connected with real entities (e.g. locations, events, public entities) appearing in common-fact KGs. By synthetically generating a realistic memory graph, we avoid the need to infer memory graphs from other structured data (e.g. photo albums), which are often limited in size. We then generate 100K QA pairs for the memory nodes with templates composed by human annotators, combined with 1K manual paraphrasing steps. More details on dataset collection are provided in Section 3.


Method

Figure 2 illustrates the overall architecture and the model components that make up the Episodic Memory QA Net. We examine each module in detail and provide the rationale for its formulation in the following sections.

Input Module (Section 2.1): For a given query q, its relevant memory nodes m = {m^(k)}_{k=1}^{K} for slot size K are given as initial memory slots. At test time, relevant memory nodes can be retrieved from a graph search engine that measures textual similarity (e.g. n-gram TF-IDF) between their connected node contexts and the query. The Query Encoder then encodes the input query with a language model, and the Memory Encoder encodes each memory slot for both the structural and semantic properties of each memory.

Memory Graph Networks (MGN) (Section 2.2): Much previous work on QA or MRC systems uses memory networks to evaluate multiple answer candidates with transitive reasoning, and typically stores all potentially relevant raw sentences or bags-of-symbols as memory slots. However, a naive increase of memory slot size or retention-based sequential update of memory slots often increases the search space for answer candidates, leading to poor precision, especially for the Episodic Memory QA task.
To overcome this issue, with MGN we store memory graph nodes as initial memory slots, where additional contexts and answer candidates can be succinctly expanded and reached via graph traversals. For each (q, m^(k)) pair, MGN predicts optimal memory slot expansion steps:

p = {[p_{e,t}, p_{n,t}]}_{t=1}^{T}

for edge paths p_e and corresponding node paths p_n (Figure 3).

QA Modules (Section 2.3, 2.4): An estimated answer â = QA(m, q) is predicted given a query and the MGN graph path output from the initial memory slots. Specifically, the model outputs a module program {u^(k)} for several module networks (e.g. CHOOSE, COUNT, ...) via a module selector, each of which produces an answer vector. The aggregated result of the module network outputs determines the top-k answers.

Input Encoding
Query encoder: We represent each textual query with an attention-based Bi-LSTM language model (Conneau et al., 2017) with GloVe (Pennington et al., 2014) distributed word embeddings trained on Wikipedia and the Gigaword corpus, with a total of 6B tokens.

Memory encoder: We represent each memory node based on both its structural features (graph embeddings) and contextual multi-modal features from its neighboring nodes (e.g. attribute values).
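The attention-based query encoding can be sketched as follows. This is a minimal stand-in, not the paper's model: random vectors replace GloVe embeddings, a simple self-attentive pooling replaces the Bi-LSTM, and all names (`VOCAB`, `w_att`, dimension 8) are illustrative assumptions.

```python
import numpy as np

# Toy sketch of attention-pooled query encoding (hypothetical shapes; the
# paper uses a Bi-LSTM over pretrained GloVe vectors, replaced here with
# random embeddings for illustration).
rng = np.random.default_rng(0)
VOCAB = {w: i for i, w in enumerate(
    ["where", "did", "we", "go", "after", "brunch", "with", "jon"])}
E = rng.normal(size=(len(VOCAB), 8))   # stand-in for GloVe embeddings
w_att = rng.normal(size=8)             # attention scoring vector

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def encode_query(tokens):
    """Self-attentive pooling over token embeddings -> query vector q."""
    H = np.stack([E[VOCAB[t]] for t in tokens])   # (T, d)
    alpha = softmax(H @ w_att)                    # attention over tokens
    return alpha @ H                              # (d,)

q = encode_query(["where", "did", "we", "go", "after", "brunch"])
```

The pooled vector q lives in the same space as the token embeddings, so downstream attention against memory-node representations is a simple dot product.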
Structural contexts of each memory node (m_s) are encoded via graph embedding projection approaches (Bordes et al., 2013), in which nodes with similar relation connectivity are mapped closer in the embedding space. The model for obtaining embeddings from an MG (composed of subject-relation-object (s, r, o) triples) can be formulated as follows:

score(e(s), e_r(r), e(o)) ≈ I_r(s, o)  ∀(s, r, o)

where I_r is an indicator function of a known relation r for two entities (s, o) (1: valid relation, 0: unknown relation), e is a function that extracts embeddings for entities, e_r extracts embeddings for relations, and score(·) is a function (e.g. a multi-layer perceptron) that produces the likelihood of a valid triple.
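A minimal sketch of this triple scorer, assuming a small two-layer MLP for score(·) and binary cross-entropy against the indicator I_r; the entity/relation names, dimensions, and weight shapes are all illustrative, not the paper's actual parameterization.

```python
import numpy as np

# Sketch of the (s, r, o) triple scorer used to train structural memory
# embeddings m_s. `score` and `I_r` follow the text; the MLP is an
# illustrative assumption.
rng = np.random.default_rng(1)
d = 8
ent_emb = {"mem_1": rng.normal(size=d), "cafe_A": rng.normal(size=d)}
rel_emb = {"location": rng.normal(size=d)}
W1 = rng.normal(size=(3 * d, d))
W2 = rng.normal(size=d)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def score(s, r, o):
    """Likelihood that (s, r, o) is a valid triple (two-layer MLP)."""
    x = np.concatenate([ent_emb[s], rel_emb[r], ent_emb[o]])
    return sigmoid(np.tanh(x @ W1) @ W2)

def triple_loss(s, r, o, I_r):
    """Binary cross-entropy against the validity indicator I_r in {0, 1}."""
    p = score(s, r, o)
    return -(I_r * np.log(p) + (1 - I_r) * np.log(1 - p))
```

Training against both valid triples (I_r = 1) and sampled unknown pairs (I_r = 0) pushes nodes with similar connectivity toward nearby embeddings, which is the property the memory encoder relies on.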
For the contextual representation of memories (m_c), we compute an attention-weighted sum of the textual representations of neighboring nodes and attributes (connected via r_j ∈ R), using the same language model as the query encoder:

m_c = Σ_{r_j ∈ R} γ_j · LM(n_j)

Note that the query attention vector γ attenuates or amplifies each attribute of the memory based on the query vector, to better account for query-memory compatibility. We then concatenate the structural features with the semantic contextual features to obtain the final memory representation (m = [m_s; m_c]).
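A minimal sketch of this query-attended pooling, assuming γ is a softmax over query-neighbor dot products (one plausible choice; the paper does not pin down the scoring function here) and using random vectors in place of language-model outputs.

```python
import numpy as np

# Sketch of the contextual memory representation m_c: the attention vector
# gamma re-weights each neighboring attribute node by its compatibility
# with the query (toy vectors; shapes and scoring are illustrative).
rng = np.random.default_rng(3)

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def memory_context(q, neighbor_vecs):
    """m_c = sum_j gamma_j * n_j, gamma = softmax of query dot products."""
    gamma = softmax(neighbor_vecs @ q)     # (num_neighbors,)
    return gamma @ neighbor_vecs           # (d,)

q = rng.normal(size=8)
neighbors = rng.normal(size=(4, 8))        # e.g. location, participants, ...
m_c = memory_context(q, neighbors)
```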

Memory Graph Network
Inspired by the recently introduced graph traversal networks (Moon et al., 2019), which output discrete graph operations given input contexts, we formulate our MGN as follows. Given a set of initial memory slots (m) and a query (q), the MGN model outputs a sequence of walk steps (p) within the MG to attend to relevant nodes or expand the initial memory slots (Figure 3):

p = MGN(m, q)

Specifically, we define an attention-based graph decoder model that prunes unattended paths, which effectively reduces the search space for memory expansion. At each decoding step t (bias terms for gates are omitted for simplicity of notation), the decoder produces a context vector z_t from the attention over graph relations:

z_t = Σ_{k=1}^{|R|} α_{t,k} · r_k

where α_t ∈ R^{|R|} is an attention vector over the relation space, r_k is the embedding of relation k, and z_t is the resulting node context vector after walking from the previous node on an attended path.
The graph decoder is trained with ground-truth walk paths by computing the combined loss L_walk(m, q, p) = Σ_{i,t} (L_e + L_n) between predicted paths and each of {p_e, p_n}, respectively (L_e: loss for edge paths, L_n: loss for node paths). At test time, we expand the memory slots by activating the nodes along the optimal paths, scoring each node v by the sum of its relevance score (left term) and its soft-attention-based output path score (right term) at each decoding step:

s(v) = s_rel(v, q) + s_path(v)
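The test-time walk-and-expand procedure can be sketched as a greedy traversal. This is a deliberate simplification: the learned soft-attention decoder is replaced by a dot product between the query vector and relation embeddings, and the graph, relation names, and node names are all toy assumptions.

```python
import numpy as np

# Greedy graph-walk sketch of MGN memory expansion (illustrative: the real
# model scores relations with a trained attention decoder; here we rank
# outgoing edges by relation-query dot products and follow the best edge).
rng = np.random.default_rng(2)
d = 8
rel_emb = {"attended_with": rng.normal(size=d),
           "next_event": rng.normal(size=d),
           "location": rng.normal(size=d)}
graph = {  # node -> list of (relation, neighbor) edges
    "mem_brunch": [("attended_with", "Jon"), ("next_event", "mem_concert")],
    "mem_concert": [("location", "jazz_club")],
}

def walk(q, start, max_steps=2):
    """Follow the best-scoring relation from `start` for up to `max_steps`,
    returning the (edge, node) path used to expand the memory slots."""
    path, node = [], start
    for _ in range(max_steps):
        edges = graph.get(node, [])
        if not edges:
            break
        scores = [rel_emb[r] @ q for r, _ in edges]
        r, nxt = edges[int(np.argmax(scores))]
        path.append((r, nxt))
        node = nxt
    return path

toy_query = rel_emb["next_event"] + rel_emb["location"]  # "where after ..."
expanded = walk(toy_query, "mem_brunch")
```

The nodes on the returned path are exactly the slots that get activated, keeping the initial slot set small while still reaching indirectly linked memories.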

Module Networks
MGN outputs are then passed to module networks for the final stage of answer prediction. We extend previous work on module networks (Kottur et al., 2018), often used in VQA tasks, to accommodate graph nodes output by MGN. We first formulate the module selector, which outputs the module label probability {u^(k)} given the input contexts for each memory node, trained with a cross-entropy loss L_module:

u^(k) = softmax(W_u [q; m^(k)])

We then define the memory attention to attenuate or amplify all activated memory nodes based on their compatibility with the query:

α = softmax_k(m^(k) · W_α q)

For this work, we propose the following five modules: CHOOSE, COUNT, CONFIRM, SET OR, and SET AND, hence u^(k) ∈ R^5. Note that the formulation can be extended to an auto-regressive decoder in case sequential execution of modules is required.
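A minimal sketch of the module selector and memory attention, assuming bilinear/linear scoring forms as above; the weight matrices `W_u` and `W_a` are random stand-ins for trained parameters, and the dimensions are illustrative.

```python
import numpy as np

# Sketch of the module selector (probability over five modules) and the
# query-memory soft attention over activated nodes.
rng = np.random.default_rng(4)
MODULES = ["CHOOSE", "COUNT", "CONFIRM", "SET_OR", "SET_AND"]
d = 8

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

W_u = rng.normal(size=(len(MODULES), 2 * d))   # module selector weights
W_a = rng.normal(size=(d, d))                  # attention bilinear form

def select_module(q, m_pooled):
    """Probability u over the five module networks for this query."""
    return softmax(W_u @ np.concatenate([q, m_pooled]))

def memory_attention(q, M):
    """Soft attention over activated memory nodes M of shape (nodes, d)."""
    return softmax(M @ (W_a @ q))

q = rng.normal(size=d)
u = select_module(q, rng.normal(size=d))
alpha = memory_attention(q, rng.normal(size=(6, d)))
```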
The CHOOSE module outputs an answer space vector by assigning weighted-sum scores to nodes along the MGN soft-attention walk paths. End nodes on the most probable walk paths thus get the highest scores, and their node attribute values are considered as answer candidates. The COUNT module counts the query-compatible nodes among the activated nodes: a = W_K([α; max{α}; min{α}]). CONFIRM uses a similar approach to COUNT, except that it outputs a binary label indicating whether the memory nodes match the query condition: a = W_b([α; max{α}; min{α}]). The SET modules either combine or find the intersection among answer candidates by updating the answer vectors with a = max{W_s {a^(k)}} or a = min{W_s {a^(k)}}.
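The COUNT and CONFIRM heads can be sketched directly from their formulas: both read the memory-attention vector α together with its max/min summary statistics. The projection sizes and random weights below are illustrative assumptions.

```python
import numpy as np

# Sketch of the COUNT head a = W_K([alpha; max{alpha}; min{alpha}]) and
# the binary CONFIRM head a = W_b([alpha; max{alpha}; min{alpha}]).
rng = np.random.default_rng(5)
num_nodes, num_answers = 6, 10
W_K = rng.normal(size=(num_answers, num_nodes + 2))  # COUNT projection
W_b = rng.normal(size=(2, num_nodes + 2))            # CONFIRM (yes/no)

def head_features(alpha):
    """Concatenate attention with its max and min, per the module formulas."""
    return np.concatenate([alpha, [alpha.max()], [alpha.min()]])

def count_module(alpha):
    return W_K @ head_features(alpha)   # scores over count answers

def confirm_module(alpha):
    return W_b @ head_features(alpha)   # binary yes/no scores

alpha = np.array([0.5, 0.3, 0.1, 0.05, 0.03, 0.02])
```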

Answer Decoding
Answers from each module network (Section 2.3) are then aggregated as a weighted sum of answer vectors with the module probability (Eq. 6), guided by the memory attention (Eq. 7). Predicted answers are evaluated with a cross-entropy loss L_ans. We observe that the model performs better when the MGN component is pre-trained with ground-truth paths. We thus first train the MGN network with the same training split (without answer labels), and then train the entire model with the module networks, fully end-to-end supervised with L = L_walk + L_module + L_ans.
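The aggregation step can be sketched as follows: module answer vectors are combined by the module probabilities u, and the top-k entries of the aggregate become the predictions. The toy values below (a one-hot selector, a 4-way answer space) are for illustration only.

```python
import numpy as np

# Sketch of answer decoding: a module-probability-weighted sum of answer
# vectors, followed by a top-k readout over the aggregate.
def aggregate(u, answer_vecs):
    """u: (num_modules,); answer_vecs: (num_modules, answer_space)."""
    return u @ answer_vecs

def topk(scores, k=3):
    """Indices of the k highest-scoring answers."""
    return [int(i) for i in np.argsort(-scores)[:k]]

u = np.array([0.0, 1.0, 0.0, 0.0, 0.0])        # selector picked COUNT
answers = np.zeros((5, 4))
answers[1] = [0.1, 0.7, 0.2, 0.0]              # COUNT module's answer vector
agg = aggregate(u, answers)
```

With a one-hot u the aggregate reduces to the selected module's answer vector; with a soft u it interpolates between modules, which is what the end-to-end loss L = L_walk + L_module + L_ans trains jointly.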

Dataset
To empirically evaluate the proposed approach on the Episodic Memory QA task, we create a new dataset, MemQA, of 100K question and answer pairs composed over synthetically generated memory graphs. We specifically use the synthetically generated MG to avoid the need to infer memory graphs from other structured data (such as publicly available photo albums from Flickr, etc.), which are often limited in size and domain. We bootstrap our large-scale realistic memory graph dataset with the following procedure: first, we construct a synthetic social graph with a set number of artificial users, each with randomly generated interest embeddings. We then create a realistic memory graph by randomly choosing participants within the synthetic social graph, as well as activities and associated entities from a curated list (of locations, events, public entities, etc.), which is a subset of the common-fact Freebase knowledge base (Bast et al., 2014). Each generated memory node thus has connections to entities appearing in KGs, together comprising the memory graph. Finally, given a set of reference memory nodes and neighboring entities, we compose QA pairs using templates written by human annotators, combined with manual paraphrasing steps.
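The sampling procedure can be sketched as below. The user, activity, and location lists, field names, and graph size are all toy assumptions; the paper samples activities and entities from a curated Freebase subset rather than hand-written lists.

```python
import random

# Sketch of the synthetic memory-graph bootstrap: sample participants from
# a synthetic social graph and activities/entities from a curated list.
random.seed(0)
USERS = [f"user_{i}" for i in range(20)]
ACTIVITIES = ["brunch", "jazz_concert", "hiking"]     # stand-in entity list
LOCATIONS = ["cafe_A", "blue_note", "trail_B"]

def sample_memory(mem_id):
    """One episodic memory node with entity-valued attributes."""
    return {
        "id": f"mem_{mem_id}",
        "activity": random.choice(ACTIVITIES),
        "location": random.choice(LOCATIONS),
        "participants": random.sample(USERS, k=random.randint(1, 3)),
    }

memory_graph = [sample_memory(i) for i in range(100)]
```

Each sampled node links to its entity values (location, participants, ...), so the resulting graph supports both single-hop lookups and the multi-hop queries the MGN is designed for.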

Empirical Evaluation
Task: Given a query and a set of initial memory graph nodes via graph search, we evaluate the model on the open-ended question answering prediction task.

Baselines
We choose as baselines the following QA systems (see Section 5 for details), modified to allow comparisons on our task:

• MemN2N (Sukhbaatar et al., 2016): uses end-to-end memory networks with a static set of initial memory slots. Each memory slot is represented with a bag-of-symbols for surrounding attributes and nodes. We use a single softmax layer for answer classification. Memory slot size is tuned as a hyperparameter.
• MemexNet (Jiang et al., 2018): uses textual representations for multi-modal attributes, and frames the task as a classification problem via text kernel match approaches to predict approximate answers. Since our dataset does not contain images, we omit the CNN representations.
We also consider several configurations of our proposed approach to examine the contributions of each model component.

Main Results: Table 2 shows the results of the top-k predictions of the proposed model and the baselines. It can be seen that the proposed Memory QA model outperforms the other QA baselines in precision at all k.
Specifically, with the MGN walker model, the MemQA model learns to condition its walk path on query contexts and to attend to and expand memory nodes, thus outperforming the baseline models that simply rely on their initial memory slots, which are typically large in size to maintain reasonable recall. The node expansion via MGN allows the model to keep the initial memory slots small (10) and expand only when necessary (e.g. for queries that require references to related memories, "... where did I go after ..."), thus improving precision. Note that memory slot sizes for the baselines are tuned for their performance on the validation set.
In addition, it can be seen that the neural module components (G+N and G+N+E) greatly outperform the ablation model (G) and the baselines by aggregating answers with the modules specifically designed for various types of questions. These neural modules allow the model to answer questions that are typically hard to answer (e.g. count, set comparison, etc.) by explicitly reducing the answer space accordingly.
Note also that the graph embeddings (G+N+E) improve the performance over the ablation model that does not use structural contexts (G+N), indicating that the model learns to better leverage knowledge graph contexts to answer questions.
Error Analysis: Table 3 shows example outputs from the proposed model, given the input question and memory graph nodes. It can be seen that the model is able to predict answers by combining answer contexts from multiple components (walk path, node attention, neural modules, etc.). In general, the MGN walker successfully explores the respective single-hop or multi-hop relations within the memory graph, while keeping the initial memory slots small. The activated nodes from graph traversals are then used as input for each neural module, the aggregated results of which are the final top-k answer predictions. There are some cases where the final answer prediction is incorrect even though the walk path is correctly predicted. This is due to inaccurate prediction of the memory attention vector given a query and initial memory slots, which requires comprehensive understanding of the surrounding knowledge nodes in the context of the query.

Related Work
Memory Networks: Weston et al. (2014) and Sukhbaatar et al. (2016) propose Memory Networks with explicit memory slots for storing auxiliary inputs, now widely used in many QA and MRC tasks for their transitive reasoning capability. Traditional limitations are that the memory slots for storing answer candidates are fixed in size, and naively increasing the slot size typically decreases precision. Several works extend this line of research, for example by allowing dynamic updates of memory slots given streams of input (Kumar et al., 2016; Tran et al., 2016; Xu et al., 2019), reinforcement-learning-based retention control (Jung et al., 2018), etc. By storing graph nodes as memory slots and allowing slot expansion via graph traversals, our proposed Memory Graph Networks (MGN) effectively bypass these issues.

Structured QA systems: often answer questions based on large-scale common-fact knowledge graphs (Bordes et al., 2015; tau Yih et al., 2015; Xu et al., 2016; Jain, 2016; Yin et al., 2016; Dubey et al., 2018), typically via an entity linking system and a QA model for predicting graph operations through template matching approaches, etc. Our approach is inspired by this line of work, and we utilize the proposed module networks and the MGN walker model to address the challenges unique to Episodic Memory QA.
Machine Reading Comprehension (MRC) systems: aim at predicting answers given evidence documents, typically a few paragraphs in length (Seo et al., 2017; Rajpurkar et al., 2016, 2018; Cao et al., 2019; tau Yih et al., 2015). Several recent works address multi-hop reasoning within multiple documents (Yang et al., 2018; Welbl et al., 2018; Bauer et al., 2018; Clark et al., 2018) or conversational settings (Choi et al., 2018; Reddy et al., 2018), which often require complex reasoning. Unlike MRC systems, which typically rely on language understanding alone, we utilize the structural properties of the memory graph to traverse and highlight the specific attributes or nodes required to answer questions.
Visual QA systems: aim to answer questions based on contexts from images (Antol et al., 2015). Recently, neural modules (Kottur et al., 2018) have been proposed to address challenges specific to VQA, such as visual co-reference resolution. Our work extends the idea of neural modules to Episodic Memory QA by implementing modules that can take graph paths as input for answer decoding. Jiang et al. (2018) propose Visual Memex QA, which tackles a similar problem domain given a dataset collected around photo albums. Instead of relying on the meta information and multi-modal content of a photo album, our work explicitly utilizes semantic and structural contexts from memory and knowledge graphs. Another recent line of work for VQA includes graph-based visual learning (Hudson and Manning, 2019), which aims to represent each image with a sub-graph of visual contexts. While graph-based VQA operates on a graph constructed from a single scene, Episodic Memory QA operates on a large-scale memory graph with knowledge nodes. We therefore propose memory graph networks to handle ambiguous candidate nodes, a main contribution of the proposed work.

Conclusions
We introduce Episodic Memory QA, the task of answering personal user questions grounded on a memory graph (MG). The dataset is generated from synthetic memory graphs with simulated attributes, accompanied by 100K QA pairs composed via bootstrapped scripts and manual annotations. Several novel model components are proposed for the unique challenges of Episodic Memory QA: 1) Memory Graph Networks (MGN) extend conventional memory networks by enabling dynamic expansion of memory slots through graph traversals, which also naturally allows for explainable predictions. 2) Several neural module networks are proposed for the task, each of which takes queries and memory graphs as input to infer answers. 3) The main Episodic Memory QA Net aggregates answer predictions from each neural module to generate the final answer candidates. The empirical results demonstrate the efficacy of the proposed model for Memory QA reasoning.