Episodic Memory Reader: Learning What to Remember for Question Answering from Streaming Data

We consider a novel question answering (QA) task where the machine needs to read from large streaming data (long documents or videos) without knowing when the questions will be given, which is difficult to solve with existing QA methods due to their lack of scalability. To tackle this problem, we propose a novel end-to-end deep network model for reading comprehension, which we refer to as Episodic Memory Reader (EMR), that sequentially reads the input contexts into an external memory while replacing memories that are less important for answering unseen questions. Specifically, we train an RL agent to replace a memory entry when the memory is full, in order to maximize its QA accuracy at a future timepoint, while encoding the external memory using either the GRU or the Transformer architecture to learn representations that consider the relative importance between the memory entries. We validate our model on a synthetic dataset (bAbI) as well as real-world large-scale textual QA (TriviaQA) and video QA (TVQA) datasets, on which it achieves significant improvements over rule-based memory scheduling policies and an RL-based baseline that independently learns the query-specific importance of each memory.


Introduction
The question answering (QA) problem is one of the most important challenges in Natural Language Understanding (NLU). In recent years, there has been drastic progress on the topic, owing to the success of deep learning based QA models (Sukhbaatar et al., 2015; Seo et al., 2016; Xiong et al., 2016; Hu et al., 2018; Devlin et al., 2018). On certain tasks such as machine reading comprehension (MRC), where the problem is to find the span of the answer within a given paragraph (Rajpurkar et al., 2016), deep learning based QA models have even surpassed human-level performance.
Despite such impressive achievements, it is still challenging to model question answering with document-level context (Joshi et al., 2017), where the context may include a long document with a large number of paragraphs, due to problems such as difficulty in modeling long-term dependency and computational cost. To overcome such scalability problems, researchers have proposed pipelining or confidence based selection methods that combine paragraph-level models to obtain a document-level model (Joshi et al., 2017;Clark and Gardner, 2018;Wang et al., 2018b). Yet, such models are applicable only when questions are given beforehand and all sentences in the document can be stored in memory.
However, in realistic settings, the amount of context may be too large to fit into the system memory. We may consider query-based context selection methods such as ones proposed in  and , but in many cases, the question may not be given when reading in the context, and thus it would be difficult to select out the context based on the question. For example, a conversation agent may need to answer a question after numerous conversations in a long-term time period, and a video QA model may need to watch an entire movie, or a sports game, or days of streaming videos from security cameras before answering a question. In such cases, existing QA models will fail to solve the problem due to memory limitation.
In this paper, we target the novel problem of question answering with streaming data as context, where the size of the context could be significantly larger than what the memory can accommodate (see Figure 1). In such a case, the model needs to carefully manage what to remember from this streaming data such that the memory contains the most informative context instances in order to answer an unseen question in the future. We pose this memory management problem as a learning problem and train both the memory representation and the scheduling agent using reinforcement learning.
Figure 1: Concept: We consider a novel problem of learning from streaming data, where the QA model may need to answer a question that is given after reading in an unlimited amount of context. To solve this problem, our Episodic Memory Reader (EMR) learns to retain the most important context vectors in an external memory, while replacing the memory entries in order to maximize its accuracy on an unseen question given at a future timestep.
Specifically, we propose to train the memory module itself using reinforcement learning to replace the most uninformative memory entry in order to maximize its reward on a given task. However, this is a seemingly ill-posed problem since for most of the time, the scheduling should be performed without knowing which question will arrive next. To tackle this challenge, we implement the policy network and the value network that learn not only relation between sentences and query but also relative importance among the sentences in order to maximize its question answering accuracy at a future timepoint. We refer to this network as Episodic Memory Reader (EMR). EMR can perform selective memorization to keep a compact set of important context that will be useful for future tasks in lifelong learning scenarios.
We validate our proposed memory network on a large-scale textual QA task (TriviaQA) and a video question answering task (TVQA), where the context is too large to fit into the external memory, against rule-based scheduling policies and an RL-based scheduling method that does not consider the relative importance between memories. The results show that our model significantly outperforms the baselines, due to its ability to preserve the most important pieces of information from the streaming data.
Our contribution is threefold:
• We consider a novel task of learning to remember important instances from streaming data for question answering task, where the size of the memory is significantly smaller than the length of the data stream.
• We propose a novel end-to-end neural architecture for QA from streaming data, where we train a scheduling agent via reinforcement learning to store the most important memory cell for solving future QA tasks in the global external memory.
• We validate the efficacy of our model on real-world, large-scale text and video QA datasets, on which it obtains significantly improved performances over baseline methods.

Related Work
Question-answering There has been rapid progress in question answering (QA) in recent years, thanks to advances in deep learning as well as the availability of large-scale datasets. One of the most popular large-scale QA datasets is the Stanford Question Answering Dataset (SQuAD; Rajpurkar et al., 2016), which consists of more than 100K question-answer pairs. Unlike Richardson et al. (2013) and Hermann et al. (2015), which provide multiple-choice QA pairs, SQuAD provides the exact locations of the answers and requires models to predict them. On this span prediction task, attentional models (Pan et al., 2017; Cui et al., 2017; Hu et al., 2018) have achieved impressive performance, with Bi-Directional Attention Flow (BiDAF; Seo et al., 2016), which uses a bi-directional attention mechanism for the context and query, being one of the best performing models. TriviaQA (Joshi et al., 2017) is another large-scale QA dataset that includes 950K QA pairs. Since each document in TriviaQA is much longer than those in SQuAD, with an average of 3K sentences per document, existing span prediction models (Joshi et al., 2017) fail to work due to memory limitation, and simply resort to document truncation. Video question answering (Tapaswi et al., 2016; Lei et al., 2018), where video frames are given as context for QA, is another important topic where scalability is an issue. Several models (Na et al., 2017; Wang et al., 2018a) propose to solve video QA using attention and memory-augmented networks, to perform composite reasoning over both videos and texts; however, they only focus on short videos. Most existing work on QA focuses on small-size problems due to memory limitation. Our work, on the other hand, considers a challenging scenario where the context is orders of magnitude larger than the memory.
Context selection A few recent models propose to select minimal context from the given document when answering questions for scalability, rather than using the full context. One such model uses a context selector that generates attention over the context vectors, in order to achieve scalability and robustness against adversarial inputs. Other models follow a similar approach, but use REINFORCE (Williams, 1992) instead of linear classifiers. While these context selection methods share our motivation of achieving scalability and selecting the most informative pieces of information to solve the QA task, our problem setting is completely different from theirs, since we consider the much more challenging problem of learning from streaming data without knowing when the question will be given, where the size of the context is much larger than the memory and the question is unseen when training the selection module.
Memory-augmented neural networks Our episodic memory reader is essentially a memory-augmented neural network (MANN) (Sukhbaatar et al., 2015; Graves et al., 2014; Xiong et al., 2016) with an RL-based scheduler. While most existing work on MANNs assumes that the memory is sufficiently large to hold all the data instances, a few works consider memory scheduling for better scalability. Gülçehre et al. (2016) propose to train an addressing agent using reinforcement learning in order to dynamically decide which memory to overwrite based on the query. This query-specific importance is similar to our motivation, but in our case the query is given after reading in all the context and is thus unusable for scheduling, and we perform hard replacement instead of overwriting. The Differentiable Neural Computer (DNC) extends the NTM to address the issue by introducing a temporal link matrix, replacing the least used memory when the memory is full. However, this method is rule-based and thus cannot maximize the performance on a given task.

Learning What to Remember from Streaming Data
We now describe how to solve question answering tasks with streaming data as context. In a more general sense, this is a problem of learning from a long data stream that contains a large portion of unimportant, noisy data (e.g. routine greetings in dialogs, uninformative video frames) with limited memory. The data stream is episodic, where an unlimited amount of data instances may arrive at one time interval and become inaccessible afterward. Additionally, we consider that it is not possible for the model to know in advance what tasks (a question in the case of the QA problem) will be given at which timestep in the future (see Figure 2 for more details). To solve this problem, the model needs to identify important data instances from the data stream and store them in external memory. Formally, given a data stream (e.g. sentences or images) $X = \{x^{(1)}, \cdots, x^{(T)}\}$ as input, the model should learn a function $F : X \rightarrow M$ that maps it to the set of memory entries $M = \{m_1, \cdots, m_N\}$, where $T \gg N$. How can we then learn such a function that maximizes the performance on unseen future tasks without knowing what problems will be given at what time? We formulate this problem as a reinforcement learning problem to train a memory scheduling agent.

Model Overview
We now describe our model, Episodic Memory Reader (EMR), which solves the previously described problem. Our model has three components: (1) an agent $A$ based on EMR, (2) an external memory $M = [m_1, \cdots, m_N]$, and (3) a target network that solves the target task given the memory and a query. Figure 2 shows an overview of our model. Given a sequence of data instances $X = \{x^{(1)}, \cdots, x^{(T)}\}$ that streams through the system, the agent learns to retain the most useful subset in the memory by interacting with the external memory, which encodes the relative importance of each memory entry. When $t \le N$, the agent simply writes $x^{(t)}$ to $m^{(t)}$. However, when $t > N$, where the memory is full, it selects an existing memory entry to delete. Specifically, it outputs an action based on the policy $\pi(m_i^{(t)} \mid M^{(t)}, x^{(t)}; \theta)$, where the action is the selection of the memory entry $m_i^{(t)}$ to delete and the state is the concatenation of the memory and the data instance. Thus either $x^{(t)}$ or one of the memory entries will be deleted. To maximize the performance on the future QA task, the agent should replace the least important memory entry. When the agent encounters the task $\mathcal{T}$ at timestep $T + 1$, it leverages both the memory at timestep $T$, $M^{(T)}$, and the task information (e.g. the question) to solve the task. For each action, the environment (the question answering module) provides the reward $R^{(t)}$, which is given either as the F1-score or the accuracy.
Figure 2: The overview of our Episodic Memory Reader (EMR). EMR learns the policy and the value network to select a memory entry to replace, in order to maximize the reward, defined as the performance on future QA tasks (F1-score, accuracy).
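The read-and-replace loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names `stream_into_memory` and `random_policy` and the `choose_slot` hook are hypothetical, raw items stand in for encoded memory vectors, and a surviving new instance is simply appended at the end of the memory.

```python
import random

def stream_into_memory(stream, capacity, choose_slot):
    """Read `stream` one item at a time while keeping at most `capacity` items.

    `choose_slot(memory, x)` is a hypothetical policy hook. It returns an index
    in range(len(memory) + 1); the last index means "discard the new item x".
    """
    memory = []
    for x in stream:
        if len(memory) < capacity:
            memory.append(x)           # t <= N: write x into a free slot
            continue
        candidates = memory + [x]      # state: memory entries plus the new instance
        drop = choose_slot(memory, x)  # action: which candidate to delete
        del candidates[drop]
        memory = candidates            # exactly `capacity` items survive
    return memory

def random_policy(memory, x, rng=random.Random(0)):
    # Placeholder for a learned policy: pick any of the N + 1 candidates.
    return rng.randrange(len(memory) + 1)

kept = stream_into_memory(range(100), capacity=5, choose_slot=random_policy)
print(len(kept))
```

A learned agent would replace `random_policy` with a function that scores the candidates and returns the index of the least important one.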

Episodic Memory Reader
Episodic Memory Reader (EMR) is composed of three components: (1) Data Encoder that encodes each data instance into memory vector representation, (2) Memory Encoder that generates replacement probability for each memory entry, and the (3) Value Network that estimates the value of memory as a whole. In some cases, we may use policy gradient methods, in which case the value network becomes unnecessary.

Data Encoder
The data instance $x^{(t)}$ that arrives at time $t$ can be in any data format, and thus we transform it into a memory vector representation $m_i^{(t)} \in \mathbb{R}^d$ to be used by the agent using an encoder: $m_i^{(t)} = \psi(x^{(t)})$, where $\psi(\cdot)$ is the data encoder, which could be any neural architecture suited to the type of the input data. For example, we could use an RNN if $x^{(t)}$ is composed of sequential data (e.g. a sentence composed of words $x^{(t)} = \{w_1, w_2, w_3, \cdots, w_s\}$) or a CNN if $x^{(t)}$ is an image.

Memory Encoder
Using the memory vector representations $M^{(t)} = [m_1^{(t)}, \cdots, m_N^{(t)}]$ generated from the data encoder, the memory encoder outputs a probability for each memory entry by considering the importance between the entries, and then replaces the most unimportant entry. This component corresponds to the policy network of the actor-critic method. We now describe our EMR models.
EMR-Independent Since there is no existing work for our novel problem setting, as a baseline we first consider a memory encoder that only captures the importance of each memory entry independently, relative to the new data instance, which we refer to as EMR-Independent. This scheduling mechanism is adopted from the Dynamic Least Recently Used (LRU) addressing introduced in Gülçehre et al. (2016), but differs from LRU in that it replaces the memory entry rather than overwriting it, and is trained without a query to maximize the performance for unseen future queries. EMR-Independent outputs the importance of each memory entry by comparing it with an embedding of the new data instance. To compute the overall importance of each memory entry, as done in Gülçehre et al. (2016), we compute the exponential moving average $v_i$. Then, we compute the final replacing probability with the LRU factor $\gamma^{(t)}$. This computation involves $W_\gamma \in \mathbb{R}^{1 \times d}$ and $b_\gamma \in \mathbb{R}$, the sigmoid function $\sigma(z)$, and the softmax function $\mathrm{softmax}(z_i)$, and it yields the policy of the agent, $\pi(m_i^{(t)} \mid M^{(t)}, x^{(t)}; \theta)$.
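As a rough illustration of LRU-style scoring in the spirit of Gülçehre et al. (2016), the sketch below assumes dot-product similarity between each memory entry and the new instance embedding, a sigmoid gate for the LRU factor, and an exponential moving average with a fixed decay. These specific forms, the `decay` value, and the function name `lru_style_policy` are assumptions made for illustration and should not be read as the exact equations of EMR-Independent.

```python
import math

def softmax(zs):
    m = max(zs)                                  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def lru_style_policy(memory, e, v_prev, w_gamma, b_gamma, decay=0.1):
    """Return (replacement probabilities, updated moving average).

    memory: list of N memory vectors; e: embedding of the new instance;
    v_prev: moving average of past importances, one value per entry.
    """
    alpha = softmax([dot(m, e) for m in memory])      # importance w.r.t. the new instance
    gamma = sigmoid(dot(w_gamma, e) + b_gamma)        # LRU factor in (0, 1)
    pi = softmax([a - gamma * v for a, v in zip(alpha, v_prev)])
    v_new = [decay * v + (1.0 - decay) * a for v, a in zip(v_prev, alpha)]
    return pi, v_new

pi, v = lru_style_policy([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0],
                         [0.0, 0.0], [0.5, 0.5], 0.0)
print(pi, v)
```

Note that each entry is scored only against the new instance and its own history, which is the limitation the biGRU and Transformer variants are designed to address.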
EMR-biGRU A major drawback of EMR-Independent is that the evaluation of each memory entry depends only on the input $x^{(t)}$. In other words, the importance is computed between each memory entry and the new data instance regardless of the other entries in the memory. This scheme cannot model the relative importance of each memory entry with respect to the other memory entries, which matters more when deciding on the least important memory. One way to consider relative relationships between memory entries is to encode them using a bidirectional GRU (biGRU), where the resulting representation $h_i^{(t)}$ of each entry is a concatenation of the features from the two directions. Thus, it learns the general importance of each memory entry in relation to its neighbors rather than considering each entry independently, which is useful when selecting the most important entry among highly similar data instances (e.g. video frames). However, the model may not effectively capture long-range relationships between memory entries in far-away slots, due to the inherent limitation of RNNs.

EMR-Transformer
To overcome such suboptimality of RNN-based modeling, we further adopt the self-attention mechanism of Vaswani et al. (2017). The query $Q^{(t)}$, key $K^{(t)}$, and value $V^{(t)}$ are the components of the self-attention used to generate the relative importance of the entries; they are computed by a linear layer that takes $m^{(t)}$ with the position encoding proposed in Vaswani et al. (2017) as input. With multi-headed attention, each component is projected to a multi-dimensional space whose dimensions are expressed in terms of the size of the memory $m$ and the number of attention heads $h$. Using these, we formulate the retrieved output using self-attention and memory encoding, and then use a multi-layer perceptron (MLP) to generate the replacement probability for each entry, where the MLP has three linear layers and uses ReLU as the activation function. The agent then selects a memory entry to replace based on the policy $\pi(m_i^{(t)} \mid M^{(t)}, x^{(t)}; \theta)$. Figure 3 illustrates the architecture of the memory encoder for EMR-Independent and EMR-biGRU/Transformer.
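To make the role of self-attention over memory entries concrete, the following is a minimal single-head sketch of scaled dot-product attention (Vaswani et al., 2017). It assumes identity projections for the query, key, and value and omits the position encoding, the multiple heads, and the final MLP, so it only shows how every entry attends to all other entries regardless of slot distance. The function name `self_attention` is hypothetical.

```python
import math

def softmax(zs):
    m = max(zs)                                  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(memory):
    """Single-head scaled dot-product self-attention with identity projections.

    memory: list of N vectors of dimension d. Returns N vectors of dimension d,
    each a weighted average of all memory entries.
    """
    d = len(memory[0])
    outputs = []
    for query in memory:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in memory]
        weights = softmax(scores)                # attention over every slot
        outputs.append([sum(w * value[c] for w, value in zip(weights, memory))
                        for c in range(d)])
    return outputs

encoded = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(len(encoded), len(encoded[0]))
```

In a full model, learned linear layers would produce the queries, keys, and values, and the attended outputs would be passed to the MLP that scores each entry for replacement.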

Value Network
For solving certain QA problems, we need to consider the future importance of each memory entry. Especially in textual QA tasks (e.g. TriviaQA), storing the evidence sentences that precede span words may be useful, as they may provide useful context. However, using only a discrete policy gradient method, we cannot preserve such context instances. To overcome this issue, we use an actor-critic RL method (A3C; Mnih et al., 2016) to estimate the sum of future rewards at each state using the value network. The difference between the policy and the value is that the value can be estimated differently at each time step, and the value network needs to consider the memory as a whole. To obtain a holistic representation of our memory, we use Deep Sets (Zaheer et al., 2017). Following Zaheer et al. (2017), we sum up all $h_i^{(t)}$ and input the sum into an MLP ($\rho$), which consists of two linear layers and a ReLU activation function, to obtain a set representation. Then, we further process the set representation $\rho(\sum_i h_i^{(t)})$ by a GRU with the hidden state from the previous time step. Finally, we feed the output of the GRU to a multi-layer perceptron to estimate the value $V^{(t)}$ for the current timestep.
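The set representation used by the value network can be illustrated with a small sketch of sum pooling followed by a two-layer MLP, in the style of Deep Sets (Zaheer et al., 2017). The weights below are arbitrary toy values and the function names are hypothetical; the subsequent GRU and the final value MLP are omitted. The point of the example is that the representation does not depend on the order of the memory slots.

```python
def linear(weights, bias, vec):
    # One dense layer: weights is a list of rows, bias has one value per row.
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def relu(vec):
    return [max(0.0, x) for x in vec]

def set_representation(entries, w1, b1, w2, b2):
    """rho(sum_i h_i): sum-pool the entry features, then apply a two-layer MLP."""
    pooled = [sum(column) for column in zip(*entries)]
    return linear(w2, b2, relu(linear(w1, b1, pooled)))

# Toy weights, chosen only for illustration.
W1, B1 = [[0.5, -0.5], [1.0, 1.0]], [0.0, 0.0]
W2, B2 = [[1.0, 0.5]], [0.0]

entries = [[1.0, 2.0], [3.0, -1.0], [0.0, 4.0]]
forward = set_representation(entries, W1, B1, W2, B2)
shuffled = set_representation(entries[::-1], W1, B1, W2, B2)
print(forward == shuffled)
```

Because the pooling is a sum, permuting the memory entries leaves the set representation unchanged, which suits a value estimate that should describe the memory as a whole.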

Training and test
Training Our model learns the memory scheduling policy jointly with the model that solves the task. For training EMR, we choose A3C (Mnih et al., 2016) or REINFORCE (Williams, 1992). At training time, since the tasks are given, we provide the question to the agent at every timestep. At each step, the agent selects the action stochastically from a multinomial distribution based on $\pi(m_i^{(t)} \mid M^{(t)}, x^{(t)}; \theta)$ to explore various states, and takes the action. Then, the QA model provides the agent with the reward $R^{(t)}$. We use the asynchronous multiprocessing method described in Mnih et al. (2016) to train several models at once.
Test At test time, the agent deletes the memory entry at index $\arg\max_i \pi(m_i^{(t)} \mid M^{(t)}, x^{(t)}; \theta)$. In contrast to the training step, the model observes the question only at the end of the data stream. When encountering the question, the model solves the task using the data instances kept in the external memory.
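The difference between action selection at training time and at test time can be sketched as below. The function name `select_action` and its arguments are hypothetical; the sketch only assumes that the policy has already produced a probability for each candidate entry.

```python
import random

def select_action(probs, training, rng=None):
    """Pick the index of the entry to delete from replacement probabilities.

    Training: sample from the multinomial distribution to explore.
    Test: take the arg max of the policy.
    """
    if training:
        rng = rng or random
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return max(range(len(probs)), key=probs.__getitem__)

probs = [0.1, 0.7, 0.2]
print(select_action(probs, training=False))
print(select_action(probs, training=True, rng=random.Random(0)) in range(3))
```

Sampling lets the agent occasionally delete entries that the current policy considers important, which is how it discovers whether they actually matter for the reward.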

Experiment
We evaluate our EMR-biGRU and EMR-Transformer against several baselines:
1) FIFO (First-In First-Out). A rule-based memory scheduling policy that replaces the oldest memory entry.
2) Uniform. A policy that replaces a memory entry chosen uniformly at random at each timestep.
3) LIFO (Last-In First-Out). A policy that replaces the newest data instance. That is, it first fills in the memory and then discards all following data instances.
4) EMR-Independent. A baseline EMR which learns the importance of each memory entry only relative to the new data instance.
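The three rule-based baselines above can be written as small policies over a memory that is kept in arrival order. This is an illustrative sketch with hypothetical function names; in particular, whether the Uniform policy may also discard the incoming instance is an assumption made here.

```python
import random

def fifo(memory, x):
    return 0                       # delete the oldest entry

def lifo(memory, x):
    return len(memory)             # delete the newest candidate, i.e. the incoming x

def uniform(memory, x, rng=random.Random(0)):
    # Assumption: the incoming instance is one of the candidates for deletion.
    return rng.randrange(len(memory) + 1)

def run(stream, capacity, policy):
    """Stream items through a memory of size `capacity` under a rule-based policy."""
    memory = []
    for x in stream:
        candidates = memory + [x]
        if len(candidates) > capacity:
            del candidates[policy(memory, x)]
        memory = candidates
    return memory

print(run(range(10), 3, fifo))     # keeps the three most recent items
print(run(range(10), 3, lifo))     # keeps the first three items
print(len(run(range(10), 3, uniform)))
```

FIFO ends up holding the tail of the stream and LIFO the head, so their relative performance depends heavily on where the answers tend to appear in the context.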
We will release the codes for reproduction, if the paper is accepted.

TriviaQA
Dataset TriviaQA (Joshi et al., 2017) is a realistic text-based question answering dataset which includes 950K question-answer pairs from 662K documents collected from Wikipedia and the web. This dataset is more challenging than standard QA benchmark datasets such as the Stanford Question Answering Dataset (SQuAD; Rajpurkar et al., 2016), as the answers for a question may not be directly obtained by span prediction and the context is very long (Figure 4). Since conventional QA models (Seo et al., 2016; Devlin et al., 2018) are span prediction models, on TriviaQA they only train on QA pairs whose answers can be found in the given context. In such a setting, TriviaQA becomes highly biased, with the answers mostly located in the earlier part of the document (Figure 5). We evaluate our work only on the Wikipedia domain, since most previous work reports similar results on both domains. While the TriviaQA dataset consists of both human-verified and machine-generated QA subsets, we use the human-verified subset only, since the machine-generated QA pairs are unreliable. We use the validation set for testing, since the test set does not contain labels.

Experiment Details
We employ the pre-trained model from Deep Bidirectional Transformers (BERT; Devlin et al., 2018), the current state-of-the-art model for the SQuAD challenge, which pre-trains a stack of Transformers (Vaswani et al., 2017) and is then used to predict the indices of the exact location of an answer. We embed 20 words into each memory cell using a GRU and set the number of cells to 20; thus the memory can hold 400 words at maximum. This is a reasonable restriction, since BERT limits the maximum number of word tokens to 512, including both the context and the query.
Figure 6: A visualization of how our model operates in order to solve the problem. Episodic Memory Reader (EMR) sequentially reads the sentences one by one while replacing the least important memories. For this example, EMR retained the sentences with the word 'France' (bold fonts) in order to answer a given question.

Results and Analysis
We report the performance of our model on TriviaQA using both ExactMatch and F1-score in Table 1. We see that the EMR models which consider the relative importance between the memory entries (EMR-biGRU and EMR-Transformer) outperform both the rule-based baselines and EMR-Independent. One interesting observation is that LIFO performs quite well, unlike the other rule-based scheduling policies, but this is due to the dataset bias (see Figure 5), where most answers are located in the earlier part of the documents. To further see whether this improvement comes from the ability to remember important context, we examine the sentences that remain in the memory after EMR finishes reading all the sentences in Figure 6. We see that EMR remembered the sentences that contain the key words required to answer the future question. See the supplementary file for more examples.
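For readers unfamiliar with the two metrics, the sketch below shows a simplified token-level version of ExactMatch and F1 as commonly used for span-based QA. It assumes lowercasing and whitespace tokenization only, which is simpler than the normalization in the official evaluation scripts, and the function names are hypothetical.

```python
from collections import Counter

def exact_match(prediction, gold):
    # 1.0 if the token sequences are identical after lowercasing, else 0.0.
    return float(prediction.lower().split() == gold.lower().split())

def f1_score(prediction, gold):
    """Harmonic mean of token precision and recall between prediction and gold."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))
print(round(f1_score("paris france", "paris"), 3))
```

F1 gives partial credit when the predicted span overlaps the gold answer, whereas ExactMatch only rewards identical answers.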

TVQA
Dataset TVQA (Lei et al., 2018) is a localized, compositional video question-answering dataset that contains 153K question-answer pairs from 22K clips spanning over 460 hours of video. The questions are multiple-choice questions on the video contents, where the task is to find a single correct answer out of five candidate answers. The questions can be answered by examining the annotated clip segments, which span around 30 frames per clip (see Figure 7 (a)). The average number of frames for each clip is 229. In addition to the video frames, the dataset also provides subtitles for each video frame. Thus solving the questions requires compositional reasoning capability over both a large number of images and texts.
Experiment Details As the QA module, we use the Multi-stream model for Multi-Modal Video QA, which is the attention-based baseline model provided in Lei et al. (2018). For efficient training, we use features extracted from a ResNet-101 pretrained on the ImageNet dataset. For embedding subtitles and question-answer pairs, we use GloVe (Pennington et al., 2014). For training, we restrict the number of memory entries for our episodic reader to 20, where each memory entry contains the encoding of a video frame and the subtitle associated with the frame; the former is encoded using a CNN and the latter using a GRU. We train our model and the baseline models using the ADAM optimizer (Kingma and Ba, 2014), with an initial learning rate of 0.0001. Unlike in the experiments on TriviaQA, we use REINFORCE (Williams, 1992) to train the policy. This is because TVQA is composed of consecutive image frames captured within a short time interval, which tend to contain redundant information. Thus the value network of the actor-critic model fails to estimate a good value for the given state, since deleting a good frame will not necessarily result in a loss of QA accuracy. We therefore compute the reward $R^{(t)}$ as the difference in accuracy between time steps $t$ and $t - 1$, and use only the policy with non-episodic REINFORCE for training. With this method, if the QA model fails to answer the question after a certain frame is deleted, the frame is considered important, and unimportant otherwise.
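The per-step reward used for TVQA can be sketched as follows. The sketch assumes that the QA accuracy has been measured after every deletion and that each step contributes a standard REINFORCE term, the negative log-probability of the chosen action weighted by that step's reward; the function names are hypothetical and the training loop itself is omitted.

```python
import math

def step_rewards(accuracies):
    """R(t) = accuracy at step t minus accuracy at step t - 1."""
    return [curr - prev for prev, curr in zip(accuracies, accuracies[1:])]

def reinforce_loss(log_prob, reward):
    # Per-step REINFORCE term: minimizing it raises the probability of
    # actions with positive reward and lowers it for negative reward.
    return -log_prob * reward

# The question is still answered correctly after the first deletion,
# but no longer after the second one.
accuracies = [1.0, 1.0, 0.0]
rewards = step_rewards(accuracies)
print(rewards)
print(reinforce_loss(math.log(0.5), rewards[1]) < 0)
```

A deletion that causes the accuracy to drop receives a negative reward, so the policy is pushed away from deleting that frame, which matches the notion of an important frame described above.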

Results and Analysis
We report the accuracy on TVQA as a function of memory size in Figure 8. We observe that the EMR variants significantly outperform all baselines, including EMR-Independent. We also observe that the models perform well even when the size of the memory is increased to as large as 60, which was never encountered during the training stage, where the number of memory entries was fixed at 20. When the size of the memory is small, the gap between different models is larger, with EMR-Transformer obtaining the best accuracy, which may be due to its ability to capture the global relative importance of each memory entry. However, the gap between EMR-Transformer and EMR-biGRU diminishes as the size of the memory increases, since the memory then becomes large enough to contain all the frames necessary to answer the question.
As a qualitative analysis, we further examine which frames and subtitles were preserved in the external memory after the model has read through the entire sequence in Figure 8. To answer the question for this example, the model should consider the relationship between two frames, where the first frame describes Ross showing the paper to others, and the second frame describes Monica entering the coffee shop. We see that our model kept both frames, although it did not know what the question would be. See the Appendix for more examples.

Conclusion
We proposed a novel problem of question answering from streaming data, where the model needs to answer a question that is given after reading through an unlimited amount of context (e.g. documents, videos) that cannot fit into the system memory. To handle this problem, we proposed Episodic Memory Reader (EMR), a memory-augmented network with an RL-based memory scheduler that learns the relative importance among memory entries and replaces the entries with the lowest importance to maximize the QA performance for future tasks. We validated EMR on large-scale text and video QA datasets against rule-based memory scheduling as well as an RL baseline that does not model the relative importance among memory entries, both of which it significantly outperforms. Further qualitative analysis of the memory contents after learning confirms that such good performance comes from its ability to retain important instances for future QA tasks.
Figure 15: An example clip from the drama 'How I Met Your Mother'. Each frame marked with a star corresponds to the question marked with a star of the same color.