Interactive Machine Comprehension with Information Seeking Agents

Existing machine reading comprehension (MRC) models do not scale effectively to real-world applications like web-level information retrieval and question answering (QA). We argue that this stems from the nature of MRC datasets: most of these are static environments wherein the supporting documents and all necessary information are fully observed. In this paper, we propose a simple method that reframes existing MRC datasets as interactive, partially observable environments. Specifically, we “occlude” the majority of a document’s text and add context-sensitive commands that reveal “glimpses” of the hidden text to a model. We repurpose SQuAD and NewsQA as an initial case study, and then show how the interactive corpora can be used to train a model that seeks relevant information through sequential decision making. We believe that this setting can contribute in scaling models to web-level QA scenarios.


Introduction
Many machine reading comprehension (MRC) datasets have been released in recent years (Rajpurkar et al. 2016;Trischler et al. 2016;Nguyen et al. 2016;Reddy, Chen, and Manning 2018;Yang et al. 2018) to benchmark a system's ability to understand and reason over natural language.Typically, these datasets require an MRC model to read through a document to answer a question about information contained therein.
The supporting document is, more often than not, static and fully observable.This raises concerns, since models may find answers simply through shallow pattern matching; e.g., syntactic similarity between the words in questions and documents.As pointed out by Sugawara et al., for questions starting with when, models tend to predict the only date/time answer in the supporting document.Such behavior limits the generality and usefulness of MRC models, and suggests that they do not learn a proper 'understanding' of the intended task.In this paper, to address this problem, we shift the focus of MRC data away from 'spoon-feeding' models with loss suffered during the 2008 recession .It was worth $ 32 billion in 2011 , up from $ 28 billion in September 2010 and $ 26 billion in 2009 .

Predicion $ 32 billion
Table 1: Examples of interactive machine reading comprehension behavior.In the upper example, the agent has no memory of past observations, and thus it answers questions only with observation string at current step.In the lower example, the agent is able to use its memory to find answers.
sufficient information in fully observable, static documents.Instead, we propose interactive versions of existing MRC tasks, whereby the information needed to answer a question must be gathered sequentially.
The key idea behind our proposed interactive MRC (iMRC) is to restrict the document context that a model observes at one time.Concretely, we split a supporting document into its component sentences and withhold these sentences from the model.Given a question, the model must issue commands to observe sentences in the withheld set; we equip models with actions such as Ctrl+F (search for token) and stop for searching through partially observed documents.A model searches iteratively, conditioning each command on the input question and the sentences it has observed previously.Thus, our task requires models to 'feed themselves' rather than spoon-feeding them with information.This casts MRC as a sequential decision-making problem amenable to reinforcement learning (RL).
As an initial case study, we repurpose two well known, related corpora with different difficulty levels for our interactive MRC task: SQuAD and NewsQA.Table 1 shows some examples of a model performing interactive MRC on these datasets.Naturally, our reframing makes the MRC problem harder; however, we believe the added demands of iMRC more closely match web-level QA and may lead to deeper comprehension of documents' content.
The main contributions of this work are as follows: 1. We describe a method to make MRC datasets interactive and formulate the new task as an RL problem.
2. We develop a baseline agent that combines a top performing MRC model and a state-of-the-art RL optimization algorithm and test it on our iMRC tasks.
3. We conduct experiments on several variants of iMRC and discuss the significant challenges posed by our setting.

Related Works
Skip-reading (Yu, Lee, and Le 2017;Seo et al. 2017;Choi et al. 2017) is an existing setting in which MRC models read partial documents.Concretely, these methods assume that not all tokens in the input sequence are useful, and therefore learn to skip irrelevant tokens based on the current input and their internal memory.Since skipping decisions are discrete, the models are often optimized by the REINFORCE algorithm (Williams 1992).For example, the structural-jump-LSTM proposed in (Hansen et al. 2019) learns to skip and jump over chunks of text.In a similar vein, Han et al. designed a QA task where the model reads streaming data unidirectionally, without knowing when the question will be provided.Skip-reading approaches are limited in that they only consider jumping over a few consecutive tokens and the skipping operations are usually unidirectional.Based on the assumption that a single pass of reading may not provide sufficient information, multi-pass reading methods have also been studied (Sha, Qian, and Sui 2017;Shen et al. 2017).
Compared to skip-reading and multi-turn reading, our work enables an agent to jump through a document in a more dynamic manner, in some sense combining aspects of skip-reading and re-reading.For example, it can jump forward, backward, or to an arbitrary position, depending on the query.This also distinguishes the model we develop in  (Shen et al. 2017), where an agent decides when to stop unidirectional reading.
Recently, Geva and Berant propose DocQN, which is a DQN-based agent that leverages the (tree) structure of documents and navigates across sentences and paragraphs.The proposed method has been shown to outperform vanilla DQN and IR baselines on TriviaQA dataset.The main differences between our work and DocQA include: iMRC does not depend on extra meta information of documents (e.g., title, paragraph title) for building document trees as in DocQN; our proposed environment is partially-observable, and thus an agent is required to explore and memorize the environment via interaction; the action space in our setting (especially for the Ctrl+F command as defined in later section) is arguably larger than the tree sampling action space in DocQN.
Closely related to iMRC is work by (Bachman, Sordoni, and Trischler 2016), in which the authors introduce a collection of synthetic tasks to train and test information-seeking capabilities in neural models.We extend that work by developing a realistic and challenging text-based task.
Broadly speaking, our approach is also linked to the optimal stopping problem in the literature Markov decision processes (MDP) (Bertsekas 2000), where at each time-step the agent either continues or stops and accumulates reward.Here, we reformulate conventional QA tasks through the lens of optimal stopping, in hopes of improving over the shallow matching behaviors exhibited by many MRC systems.

iMRC: Making MRC Interactive
We build the iSQuAD and iNewsQA datasets based on SQuAD v1.12 (Rajpurkar et al. 2016) andNewsQA (Trischler et al. 2016).Both original datasets share similar properties.Specifically, every data-point consists of a tuple, {p, q, a}, where p represents a paragraph, q a question, and a is the answer.The answer is a word span defined by head and tail positions in p. NewsQA is more difficult than SQuAD because it has a larger vocabulary, more difficult questions, and longer source documents.
We first split every paragraph p into a list of sentences S = {s 1 , s 2 , ..., s n }, where n stands for number of sentences in p.Given a question q, rather than showing the entire paragraph p, we only show an agent the first sentence s 1 and withhold the rest.The agent must issue commands to reveal the hidden sentences progressively and thereby gather the information needed to answer question q.
An agent decides when to stop interacting and output an answer, but the number of interaction steps is limited. 3Once an agent has exhausted its step budget, it is forced to answer the question.

Interactive MRC as a POMDP
As described in the previous section, we convert MRC tasks into sequential decision-making problems (which we will refer to as games).These can be described naturally within the reinforcement learning (RL) framework.Formally, tasks in iMRC are partially observable Markov decision processes (POMDP) (Kaelbling, Littman, and Cassandra 1998).An iMRC data-point is a discrete-time POMDP defined by (S, T, A, Ω, O, R, γ), where γ ∈ [0, 1] is the discount factor and the other elements are described in detail below.
Environment States (S): The environment state at turn t in the game is s t ∈ S. It contains the complete internal information of the game, much of which is hidden from the agent.When an agent issues an action a t , the environment transitions to state s t+1 with probability T (s t+1 |s t , a t )).In this work, transition probabilities are either 0 or 1 (i.e., deterministic environment).
Actions (A): At each game turn t, the agent issues an action a t ∈ A. We will elaborate on the action space of iMRC in the action space section.
Observations (Ω): The text information perceived by the agent at a given game turn t is the agent's observation, o t ∈ Ω, which depends on the environment state and the previous action with probability O(o t |s t ).In this work, observation probabilities are either 0 or 1 (i.e., noiseless observation).
Reward Function (R): Based on its actions, the agent receives rewards r t = R(s t , a t ).Its objective is to maximize the expected discounted sum of rewards E [ t γ t r t ].

Action Space
To better describe the action space of iMRC, we split an agent's actions into two phases: information gathering and question answering.During the information gathering phase, the agent interacts with the environment to collect knowledge.It answers questions with its accumulated knowledge in the question answering phase.
Information Gathering: At step t of the information gathering phase, the agent can issue one of the following four actions to interact with the paragraph p, where p consists of n sentences and where the current observation corresponds to sentence s k , 1 ≤ k ≤ n: • Ctrl+F <query>: jump to the sentence that contains the next occurrence of "query"; • stop: terminate information gathering phase.
Question Answering: We follow the output format of both SQuAD and NewsQA, where an agent is required to point to the head and tail positions of an answer span within p. Assume that at step t the agent stops interacting and the observation o t is s k .The agent points to a head-tail position pair in s k .

Query Types
Given the question "When is the deadline of AAAI?", as a human, one might try searching "AAAI" on a search engine, follow the link to the official AAAI website, then search for keywords "deadline" or "due date" on the website to jump to a specific paragraph.Humans have a deep understanding of questions because of their significant background knowledge.As a result, the keywords they use to search are not limited to what appears in the question.Inspired by this observation, we study 3 query types for the Ctrl+F <query> command.

Dataset
1.One token from the question: the setting with smallest action space.Because iMRC deals with Ctrl+F commands by exact string matching, there is no guarantee that all sentences are accessible from question tokens only.
2. One token from the union of the question and the current observation: an intermediate level where the action space is larger.
3. One token from the dataset vocabulary: the action space is huge (see Table 2 for statistics of SQuAD and NewsQA).
It is guaranteed that all sentences in all documents are accessible through these tokens.

Evaluation Metric
Since iMRC involves both MRC and RL, we adopt evaluation metrics from both settings.First, as a question answering task, we use F 1 score to compare predicted answers against ground-truth, as in previous works.When there exist multiple ground-truth answers, we report the max F 1 score.

Baseline Agent
As a baseline, we propose QA-DQN, an agent that adopts components from QANet (Yu et al. 2018) and adds an extra command generation module inspired by LSTM-DQN (Narasimhan, Kulkarni, and Barzilay 2015).
As illustrated in Figure 1, the agent consists of three components: an encoder, an action generator, and a question answerer.More precisely, at a game step t, the encoder reads observation string o t and question string q to generate attention aggregated hidden representations M t .Using M t , the action generator outputs commands (defined in previous sections) to interact with iMRC.If the generated command is stop or the agent is forced to stop, the question answerer takes the current information at game step t to generate head and tail pointers for answering the question; otherwise, the information gathering procedure continues.
In this section, we describe the high-level model structure and training strategies of QA-DQN.We refer readers to Yu et al. for detailed information.We will release datasets and code in the near future.

Model Structure
In this section, we use game step t to denote one round of interaction between an agent with the iMRC environment.We use o t to denote text observation at game step t and q to denote question text.We use L to refer to a linear transformation.[•; •] denotes vector concatenation.
Encoder The encoder consists of an embedding layer, two stacks of transformer blocks (denoted as encoder transformer blocks and aggregation transformer blocks), and an attention layer.
In the embedding layer, we aggregate both word-and character-level embeddings.Word embeddings are initialized by the 300-dimension fastText (Mikolov et al. 2018) 4 Since SQuAD's test set is hidden.We split its training set to get a validation set and use the original development set for testing.We split on the article level to prevent overlap between the training and validation paragraphs.This yields 5,158 validation points.
vectors trained on Common Crawl (600B tokens), and are fixed during training.Character embeddings are initialized by 200-dimension random vectors.A convolutional layer with 96 kernels of size 5 is used to aggregate the sequence of characters.We use a max pooling layer on the character dimension, then a multi-layer perceptron (MLP) of size 96 is used to aggregate the concatenation of word-and characterlevel representations.A highway network (Srivastava, Greff, and Schmidhuber 2015) is used on top of this MLP.The resulting vectors are used as input to the encoding transformer blocks.
Each encoding transformer block consists of four convolutional layers (with shared weights), a self-attention layer, and an MLP.Each convolutional layer has 96 filters, each kernel's size is 7.In the self-attention layer, we use a block hidden size of 96 and a single head attention mechanism.Layer normalization and dropout are applied after each component inside the block.We add positional encoding into each block's input.We use one layer of such an encoding block.
At a game step t, the encoder processes text observation o t and question q to generate context-aware encodings h ot ∈ R L o t ×H1 and h q ∈ R L q ×H1 , where L ot and L q denote length of o t and q respectively, H 1 is 96.
Following Yu et al., we use a context-query attention layer to aggregate the two representations h ot and h q .Specifically, the attention layer first uses two MLPs to map h ot and h q into the same space, with the resulting representations denoted as h ot ∈ R L o t ×H2 and h q ∈ R L q ×H2 , in which, H 2 is 96.
Then, a tri-linear similarity function is used to compute the similarities between each pair of h ot and h q items: where indicates element-wise multiplication and w is trainable parameter vector of size 96.
We apply softmax to the resulting similarity matrix S along both dimensions, producing S A and S B .Information in the two representations are then aggregated as where h oq is aggregated observation representation.On top of the attention layer, a stack of aggregation transformer blocks is used to further map the observation representations to action representations and answer representations.The configuration parameters are the same as the encoder transformer blocks, except there are two convolution layers (with shared weights), and the number of blocks is 7.
Let M t ∈ R L o t ×H3 denote the output of the stack of aggregation transformer blocks, in which H 3 is 96.

Action Generator
The action generator takes M t as input and estimates Q-values for all possible actions.As described in previous section, when an action is a Ctrl+F command, it is composed of two tokens (the token "Ctrl+F" and the query token).Therefore, the action generator consists of three MLPs: (3) Here, the size of L shared ∈ R 95×150 ; L action has an output size of 4 or 2 depending on the number of actions available; the size of L ctrlf is the same as the size of a dataset's vocabulary size (depending on different query type settings, we mask out words in the vocabulary that are not query candidates).The overall Q-value is simply the sum of the two components: (4) Question Answerer Following (Yu et al. 2018), we append two extra stacks of aggregation transformer blocks on top of the encoder to compute head and tail positions: (5) Here, M head and M tail are outputs of the two extra transformer stacks, L 0 , L 1 , L 2 and L 3 are trainable parameters with output size 150, 150, 1 and 1, respectively.

Memory and Reward Shaping
Memory In iMRC, some questions may not be easily answerable based only on observation of a single sentence.To overcome this limitation, we provide an explicit memory mechanism to QA-DQN.Specifically, we use a queue to store strings that have been observed recently.The queue has a limited size of slots (we use queues of size [1,3,5] in this work).This prevents the agent from issuing next commands until the environment has been observed fully, in which case our task would degenerate to the standard MRC setting.The memory slots are reset episodically.
Reward Shaping Because the question answerer in QA-DQN is a pointing model, its performance relies heavily on whether the agent can find and stop at the sentence that contains the answer.We design a heuristic reward to encourage and guide this behavior.In particular, we assign a reward if the agent halts at game step k and the answer is a sub-string of o k (if larger memory slots are used, we assign this reward if the answer is a sub-string of the memory at game step k).We denote this reward as the sufficient information reward, since, if an agent sees the answer, it should have a good chance of having gathered sufficient information for the question (although this is not guaranteed).
Note this sufficient information reward is part of the design of QA-DQN, whereas the question answering score is the only metric used to evaluate an agent's performance on the iMRC task.Ctrl+F Only Mode As mentioned above, an agent might bypass Ctrl+F actions and explore an iMRC game only via next commands.We study this possibility in an ablation study, where we limit the agent to the Ctrl+F and stop commands.In this setting, an agent is forced to explore by means of search a queries.

Training Strategy
In this section, we describe our training strategy.We split the training pipeline into two parts for easy comprehension.We use Adam (Kingma and Ba 2014) as the step rule for optimization in both parts, with the learning rate set to 0.00025.
Action Generation iMRC games are interactive environments.We use an RL training algorithm to train the interactive information-gathering behavior of QA-DQN.We adopt the Rainbow algorithm proposed by Hessel et al., which integrates several extensions to the original Deep Q-Learning algorithm (Mnih et al. 2015).Rainbox exhibits state-of-theart performance on several RL benchmark tasks (e.g., Atari games).
During game playing, we use a mini-batch of size 10 and push all transitions (observation string, question string, generated command, reward) into a replay buffer of size 500,000.We do not compute losses directly using these transitions.After every 5 game steps, we randomly sample a mini-batch of 64 transitions from the replay buffer, compute loss, and update the network.
Detailed hyper-parameter settings for action generation are shown in Table 3.

Parameter Value
Discount γ 0.9 Noisy Nets σ 0 0.5 Target Network Period 1000 episodes Multi-step returns n n ∼ Uniform[1, 2, 3] Table 3: Hyper-parameter setup for action generation.Question Answering Similarly, we use another replay buffer to store question answering transitions (observation string when interaction stops, question string, ground-truth answer).
Because both iSQuAD and iNewsQA are converted from datasets that provide ground-truth answer positions, we can leverage this information and train the question answerer with supervised learning.Specifically, we only push question answering transitions when the ground-truth answer is in the observation string.For each transition, we convert the ground-truth answer head-and tail-positions from the SQuAD and NewsQA datasets to positions in the current observation string.After every 5 game steps, we randomly sample a mini-batch of 64 transitions from the replay buffer and train the question answerer using the Negative Log-Likelihood (NLL) loss.We use a dropout rate of 0.1.

Experimental Results
In this study, we focus on three factors and their effects on iMRC and the performance of the QA-DQN agent: 1. different Ctrl+F strategies, as described in the action space section; 2. enabled vs. disabled next and previous actions; 3. different memory slot sizes.
Below we report the baseline agent's training performance followed by its generalization performance on test data.

Mastering Training Games
It remains difficult for RL agents to master multiple games at the same time.In our case, each document-question pair can be considered a unique game, and there are hundred of thousands of them.Therefore, as is common practice in the RL literature, we study an agent's training curves.
Due to the space limitations, we select several representative settings to discuss in this section and provide QA-DQN's training and evaluation curves for all experimental settings in the Appendix.We provide the agent's suffi-cient information rewards (i.e., if the agent stopped at a state where the observation contains the answer) during training in Appendix as well.
Figure 2 shows QA-DQN's training performance (F 1 score) when next and previous actions are available.Figure 3 shows QA-DQN's training performance (F 1 score) when next and previous actions are disabled.Note that all training curves are averaged over 3 runs with different random seeds and all evaluation curves show the one run with max validation performance among the three.
From Figure 2, we can see that the three Ctrl+F strategies show similar difficulty levels when next and previous are available, although QA-DQN works slightly better when selecting a word from the question as query (especially on iNewsQA).However, from Figure 3 we observe that when next and previous are disabled, QA-DQN shows significant advantage when selecting a word from the question as query.This may due to the fact that when an agent must use Ctrl+F to navigate within documents, the set of question words is a much smaller action space in contrast to the other two settings.In the 4-action setting, an agent can rely on issuing next and previous actions to reach any sentence in a document.
The effect of action space size on model performance is particularly clear when using a datasets' entire vocabulary as query candidates in the 2-action setting.From Figure 3 (and figures with sufficient information rewards in the Appendix) we see QA-DQN has a hard time learning in this setting.As shown in Table 2, both datasets have a vocabulary size of more than 100k.This is much larger than in the other two settings, where on average the length of questions is around 10.This suggests that the methods with better sample efficiency are needed to act in more realistic problem settings with huge action spaces.
Experiments also show that a larger memory slot size always helps.Intuitively, with a memory mechanism (either implicit or explicit), an agent could make the environment closer to fully observed by exploring and memorizing observations.Presumably, a larger memory may further improve QA-DQN's performance, but considering the average number of sentences in each iSQuAD game is 5, a memory with more than 5 slots will defeat the purpose of our study of partially observable text environments.
Not surprisingly, QA-DQN performs worse in general on iNewsQA, in all experiments.As shown in Table 2, the average number of sentences per document in iNewsQA is about 6 times more than in iSQuAD.This is analogous to games with larger maps in the RL literature, where the environment is partially observable.A better exploration (in our case, jumping) strategy may help QA-DQN to master such harder games.

Generalizing to Test Set
To study QA-DQN's ability to generalize, we select the best performing agent in each experimental setting on the validation set and report their performance on the test set.The agent's test performance is reported in sufficient information, we also report the F 1 score of an agent when it has reached the piece of text that contains the answer, which we denote as F 1info .
From Table 4 (and validation curves provided in appendix) we can observe that QA-DQN's performance during evaluation matches its training performance in most settings.F 1info scores are consistently higher than the overall F 1 scores, and they have much less variance across different settings.This supports our hypothesis that information seeking play an important role in solving iMRC tasks, whereas question answering given necessary information is relatively straightforward.This also suggests that an interactive agent that can better navigate to important sentences is very likely to achieve better performance on iMRC tasks.

Discussion and Future Work
In this work, we propose and explore the direction of converting MRC datasets into interactive environments.We believe interactive, information-seeking behavior is desirable for neural MRC systems when knowledge sources are partially observable and/or too large to encode in their entirety -for instance, when searching for information on the internet, where knowledge is by design easily accessible to humans through interaction.
Despite being restricted, our proposed task presents major challenges to existing techniques.iMRC lies at the intersection of NLP and RL, which is arguably less studied in existing literature.We hope to encourage researchers from both NLP and RL communities to work toward solving this task.
For our baseline, we adopted an off-the-shelf, topperforming MRC model and RL method.Either component can be replaced straightforwardly with other methods (e.g., to utilize a large-scale pretrained language model).
Our proposed setup and baseline agent presently use only a single word with the query command.However, a host of other options should be considered in future work.For example, multi-word queries with fuzzy matching are more realistic.It would also be interesting for an agent to generate a vector representation of the query in some latent space.This vector could then be compared with precomputed document representations (e.g., in an open domain QA dataset) to determine what text to observe next, with such behavior tantamount to learning to do IR.
As mentioned, our idea for reformulating existing MRC datasets as partially observable and interactive environments is straightforward and general.Almost all MRC datasets can be used to study interactive, information-seeking behavior through similar modifications.We hypothesize that such behavior can, in turn, help in solving real-world MRC problems involving search.

Figure 2
Figure 2: 4-action setting: QA-DQN's F 1 scores during training on iSQuAD and iNewsQA datasets with different Ctrl+F strategies and cache sizes.next and previous commands are available.

Figure 3
Figure 3: 2-action setting: QA-DQN's F 1 scores during training on iSQuAD and iNewsQA datasets when using different Ctrl+F strategies and cache sizes.Note that next and previous are disabled.

Figure 4 :
Figure 4: Performance on iSQuAD training set.next and previous actions are available.

Figure 5 :
Figure 5: Performance on iSQuAD validation set.next and previous actions are available.

Figure 6 :
Figure 6: Performance on iSQuAD training set.next and previous actions are unavailable.

Figure 7 :
Figure 7: Performance on iSQuAD validation set.next and previous actions are unavailable.

Figure 8 :
Figure 8: Performance on iNewsQA training set.next and previous actions are available.

Figure 9 :
Figure 9: Performance on iNewsQA validation set.next and previous actions are available.

Figure 10 :
Figure 10: Performance on iNewsQA training set.next and previous actions are unavailable.

Figure 11 :
Figure 11: Performance on iNewsQA validation set.next and previous actions are unavailable.
Figure1: A demonstration of the proposed iMRC pipeline, in which the QA-DQN agent is illustrated in shaddow.At a game step t, QA-DQN encodes the question and text observation into hidden representations M t .An action generator takes M t as input to generate commands to interact with the environment.If the agent generates stop at this game step, M t is used to answer question by a question answerer.Otherwise, the iMRC environment will provide new text observation in response of the generated action.

Table 4 .
In addition, to support our claim that the challenging part of iMRC tasks is information seeking rather than answering questions given

Table 4 :
Experimental results on test set.#Action 4 denotes the settings as described in the action space section, #Action 2 indicates the setting where only Ctrl+F and stop are available.F 1info indicates an agent's F 1 score iff sufficient information is in its observation.