Interactive Fiction Game Playing as Multi-Paragraph Reading Comprehension with Reinforcement Learning

Interactive Fiction (IF) games with real human-written natural language texts provide a new natural evaluation for language understanding techniques. In contrast to previous text games with mostly synthetic texts, IF games pose language understanding challenges on the human-written textual descriptions of diverse and sophisticated game worlds and language generation challenges on the action command generation from less restricted combinatorial space. We take a novel perspective of IF game solving and re-formulate it as Multi-Passage Reading Comprehension (MPRC) tasks. Our approaches utilize the context-query attention mechanisms and the structured prediction in MPRC to efficiently generate and evaluate action outputs and apply an object-centric historical observation retrieval strategy to mitigate the partial observability of the textual observations. Extensive experiments on the recent IF benchmark (Jericho) demonstrate clear advantages of our approaches achieving high winning rates and low data requirements compared to all previous approaches. Our source code is available at: https://github.com/XiaoxiaoGuo/rcdqn.


Introduction
Interactive systems capable of understanding natural language and responding in the form of natural language text have high potential in various applications. In pursuit of building and evaluating such systems, we study learning agents for Interactive Fiction (IF) games. IF games are world-simulating software in which players use text commands to control the protagonist and influence the world, as illustrated in Figure 1. IF gameplay agents need to simultaneously understand the game's information from a text display (observation) and generate natural language commands (actions) via a text input interface. Without being given an explicit game strategy, the agents need to identify behaviors that maximize objective-encoded cumulative rewards.
IF games composed of human-written texts (distinct from previous text games with synthetic texts) create superb new opportunities for studying and evaluating natural language understanding (NLU) techniques due to their unique characteristics. (1) Game designers elaborately craft the literary quality of the narrative texts to attract players when creating IF games. The resulting texts in IF games are more linguistically diverse and sophisticated than the template-generated ones in synthetic text games. (2) The language contexts of IF games are more versatile because various designers contribute to a wide range of domains and genres, such as adventure, fantasy, horror, and sci-fi. (3) The text commands to control characters are less restricted, with action spaces over six orders of magnitude larger than in previous text games. The recently introduced Jericho benchmark provides a collection of such IF games (Hausknecht et al., 2019a). The complexity of IF games demands more sophisticated NLU techniques than those used in synthetic text games.
Moreover, the task of designing IF game-play agents, intersecting NLU and reinforcement learning (RL), poses several unique challenges on the NLU techniques. The first challenge is the difficulty of exploration in the huge natural language action space. To make RL agents learn efficiently without prohibitively exhaustive trials, the action estimation must generalize learned knowledge from tried actions to others. To this end, previous approaches, starting with a single embedding vector of the observation, either predict the elements of actions independently (Narasimhan et al., 2015; Hausknecht et al., 2019a), or embed each valid action as another vector and predict the action value based on vector-space similarities (He et al., 2016). These methods do not consider the compositionality or role differences of the action elements, or the interactions among them and the observation. Therefore, their modeling of the action values is less accurate and less data-efficient.
The second challenge is partial observability. At each game-playing step, the agent receives a textual observation describing the locations, objects, and characters of the game world. But the latest observation is often not a sufficient summary of the interaction history and may not provide enough information to determine the long-term effects of actions. Previous approaches address this problem by building a representation over past observations (e.g., building a graph of objects, positions, and spatial relations) (Ammanabrolu and Riedl, 2019; Ammanabrolu and Hausknecht, 2020). These methods treat the historical observations equally and summarize the information into a single vector without focusing on important contexts related to the action prediction for the current observation. Therefore, their use of history also introduces noise, and the improvement is not always significant.
We propose a novel formulation of IF game playing as Multi-Passage Reading Comprehension (MPRC) and harness MPRC techniques to address the huge action space and partial observability challenges. The graphical illustration is shown in Figure 2. First, the action value prediction (i.e., predicting the long-term rewards of selecting an action) is essentially generating and scoring a compositional action structure by finding supporting evidence from the observation. We build on the fact that each action is an instantiation of a template, i.e., a verb phrase with a few placeholders for the object arguments it takes (Figure 2b). Then the action generation process can be viewed as extracting objects for a template's placeholders from the textual observation, based on the interaction between the template verb phrase and the relevant context of the objects in the observation. Our approach addresses the structured prediction and interaction problems with the idea of the context-question attention mechanism in RC models. Specifically, we treat the observation as a passage and each template verb phrase as a question. The filling of object placeholders in the template thus becomes an extractive QA problem that selects objects from the observation given the template. Simultaneously, each action (i.e., a template with all placeholders filled) gets its evaluation value predicted by the RC model. Our formulation and approach better capture the fine-grained interactions between observation texts and structural actions, in contrast to previous approaches that represent the observation as a single vector and ignore the fine-grained dependency among action elements.
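To make the template view concrete, the following minimal sketch enumerates the candidate actions that a template yields for a set of objects detected in an observation. It is an illustration only: the `OBJ` placeholder token, the helper name `instantiate`, and the choice to fill different placeholders with distinct objects are assumptions made for this example, not details taken from our implementation.

```python
from itertools import permutations


def instantiate(template: str, objects: list[str]) -> list[str]:
    """Fill every OBJ placeholder in `template` with objects from the observation.

    Assumption for this sketch: placeholders are written as the token "OBJ"
    and different placeholders are filled with distinct objects.
    """
    n_slots = template.count("OBJ")
    actions = []
    for combo in permutations(objects, n_slots):
        action = template
        for obj in combo:
            # Replace the left-most remaining placeholder only.
            action = action.replace("OBJ", obj, 1)
        actions.append(action)
    return actions


# Hypothetical objects detected in an observation.
observation_objects = ["sword", "troll"]
print(instantiate("attack OBJ with OBJ", observation_objects))
# ['attack sword with troll', 'attack troll with sword']
```

Each generated string is one candidate action; the RC model then only has to score object choices per placeholder rather than treating every full command as an unrelated output.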
Second, alleviating partial observability is essentially enhancing the current observation with potentially relevant history and predicting actions over the enhanced observation. Our approach retrieves potentially relevant historical observations with an object-centric approach (Figure 2a), so that the retrieved ones are more likely to be connected to the current observation as they describe at least one shared interactable object. Our attention mechanisms are then applied across the retrieved multiple observation texts to focus on informative contexts for action value prediction.
We evaluated our approach on the suite of Jericho IF games, compared to all previous approaches. Our approaches achieved or outperformed the state-of-the-art performance on 25 out of 33 games, trained with less than one-tenth of the game interaction data used by prior art. We also provide ablation studies on our models and retrieval strategies.

Related Work
IF Game Agents. Previous work mainly studies text understanding and generation in parser-based or rule-based text game tasks, such as the TextWorld platform (Côté et al., 2018) or custom domains (Narasimhan et al., 2015; He et al., 2016; Adhikari et al., 2020). The recent platform Jericho (Hausknecht et al., 2019a) supports over thirty human-written IF games. Earlier successes in real IF games mainly rely on heuristics without learning. NAIL (Hausknecht et al., 2019b) is the state-of-the-art among these "no-learning" agents, employing a series of reliable heuristics for exploring the game, interacting with objects, and building an internal representation of the game world. With the development of learning environments like Jericho, RL-based agents have started to achieve dominating performance.
A critical challenge for learning-based agents is how to handle the combinatorial action space in IF games. LSTM-DQN (Narasimhan et al., 2015) was proposed to generate verb-object actions with pre-defined sets of possible verbs and objects, but treats the selection and learning of verbs and objects independently. Template-DQN (Hausknecht et al., 2019a) extended LSTM-DQN for template-based action generation, introducing one additional but still independent prediction output for the second object in the template. Deep Reinforcement Relevance Network (DRRN) (He et al., 2016) was introduced for choice-based games. Given a set of valid actions at every game state, DRRN projects each action into a hidden space that matches the current state representation vector for action selection. Action-Elimination Deep Q-Network (AE-DQN) (Zahavy et al., 2018) learns to predict invalid actions in the adventure game Zork. It eliminates invalid actions for efficient policy learning by utilizing expert demonstration data.
Other techniques focus on addressing the partial observability in text games. Knowledge Graph DQN (KG-DQN) (Ammanabrolu and Riedl, 2019) was proposed to deal with synthetic games. The method constructs and represents the game states as knowledge graphs with objects as nodes, and uses a pre-trained general-purpose OpenIE tool and human-written rules to extract relations between objects. KG-DQN handles the action representation following DRRN. KG-A2C (Ammanabrolu and Hausknecht, 2020) later extends the work to IF games by adding information extraction heuristics to fit the complexity of the object relations in IF games and utilizing a GRU-based action generator to handle the action space.
Reading Comprehension Models for Question Answering. Given a question, reading comprehension (RC) aims to find the answer to the question based on a paragraph that may contain supporting evidence. One of the standard RC settings is extractive QA (Rajpurkar et al., 2016; Joshi et al., 2017; Kwiatkowski et al., 2019), which extracts a span from the paragraph as an answer. Our formulation of IF game playing resembles this setting.
Many neural reader models have been designed for RC. Specifically, for the extractive QA task, the reader models usually build question-aware passage representations via attention mechanisms (Seo et al., 2016; Yu et al., 2018), and employ a pointer network to predict the start and end positions of the answer span (Wang and Jiang, 2016). Powerful pre-trained language models (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019) have recently been applied to enhance the encoding and attention mechanisms of the aforementioned reader models. They boost performance but are more resource-demanding and do not suit the IF game playing task very well.
Reading Comprehension over Multiple Paragraphs. Multi-paragraph reading comprehension (MPRC) deals with the more general task of answering a question from multiple related paragraphs, where each paragraph may not necessarily support the correct answer. Our formulation becomes an MPRC setting when we enhance the state representation with historical observations and predict actions from multiple observation paragraphs.
A fundamental research problem in MPRC, which is also critical to our formulation, is to select relevant paragraphs from all the input paragraphs for the reader to focus on. Previous approaches mainly apply traditional IR approaches like BM25 (Chen et al., 2017; Joshi et al., 2017), or neural ranking models trained with distant supervision (Wang et al., 2018; Min et al., 2019a), for paragraph selection. Our formulation also relates to the work on evidence aggregation in MPRC (Wang et al., 2017; Lin et al., 2018), which aims to infer the answers by jointly considering evidence pieces from multiple paragraphs. Finally, some recent works propose entity-centric paragraph retrieval approaches (Ding et al., 2019; Godbole et al., 2019; Min et al., 2019b; Asai et al., 2019), where paragraphs are connected if they share the same named entities. The paragraph retrieval then becomes a traversal over such graphs via entity links. These entity-centric paragraph retrieval approaches share a similar high-level idea with our object-based history retrieval approach. The techniques above have been applied to deal with evidence from Wikipedia, news collections, and, recently, books (Mou et al., 2020). We are the first to extend these ideas to IF games.
Multi-Paragraph RC for IF Games

Problem Formulation
Each IF game can be defined as a Partially Observable Markov Decision Process (POMDP), namely a 7-tuple $\langle S, A, T, O, \Omega, R, \gamma \rangle$, representing the hidden game state set, the action set, the state transition function, the set of textual observations composed from vocabulary words, the textual observation function, the reward function, and the discount factor, respectively. The verb phrases of the template actions usually consist of several vocabulary words and each object is usually a single word.

RC Model for Template Actions
We parameterize the observation-action value function $Q(o, a{=}\langle verb, arg_0, arg_1 \rangle; \theta)$ by utilizing the decomposition of the template actions and the context-query contextualized representation in RC. Our model treats the observation $o$ as a context in RC and the $verb = (v_1, v_2, ..., v_k)$ component of the template actions as a query. Then a verb-aware observation representation is derived via an RC reader model with Bidirectional Attention Flow (BiDAF) (Seo et al., 2016) and self-attention. The observation representations corresponding to the $arg_0$ and $arg_1$ words are pooled and projected to a scalar value estimate for $Q(o, a{=}\langle verb, arg_0, arg_1 \rangle; \theta)$. A high-level architecture of our model is illustrated in Figure 3.
Observation and verb Representation. We tokenize the observation and the verb phrase into words, then embed these words using pre-trained GloVe embeddings (Pennington et al., 2014). A shared encoder block that consists of Layer-Norm (Ba et al., 2016) and a Bidirectional GRU (Cho et al., 2014) processes the observation and verb word embeddings to obtain the separate observation and verb representations.
Observation-verb Interaction Layers. Given the separate observation and verb representations, we apply two attention mechanisms to compute a verb-contextualized observation representation. We first apply BiDAF with the observation as the context input and the verb as the query input. Specifically, we denote the processed embeddings for observation word $i$ and template word $j$ as $o_i$ and $t_j$. The attention between the two words is then $a_{ij} = w_1 \cdot o_i + w_2 \cdot t_j + w_3 \cdot (o_i \otimes t_j)$, where $w_1$, $w_2$, $w_3$ are learnable vectors and $\otimes$ is the element-wise product. We then compute the "verb2observation" attention vector for the $i$-th observation word as $c_i = \sum_j p_{ij} t_j$ with $p_{ij} = \exp(a_{ij}) / \sum_{j'} \exp(a_{ij'})$. Similarly, we compute the "observation2verb" attention vector as $q = \sum_i p_i o_i$ with $p_i = \exp(\max_j a_{ij}) / \sum_{i'} \exp(\max_j a_{i'j})$. We concatenate and project the output vectors, followed by a linear layer with leaky ReLU activation units (Maas et al., 2013). The output vectors are processed by an encoder block. We then apply a residual self-attention on the outputs of the encoder block. The self-attention is the same as BiDAF, but only between the observation and itself.
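The two attention vectors can be sketched in a few lines of NumPy. This is a minimal illustration rather than our implementation: it assumes the standard BiDAF trilinear similarity $a_{ij} = w_1 \cdot o_i + w_2 \cdot t_j + w_3 \cdot (o_i \otimes t_j)$, uses random vectors in place of learned parameters and encoder outputs, and the function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def bidaf_attention(O, T, w1, w2, w3):
    """O: (n, d) observation word vectors, T: (k, d) verb word vectors.

    Returns the similarity matrix A (n, k), the verb2observation vectors
    C (n, d), and the observation2verb vector q (d,).
    """
    # A[i, j] = w1.o_i + w2.t_j + w3.(o_i * t_j)  (assumed trilinear form)
    A = (O @ w1)[:, None] + (T @ w2)[None, :] + (O * w3) @ T.T
    P = softmax(A, axis=1)               # p_ij, normalised over verb words j
    C = P @ T                            # c_i = sum_j p_ij t_j
    p = softmax(A.max(axis=1), axis=0)   # p_i, normalised over observation words i
    q = p @ O                            # q = sum_i p_i o_i
    return A, C, q


# Toy sizes: 6 observation words, 2 verb words, 4-dimensional vectors.
rng = np.random.default_rng(0)
O, T = rng.normal(size=(6, 4)), rng.normal(size=(2, 4))
w1, w2, w3 = rng.normal(size=(3, 4))
A, C, q = bidaf_attention(O, T, w1, w2, w3)
print(A.shape, C.shape, q.shape)  # (6, 2) (6, 4) (4,)
```

Every observation word receives its own verb-aware vector $c_i$, which is what allows different object mentions in the same observation to be scored differently for the same verb phrase.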
Observation-Action Value Prediction. We generate an action by replacing the placeholders ($arg_0$ and $arg_1$) in a template with objects appearing in the observation. The observation-action value $Q(o, a{=}\langle verb, arg_0{=}obj_m, arg_1{=}obj_n \rangle; \theta)$ is obtained by processing each object's corresponding verb-contextualized observation representation. Specifically, we get the indices of an $obj$ in the observation texts, $I(obj, o)$. When the object is a noun phrase, we take the index of its headword. Because the same object has different meanings when it replaces different placeholders, we apply two GRU-based embedding functions for the two placeholders to get the object's verb-placeholder dependent embeddings. We derive a single vector representation $h_{arg_0=obj_m}$ for the case that the placeholder $arg_0$ is replaced by $obj_m$ by mean-pooling over the verb-placeholder dependent embeddings indexed by $I(obj_m, o)$ for the corresponding placeholder $arg_0$. We apply a linear transformation on the concatenated embeddings of the two placeholders to obtain the observation-action value $Q(o, a) = w_5 \cdot [h_{arg_0=obj_m}, h_{arg_1=obj_n}]$ for $a = \langle verb, arg_0{=}obj_m, arg_1{=}obj_n \rangle$. Our formulation avoids repeated computation among different actions that share a template verb phrase.
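The pooling and scoring step can be illustrated with the following sketch. It assumes the two placeholder-specific embedding matrices (here `H0` and `H1`) have already been computed once for a given template verb phrase, so that every object pair is scored with only a mean-pool and a dot product. The function names, the toy numbers, and the restriction to distinct objects are assumptions for illustration.

```python
import numpy as np


def pooled(H, indices):
    """Mean-pool the rows of H (n, d) at the positions where an object occurs."""
    return H[indices].mean(axis=0)


def score_actions(H0, H1, object_indices, w5):
    """Score every (arg0, arg1) object pair for one template verb phrase.

    H0, H1: (n, d) placeholder-specific embeddings of the observation words.
    object_indices: maps each object to its word positions I(obj, o).
    w5: (2d,) weight vector of the final linear layer.
    """
    scores = {}
    for m, idx_m in object_indices.items():
        h0 = pooled(H0, idx_m)
        for n, idx_n in object_indices.items():
            if m == n:  # assumption: the two placeholders take distinct objects
                continue
            h1 = pooled(H1, idx_n)
            scores[(m, n)] = float(w5 @ np.concatenate([h0, h1]))
    return scores


# Toy embeddings for a 4-word observation with 3-dimensional vectors.
H0 = np.arange(12, dtype=float).reshape(4, 3)
H1 = 2.0 * H0
w5 = np.ones(6)
print(score_actions(H0, H1, {"sword": [0, 2], "troll": [3]}, w5))
# {('sword', 'troll'): 72.0, ('troll', 'sword'): 54.0}
```

Because `H0` and `H1` depend only on the observation and the verb phrase, they are shared by all object pairs of the same template, which is the source of the saved computation.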

Multi-Paragraph Retrieval Method for Partial Observability
The observation at the current step sometimes does not have full textual evidence to support action selection and value estimation, due to the inherent partial observability of IF games. For example, when repeatedly attacking a troll with a sword, the player needs to know the effect or feedback of the last attack to determine if an extra attack is necessary. It is thus important for an agent to efficiently utilize historical observations to better support action value prediction. In our RC-based action prediction model, the historical observation utilization can be formulated as selecting evidential observation paragraphs in history and predicting the action values from multiple selected observations, namely a Multi-Paragraph Reading Comprehension (MPRC) problem. We propose to retrieve past observations with an object-centric approach.
Past Observation Retrieval. Multiple past observations may share objects with the current observation, and it is computationally expensive and unnecessary to retrieve all such observations. The utility of past observations associated with each object is often time-sensitive in that new observations may entirely or partially invalidate old observations. We thus propose a time-sensitive strategy for retrieving past observations. Specifically, given the detected objects from the current observation, we retrieve the most recent K observations with at least one shared object. The K retrieved observations are sorted by time steps and concatenated to the current observation. The observations from different time steps are separated by a special token. Our RC-based action prediction model treats the concatenated observations as the observation inputs, and no other parts are changed. We use the notation $o_t$ to represent the current observation and the extended current observation interchangeably.
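The retrieval rule can be sketched as follows. The history format, the separator string `" [SEP] "`, the function names, and the choice to place retrieved observations before the current one are assumptions for illustration; only the rule itself (the most recent K past observations sharing at least one object, kept in time order) follows the description above.

```python
def retrieve_history(history, current_objects, k=2):
    """Return the texts of the k most recent past observations that share
    at least one object with the current observation.

    history: list of (step, text, objects) tuples in chronological order.
    """
    current = set(current_objects)
    shared = [text for _, text, objs in history if current & set(objs)]
    # Slicing from the end keeps the retrieved texts sorted by time step.
    return shared[-k:] if k > 0 else []


def extend_observation(history, current_text, current_objects, k=2, sep=" [SEP] "):
    """Concatenate retrieved observations with the current one.

    `sep` stands in for the special separator token (hypothetical string).
    """
    return sep.join(retrieve_history(history, current_objects, k) + [current_text])


# Hypothetical interaction history.
history = [
    (0, "You see a troll.", {"troll"}),
    (1, "A lamp sits here.", {"lamp"}),
    (2, "The troll growls.", {"troll"}),
    (3, "The troll bleeds.", {"troll", "blood"}),
]
print(extend_observation(history, "The troll staggers.", {"troll", "sword"}))
# The troll growls. [SEP] The troll bleeds. [SEP] The troll staggers.
```

In this toy history, the observation at step 1 is skipped even though it is more recent than step 0, because it shares no object with the current observation; a purely recency-based window would have no such filter.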

Training Loss
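We apply the Deep Q-Network (DQN) (Mnih et al., 2015) to update the parameters $\theta$ of our RC-based action prediction model. The loss function is $L(\theta) = \mathbb{E}_{(o_t, a_t, r_t, o_{t+1}) \sim \rho(D)} \big[ \big( Q(o_t, a_t; \theta) - (r_t + \gamma \max_{b} Q(o_{t+1}, b; \theta^-)) \big)^2 \big]$, where $D$ is the experience replay consisting of recent gameplay transition records, $\rho$ is a distribution over the transitions defined by a sampling strategy, and $\theta^-$ denotes the parameters of the periodically updated target network used in DQN.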
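As a worked illustration of the temporal-difference objective, the sketch below computes a squared-error DQN loss for a small batch. It assumes the standard DQN target $r_t + \gamma \max_b Q(o_{t+1}, b)$ with bootstrapping switched off on terminal transitions; the `done` mask, the function name, and the toy numbers are assumptions for this example, and the Q-values are given directly instead of coming from a network.

```python
import numpy as np


def dqn_loss(q_taken, rewards, next_q_max, done, gamma=0.98):
    """Mean squared TD error for a batch of sampled transitions.

    q_taken:    Q(o_t, a_t) for the actions actually taken.
    next_q_max: max_b Q(o_{t+1}, b), e.g. from a target network.
    done:       1.0 where the episode ended at t+1 (no bootstrapping), else 0.0.
    """
    target = rewards + gamma * (1.0 - done) * next_q_max
    return float(np.mean((q_taken - target) ** 2))


q_taken = np.array([1.0, 0.5])
rewards = np.array([0.0, 10.0])
next_q_max = np.array([2.0, 3.0])
done = np.array([0.0, 1.0])

# Targets are [0.98 * 2.0, 10.0] = [1.96, 10.0]; squared errors are
# [0.9216, 90.25], so the mean is 45.5858.
print(round(dqn_loss(q_taken, rewards, next_q_max, done), 4))  # 45.5858
```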
Prioritized Trajectories. The distribution $\rho$ has a considerable impact on DQN performance. Previous work samples transition tuples with immediate positive rewards more frequently to speed up learning (Narasimhan et al., 2015; Hausknecht et al., 2019a). We observe that this heuristic is often insufficient. Some transitions with zero or even negative immediate rewards are also indispensable in recovering well-performing trajectories.
We thus extend the strategy from the transition level to the trajectory level. We prioritize transitions from trajectories that outperform the exponential moving average score of recent trajectories.
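One possible realization of trajectory-level prioritization is sketched below. The class name, the two-pool design, the decay rate, and the fraction of each batch drawn from the prioritized pool are all hypothetical choices for illustration; the text above only specifies that transitions from trajectories scoring above the exponential moving average (EMA) of recent trajectory scores are prioritized.

```python
import random


class TrajectoryPrioritizedReplay:
    """Two-pool replay: transitions from above-average trajectories are
    sampled preferentially (hyper-parameters here are illustrative)."""

    def __init__(self, ema_decay=0.9, priority_fraction=0.5, seed=0):
        self.ema = None                      # EMA of recent trajectory scores
        self.decay = ema_decay
        self.priority_fraction = priority_fraction
        self.priority, self.regular = [], []
        self.rng = random.Random(seed)

    def add_trajectory(self, transitions, score):
        # Compare against the EMA of *previous* trajectories, then update it.
        good = self.ema is not None and score > self.ema
        (self.priority if good else self.regular).extend(transitions)
        if self.ema is None:
            self.ema = score
        else:
            self.ema = self.decay * self.ema + (1 - self.decay) * score

    def sample(self, batch_size):
        n_pri = min(len(self.priority), int(batch_size * self.priority_fraction))
        n_reg = min(len(self.regular), batch_size - n_pri)
        return self.rng.sample(self.priority, n_pri) + self.rng.sample(self.regular, n_reg)


buffer = TrajectoryPrioritizedReplay()
buffer.add_trajectory(["a1", "a2"], score=10)   # first trajectory: regular pool
buffer.add_trajectory(["b1", "b2"], score=20)   # beats the EMA (10): prioritized
buffer.add_trajectory(["c1"], score=5)          # below the EMA (11): regular pool
print(buffer.priority, buffer.regular)
# ['b1', 'b2'] ['a1', 'a2', 'c1']
```

Note that every transition of a prioritized trajectory is kept, including those with zero or negative immediate reward, which is the point of moving from transition-level to trajectory-level prioritization.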

Experiments
We evaluate our proposed methods on the suite of Jericho-supported games. We compare against all previous baselines, including recent methods that address the huge action space and partial observability challenges.

Setup
Jericho Handicaps and Configuration. The handicaps used by our methods are the same as those of the other baselines. First, we use the Jericho API to check if an action is valid with game-specific templates. Second, we augment the observation with the textual feedback returned by the commands [inventory] and [look]. Previous work also included the last action or game score as additional inputs. Our model discards these two types of inputs, as we did not observe a significant difference when including them. The maximum game step number is set to 100, following the baselines.
Table 2: The Winning percentage / counts row reports the percentage / count of games on which the corresponding agent is best. The scores of the baselines are from their papers. Missing scores, for games that KG-A2C skipped, are represented as "-". We also add the 100-step results from a human-written game-playing walkthrough as a reference for human-level scores. We denote the difficulty levels of the games defined in the original Jericho paper with colors in their names: possible (i.e., easy or normal) games in green, difficult games in tan, and extreme games in red. Best seen in color. (a) The Zork3 walkthrough does not maximize the score in the first 100 steps but explores more. (b) Our agent discovers some unbounded reward loops in the game Ztuu.
Implementation Details. We apply spaCy to tokenize the observations and detect the objects in the observations. We use the 100-dimensional GloVe embeddings as fixed word embeddings. Out-of-vocabulary words are mapped to a randomly initialized embedding. The dimension of the Bi-GRU hidden states is 128. We set the observation representation dimension to 128 throughout the model. The history retrieval window K is 2. For the DQN configuration, we use the ε-greedy strategy for exploration, annealing ε from 1.0 to 0.05. The discount factor γ is 0.98. We use Adam to update the weights with a learning rate of $10^{-4}$. Other parameters are set to their default values. More details of the Reproducibility Checklist are in Appendix A.
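For readers unfamiliar with annealed ε-greedy exploration, a minimal sketch follows. The text only fixes the start and end values of ε (1.0 and 0.05); the linear shape of the schedule, its length (`total_steps`), and the function names are assumptions made for this example.

```python
import random


def epsilon_at(step, total_steps=100_000, eps_start=1.0, eps_end=0.05):
    """Linearly anneal epsilon from eps_start to eps_end (assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return eps_start + frac * (eps_end - eps_start)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action index with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


print(epsilon_at(0), round(epsilon_at(50_000), 3), round(epsilon_at(100_000), 3))
# 1.0 0.525 0.05
print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0))  # 1 (always greedy)
```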
Baselines. We compare with all the public results on the Jericho suite, namely TDQN (Hausknecht et al., 2019a), DRRN (He et al., 2016), and KG-A2C (Ammanabrolu and Hausknecht, 2020). As discussed, our approaches differ from them mainly in the strategies of handling the large action space and partial observability of IF games. We summarize these main technical differences in Table 1. In summary, all previous agents predict actions conditioned on a single vector representation of the whole observation text. Thus they do not exploit the fine-grained interplay among the template components and the observations. Our approach addresses this problem by formulating action prediction as an RC task, better utilizing the rich textual observations with deeper language understanding.
Training Sample Efficiency. We update our models 100,000 times. Our agents interact with the environment one step per update, resulting in a total of 0.1M environment interaction steps. Compared to the other agents, such as KG-A2C (1.6M), TDQN (1M), and DRRN (1M), our environment interaction data is significantly smaller.
Table 3: Difficulty levels and characteristics of games on which our approach achieves the most considerable improvement. Dialog indicates that it is necessary to speak with another character. Darkness indicates that accessing some dark areas requires a light source. Nonstandard Actions refers to actions with words not in an English dictionary. Inventory Limit restricts the number of items carried by the player. Please refer to (Hausknecht et al., 2019a) for more comprehensive definitions.

Overall Performance
We summarize the performance of our Multi-Paragraph Reading Comprehension DQN (MPRC-DQN) agent and the baselines in Table 2. Of the 33 IF games, our MPRC-DQN achieved or improved the state-of-the-art performance on 21 games (i.e., a winning rate of 64%). The best-performing baseline (DRRN) achieved the state-of-the-art performance on only ten games, corresponding to a winning rate of 30%, lower than half of ours. Note that all the methods achieved the same initial scores on five games, namely 905, anchor, awaken, deephome, and moonlit. Apart from these five games, our MPRC-DQN achieved more than three times as many wins as the best-performing baseline. Our MPRC-DQN achieved significant improvement on some games, such as adventureland, afflicted, and detective. Appendix C shows some game playing trajectories.
We include the performance of an RC-DQN agent, which implements our RC-based action prediction model but only takes the current observations as inputs. It also outperformed the baselines by a large margin. With the RC-DQN agent included in the comparison, our MPRC-DQN still has the highest winning percentage, indicating that our RC-based action prediction model has a significant impact on the performance improvement of our MPRC-DQN and that the improvement from the multi-passage retrieval is also non-negligible. Moreover, compared to RC-DQN, our MPRC-DQN has another advantage of faster convergence. The learning curves of our MPRC-DQN and RC-DQN agents on various games are in Appendix B.
Finally, our approaches, overall, achieve the new state-of-the-art on 25 games (i.e., a winning rate of 76%), giving a significant advance in the field of IF game playing.
Pairwise Competition. To better understand the performance difference between our approach and each of the baselines, we adopt a direct one-to-one comparison metric based on the results from Table 2. Our approach has a high winning rate when competing with any of the baselines, as summarized in Table 4. Each baseline rarely beats our approach on any game. DRRN draws with our approach more often than the other baselines do.
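The one-to-one comparison can be computed as in the sketch below. The game names and scores are toy placeholders, not values from Table 2, and the function name and the handling of games without a baseline score are assumptions for illustration.

```python
def pairwise(ours, baseline):
    """Win / draw / lose rates of `ours` against one baseline.

    ours, baseline: dicts mapping game name -> score. Games for which the
    baseline reports no score are skipped (assumption for this sketch).
    """
    win = draw = lose = 0
    for game, score in ours.items():
        other = baseline.get(game)
        if other is None:
            continue
        if score > other:
            win += 1
        elif score == other:
            draw += 1
        else:
            lose += 1
    total = win + draw + lose
    if total == 0:
        return {"win": 0.0, "draw": 0.0, "lose": 0.0}
    return {"win": win / total, "draw": draw / total, "lose": lose / total}


# Toy scores for four hypothetical games.
ours = {"game_a": 30.0, "game_b": 0.0, "game_c": 12.0, "game_d": 5.0}
baseline = {"game_a": 20.0, "game_b": 0.0, "game_c": 15.0}
print(pairwise(ours, baseline))
```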
Human-Machine Gap. We additionally compare IF gameplay agents to human players to better understand the improvement significance and the potential improvement upper bound. We measure each agent's game progress as the macro-average of the normalized agent-to-human game score ratios, capped at 100%. The progress of our MPRC-DQN is 28.5%, while that of the best-performing baseline, DRRN, is 17.8%, showing that our agent's improvement is significant even in the realm of human players. Nevertheless, there is a vast gap between the learning agents and human players. The gap indicates that IF games can be a good benchmark for the development of natural language understanding techniques.
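The game-progress metric can be illustrated as follows. The sketch assumes that "normalized" means the plain ratio of the agent score to the human walkthrough score with positive human scores; the function name and the toy numbers are hypothetical and do not come from Table 2.

```python
def game_progress(agent_scores, human_scores):
    """Macro-average of agent-to-human score ratios, each capped at 100%.

    Assumes every human score is positive; returns a percentage.
    """
    ratios = [
        min(agent_scores[game] / human, 1.0)
        for game, human in human_scores.items()
    ]
    return 100.0 * sum(ratios) / len(ratios)


# Toy example with three hypothetical games.
agent = {"game_a": 5.0, "game_b": 30.0, "game_c": 0.0}
human = {"game_a": 10.0, "game_b": 20.0, "game_c": 40.0}
# Ratios: 0.5, 1.5 capped to 1.0, and 0.0, so the macro-average is 50%.
print(game_progress(agent, human))  # 50.0
```

The cap prevents a single game in which the agent exceeds the walkthrough score (for example through a reward loop) from dominating the average.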
Difficulty Levels of Games. Jericho categorizes the supported games into three difficulty levels, namely possible games, difficult games, and extreme games, based on the characteristics of the game dynamics, such as the action space size, the length of the game, and the average number of steps to receive a non-zero reward. Our approach improves over prior art on seven of the sixteen possible games, seven of the eleven difficult games, and three of the six extreme games in Table 2. This shows that the strategies of our method are generally beneficial across all difficulty levels of game dynamics. Table 3 summarizes the characteristics of the seven games in which our method improves the most, i.e., by more than 15% of the game progress in the first 100 steps (footnote 4). First, these most improved games have medium action space sizes, an advantageous setting for our methods, where modeling the template-object-observation interactions is effective. Second, our approach improves most on games with a reasonably high degree of reward sparsity, such as karn, spirit, and zork3, indicating that our RC-based value function formulation helps in optimization and mitigates the reward sparsity. Finally, we remark that these game difficulty levels are not directly categorized based on natural language-related characteristics, such as text comprehension and puzzle-solving difficulties. Future studies on additional game categories based on those natural language-related characteristics would shed light on related improvements.

Ablative Studies
RC-model Design. The overall results show that our RC-model plays a critical role in performance improvement. We compare our RC-model to some alternative models as ablative studies. We consider three alternatives, namely (1) our RC-model without the self-attention component, (2) our RC-model without the argument-specific embedding, and (3) our RC-model with a Transformer-based block encoder in place of the Bi-GRU encoder. The learning curves for different RC-models are in Figure 4 (left/middle). The RC-models without either self-attention or argument-specific embedding degenerate, and the argument-specific embedding has a greater impact. The Transformer-based encoder block sometimes learns faster than Bi-GRU at the early learning stage. It achieved only a comparable final performance, despite much greater computational resource requirements.
Footnote 4: We ignore ztuu due to the infinite reward loops.
Retrieval Strategy. We compare with history retrieval strategies with different history sizes (K) and pure recency-based strategies (i.e., taking the latest K observations as history, denoted as w/o rec). The learning curves of different strategies are in Figure 4 (right). In general, the impact of history window size is highly game-dependent, but the pure recency-based ones do not differ significantly from RC-DQN at the beginning of learning. The issues of the pure recency-based strategy are: (1) limited additional information about objects provided by successive observations; and (2) higher variance of retrieved observations due to policy changes.

Conclusion
We

A Reproducibility Checklist
Training time. The training time is game-specific, ranging from 8 hours to 30 hours. The main factor in the time variance is the size of the combinatorial action space.
Hyper-parameters. We did not conduct extensive hyper-parameter tuning. We only tuned the learning rate of Adam from [0.001, 0.0003, 0.0001] and selected 0.0001 based on its performance on the game Zork1.
Architecture of Transformer-based block encoder. Following QANet (Yu et al., 2018), our Transformer-based block encoder consists of 1) position encoder layer, 2) layer normalization layer, 3) depthwise separable convolution layer, 4) layer normalization, 5) multi-head attention (4-head), 6) layer normalization, and 7) feedforward layer, in order. The head number of multi-head attention is reduced from 8 to 4 due to memory constraints.
Figure 5 shows the learning curves of our agents. The learning curves show that the differences among MPRC-DQN, RC-DQN, and the baselines are statistically significant. The learning curves also indicate that MPRC-DQN and RC-DQN outperform the most competitive baseline on most games at a very early stage of training.

C Game Play Trajectories
To the north is a nice, warm Holiday Inn hotel, where the killer is rumoured to be staying. Or you could go to his favourite hang out, the Wall, to the west, or to the east is the place where he is supposed to be working, the Doughnut King.
[step 31] Act: enter north [step 31] Reward:10, Cumulative Rewards:160 [step 32] Obs: Holiday Inn You are in the Holiday Inn registration room. You talk to some suspicious guys, but they don't talk until you hold your gun to their side. "Alright! Alright! I'll talk! He's on the 15th floor! That's all I can tell ya!" You shove them away. You walk up to the registration desk and show the woman there your badge. She gives you the master ring. You now have access to all of the facilitys on the 15th floor. But the problem is that the 15th floor is the suite level, and there are 30 suites, and 5 pools, 2 saunas and 5 game rooms. Big problem! Well, you have all night. You get a picture of all on the 15th floor, the people up there have to show their drivers license to be admitted, and the license is secretly xeroxed. You look at them all. Well, better get started. You see one person who stands out. You get his room number from the lady. Room 30. Now you have to find it... after a harrowing gun battle you conk him on the head and take him in. You get promoted and suddenly, with the "Jurrasic Park" theme song playing in your head, you feel proud to be an American. For special info about Exile Games, and to leave this darned game, go up. You can see a small black pistol and a food hamburger here.
[step 41] Act: down [step 41] Reward:0, Cumulative Rewards:310 [step 42] Obs: You can't go down from here! Room 30 You enter room 30... after a harrowing gun battle you conk him on the head and take him in. You get promoted and suddenly, with the "Jurrasic Park" theme song playing in your head, you feel proud to be an American. For special info about Exile Games, and to leave this darned game, go up. You can see a small black pistol and a food hamburger here.
[step 42] Act: down [step 42] Reward:0, Cumulative Rewards:310 [step 43] Obs: You can't go down from here! Room 30 You enter room 30... after a harrowing gun battle you conk him on the head and take him in. You get promoted and suddenly, with the "Jurrasic Park" theme song playing in your head, you feel proud to be an American. For special info about Exile Games, and to leave this darned game, go up. You can see a small black pistol and a food hamburger here.
[step 43] Act: down [step 43] Reward:0, Cumulative Rewards:310
(The agent repeatedly tries to go down, with occasional other actions, until the maximum step number is reached.)
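The trajectory listings in this appendix follow a fixed pattern: a step label with the action, the same step label with the reward and the cumulative reward, and then the next observation. As a reading aid, the sketch below shows one way such a listing could be parsed and checked for reward consistency. The names `Step`, `parse_steps`, and `inconsistent_steps` are hypothetical helpers introduced here for illustration and are not part of the released code.

```python
import re
from typing import List, NamedTuple


class Step(NamedTuple):
    step: int
    action: str
    reward: int
    cumulative: int


# Matches "[step N] Act: <command> [step N] Reward:<r>, Cumulative Rewards:<c>".
STEP_RE = re.compile(
    r"\[step\s*(\d+)\]\s*Act:\s*(.*?)\s*"
    r"\[step\s*\d+\]\s*Reward:\s*(-?\d+),\s*Cumulative Rewards:\s*(-?\d+)",
    re.DOTALL,
)


def parse_steps(log: str) -> List[Step]:
    """Extract (step, action, reward, cumulative reward) records from a listing."""
    return [Step(int(n), act, int(r), int(c))
            for n, act, r, c in STEP_RE.findall(log)]


def inconsistent_steps(steps: List[Step]) -> List[int]:
    """Return step numbers whose cumulative reward does not equal the previous
    cumulative reward plus the current reward (consecutive steps only)."""
    bad = []
    for prev, cur in zip(steps, steps[1:]):
        if cur.step == prev.step + 1 and cur.cumulative != prev.cumulative + cur.reward:
            bad.append(cur.step)
    return bad


if __name__ == "__main__":
    sample = (
        "[step 41] Act: down [step 41] Reward:0, Cumulative Rewards:310 "
        "[step 42] Obs: You can't go down from here! "
        "[step 42] Act: down [step 42] Reward:0, Cumulative Rewards:310"
    )
    for record in parse_steps(sample):
        print(record)
```

Only consecutive steps are compared, because the listings omit many intermediate steps and a gap says nothing about the rewards collected in between.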

C.2 Dragon
[step 0] Obs: The of the council members look grim. Gilgern continues to speak. "Of course, something must be done soon," he says in that gruff, hearty voice you have come to dislike so much. "Must restore public confidence and encourage the return of people to the land. Can't just leave the place to the likes of dragons and trolls. We must all make money again. Isn't as though it's just arrived, dragon's been there for years. Just that people found out about it, that's all." Marzipam looks round nervously at the others in the so called Council of the Wise. "Of course, we can't afford to pay you much. We are just poor men ourselves. Think of this more as a civic duty..." he wheezes. Around you the other travellers and adventurers shift nervously. Gilgern hurriedly adjusts his glasses and glances down at the scroll in front of him. "Says here that the Great Worm can be beaten," he goes on. "Sword of Erondil is the only weapon which can penetrate a dragon's scaly hide. Enchanted weapon, you see. Last seen many years ago in the Castle of Abercorn. Big, ugly place. Near the sea, I think, if memory serves me right. Dragon must be getting a bit careless and lazy by now. Been there all that time. Stands to reason. Might get in a lucky hit or something." Marzipam breaks in. "Or there's the spell, I suppose. The dragon is supposed to be held in place under an old enchantment, though that's perhaps just an old tale. Break the spell and the dragon will leave, the story goes..."
"Well, which of you is interested?" Gilgern looks across at you and the other adventurers. You begin to wish you had never set foot in the town it just happened to be on your journey, that's all. You've no wish to tangle with dangerous dragons or anything else. As Gilgern looks down, you suddenly find that the other adventurers have all quietly taken a couple of steps back, leaving you out in the front. "Splendid", chortles Gilgern before you can say anything, "We've one brave soul willing to risk their all for our little cause. Welcome, stranger..." Before you know it, you have been whisked away from the town square in a donkey and cart and deposited beside a track. "Good luck you'll need it!" calls the driver over his shoulder, as he hurries away. Welcome to Dragon Adventure. You start your quest on a mountain path in the North East... (Type INFO at any time for more playing information.) Dragon Adventure A simple text adventure for players aged 9 and over, by William Stott Release 2 / Serial number 040211 / Inform v6.21 Library 6/10 Rocky Mountains It is a clear, bright morning and all seems peaceful. A tiny sparrowhawk wheels and soars lazily in the cool empty air above. There is a steep pathway to the south leading down into dense woodland, and you can make out an old building by the path. In the distance you can just glimpse the steel grey of the sea, far away to the west. You can go north, south, east or west.
[step 1] Act: yes [step 1] Reward:0, Cumulative Rewards:0 [step 2] Obs: You run wildly, as fast as you can... Swampy Ground Water seeps around the soles of your boots and fills your recent footprints. You can hear the croaking of frogs and the occasional bubble of marsh gas coming to the surface. There is an abandoned inn to the east, and a path leads south into the forest. To the west there seems to be a little country road. You can go south (along a forest path), east (towards an inn) or west (towards a road). A little man comes up to you and asks if you want to buy a box. You can see a little man and a large cardboard box here.
[step 3] Act: east [step 3] Reward:0, Cumulative Rewards:0 [step 4] Obs: Abandoned Inn An old, thatched building stands near the path. It seems to be an abandoned inn or tavern, as a faded sign reads 'Troll's Arms'. The walls are a dirty white and the paintwork is starting to peel and blister. The windows are covered in cobwebs and there is the faint odour of mildew and decay. Weeds grow between the flagstones of the path leading up to the door, which leans open. No one seems to have lived here for years. A path leads north into the mountains and there seems to be some swampy ground to the west. You can go north (to the mountains), west (towards a swamp) or east (into the inn).
[ ] Obs: Abandoned Inn An old, thatched building stands near the path. It seems to be an abandoned inn or tavern, as a faded sign reads 'Troll's Arms'. The walls are a dirty white and the paintwork is starting to peel and blister. The windows are covered in cobwebs and there is the faint odour of mildew and decay. Weeds grow between the flagstones of the path leading up to the door, which leans open. No one seems to have lived here for years. A path leads north into the mountains and there seems to be some swampy ground to the west. You can go north (to the mountains), west (towards a swamp) or east (into the inn).
[ ] Obs: Swampy Ground Water seeps around the soles of your boots and fills your recent footprints. You can hear the croaking of frogs and the occasional bubble of marsh gas coming to the surface. There is an abandoned inn to the east, and a path leads south into the forest. To the west there seems to be a little country road. You can go south (along a forest path), east (towards an inn) or west (towards a road).
[step 11] Act: enter west [step 11] Reward:0, Cumulative Rewards:1 [step 12] Obs: North End of a Road A country road stretches to the south, its gravelled surface crunching slightly beneath your boots. It is bordered by trees and old hedgerows. Towards the east is a patch of swampy ground. You can go east (towards the swamp) or south (along the road). You can see a wooden signpost here.
[step 12] Act: enter south [step 12] Reward:0, Cumulative Rewards:1 [step 13] Obs: South End of a Road A gravelled country road leads north. A white painted wooden signpost points west, where you can make out the sea. You can just glimpse the glint of more water through the trees, far to the south. You can go north (along the road), south (into the trees) or west (towards the sea). You can see a signpost here.
[step 13] Act: enter west [step 13] Reward:0, Cumulative Rewards:1 [step 14] Obs: Wasteland by a Castle A stretch of uneven waste ground stands closeby to a castle. There are old dunes covered in spiky patches of marram and couch grass. You can make out the old castle and its buildings away to the north. The blue grey of the sea curves around in a wide arc to a beach towards the south. You can go north (towards the castle), south (to the beach) or east (towards the trees). You can see an old delivery bicycle here.
[step 14] Act: enter north [step 14] Reward:0, Cumulative Rewards:1 [step 15] Obs: Castle Forge The ruined forge stands just outside the castle. There is no roof, and the remains of thick stone walls are open to the sky. The castle itself is closeby to the north, its stone towers soaring upwards. There is desolate wasteland stretching away to the south. You can go north (towards the castle itself), or south (to some wasteland). You can see a brass lamp here.
[step 15] Act: enter north [step 15] Reward:0, Cumulative Rewards:1 [step 16] Obs: Forbidding Castle The hulk of a massive, stone castle towers into the sky above you. The walls are covered in grey lichen and moss, and seem to have stood here since the beginning of time itself. Set into the wall to the north you can see a large oak door, dark and weathered. There is a keyhole surrounded by a black iron plate. The ruins of the castle forge lie to the south. The door is locked, and there is no other way inside. The only way you can go is south (to the forge). You can see a small silver bell here.
[ Wasteland by a Castle A stretch of uneven waste ground stands closeby to a castle. There are old dunes covered in spiky patches of marram and couch grass. You can make out the old castle and its buildings away to the north. The blue grey of the sea curves around in a wide arc to a beach towards the south. You can go north (towards the castle), south (to the beach) or east (towards the trees). You can see an old delivery bicycle here.
[step 20] Act: south [step 20] Reward:0, Cumulative Rewards:1 [step 21] Obs: Pebbled Beach You hear the crashing sound of waves in your ears. White gulls wheel and play in the sky above, and you can taste the salt spray on your lips. Pebbles crunch beneath your feet as you move. You can see the ruins of an old stone lighthouse to the south. There is a wasteland of sand dunes to the north, and forest trees to the east. You can go north (towards wasteland), south (towards a lighthouse) or east (towards trees). You can see a parachute here.
[step 21] Act: enter east [step 21] Reward:0, Cumulative Rewards:1 [step 22] Obs: Lake in the Forest Dragonflies hum and hover, and you hear the occasional splash of feeding fish. You have a feeling of tranquility instilled by the placid, lazy waters. There is a little path winding between dense trees towards the south. You can see the start of a deserted gravel road to the north, and you can just make out the sea to the west. You can go south (further into the trees), north (towards the road) or west (towards the sea). You can see some long reeds here.
[ ] Obs: Lake in the Forest Dragonflies hum and hover, and you hear the occasional splash of feeding fish. You have a feeling of tranquility instilled by the placid, lazy waters. There is a little path winding between dense trees towards the south. You can see the start of a deserted gravel road to the north, and you can just make out the sea to the west. You can go south (further into the trees), north (towards the road) or west (towards the sea). You can see a big hunting horn and some long reeds here.
[step 27] Act: enter north [step 27] Reward:0, Cumulative Rewards:7 [step 28] Obs: South End of a Road A gravelled country road leads north. A white painted wooden signpost points west, where you can make out the sea. You can just glimpse the glint of more water through the trees, far to the south. You can go north (along the road), south (into the trees) or west (towards the sea). You can see a signpost here.
[step 28] Act: enter west [step 28] Reward:0, Cumulative Rewards:7 [step 29] Obs: Wasteland by a Castle A stretch of uneven waste ground stands closeby to a castle. There are old dunes covered in spiky patches of marram and couch grass. You can make out the old castle and its buildings away to the north. The blue grey of the sea curves around in a wide arc to a beach towards the south. You can go north (towards the castle), south (to the beach) or east (towards the trees). You can see an old delivery bicycle here.
[ of what lies there, long ago buried for countless ages." "And Fretsham the Fortunate?" you ask hopefully. "A fatality!" the Inquisitor booms. "Here," he says with evaporating patience, "take these and be of good cheer." He thrusts into your hands two things: a brass lantern of dubious quality; and a plastic sword of no antiquity or use whatsoever. "But what of Kolchack the " "Enough! Kolchack is contracted. Remember, who is the boss of you!" He pauses briefly as if awaiting a reply. Then without warning, he answers his own question. "Me! I am the boss of you! Now begone!" And with that, you embark on... Zork: The Undiscovered Underground, a text adventure prequel to Zork Grand Inquisitor Copyright 1997 by Activision Inc. Written by Marc Blank and Mike Berlyn, programmed by Gerry Kevin Wilson. Release 16 / Serial number 970828 / Inform v6.13 Library 6/2 Standard interpreter 1.0 New Excavation You are standing at the entrance of a narrow, waist high tunnel that winds out of sight to the northeast. This, you have been told, is the entrance to the newfound areas of the Great Underground Empire. In your left hand hangs a rather unconvincing replica of the traditional adventurer's brass lantern. In the other is an ersatz Elvish sword of no antiquity whatsoever. Its plastic blade, dull as a bread knife, barely reflects the lantern's dim and uneven glow.
[step 1] Act: blow out brass [step 1] Reward:0, Cumulative Rewards:0 [step 2] Obs: The lantern is now off. You had better hope that it comes back on when you come to your senses.
[step 2] Act: climb up northeast [step 2] Reward:0, Cumulative Rewards:0 [step 3] Obs: As you enter the tunnel, you feel a slight rumble beneath your crepe soled boots. As you turn to look back up to the exit, giant, translucent boulders begin to fall from the walls and ceiling. You duck and cover your head when the dust clears, you feel your heart sink in your chest. You are trapped! Darkness It's pitch black. You are likely to be eaten by a grue.
[step 3] Act: hit brass [step 3] Reward:5, Cumulative Rewards:5 [step 4] Obs: Your rough treatment of the lantern was enough to get it lit again. This must be your lucky day. Tunnel You are in a narrow tunnel whose entrance has been blocked by gigantic, translucent boulders. Your only option is to continue to the northeast, where the tunnel gradually widens. [Your score has just gone up by five points.]
[step 4] Act: climb up northeast [step 4] Reward:0, Cumulative Rewards:5 [step 5] Obs: Cultural Complex This ante room, the center of what was apparently the cultural center of the GUE, is adorned in the ghastly style of the GUE's "Grotesque Period." With leering gargoyles, cartoonish friezes depicting long forgotten scenes of GUE history, and primitive statuary of pointy headed personages unknown (perhaps very, very distant progenitors of the Flatheads), the place would have been best left undiscovered. North of here, a large hallway passes under the roughly hewn inscription "Convention Center." To the east, under a fifty story triumphal arch, a passageway the size of a large city boulevard opens into the Royal Theater. A relatively small and unobtrusive sign (perhaps ten feet high) stands nearby. South, a smaller and more dignified (i.e. post Dimwit) path leads into what is billed as the "Hall of Science."
[step 5] Act: north [step 5] Reward:0, Cumulative Rewards:5 [step 6] Obs: Convention Center Lobby You are standing at the entrance of the GUE Convention Center. On the floor, you can see occasional splotches of paint through the grime and dust. Although the whole cultural center appears to have been long abandoned, loud, low pitched sounds emanate from the door standing open to the north. And while these sounds most closely approximate those made by under lubricated granola fabricators, they nonetheless fill you with dread. Next to the door is a sign, and to the sign's left (northwest to you) is a souvenir stand. The cultural center's lobby lies to the south.
[step 6] Act: northwest [step 6] Reward:0, Cumulative Rewards:5 [step 7] Obs: Souvenir Stand You find yourself standing in front of a souvenir stand that is clearly affiliated with the Convention Center. An odd looking salesman sits behind the counter, his head resting on his hand and his eyes glued to a spot somewhere above and behind you (a quick, frightened glance over your shoulder reveals nothing eye catching; indeed, nothing at all.) A notice on the counter points toward a small, curtained room to the north. On the counter are four candy bars (a Baby Rune, a Multi Implementeers, a ZM100000 and a Forever Gores), some popcorn, some postcards and a chicken finger. You can see a salesman here.
[step 7] Act: drop sword [step 7] Reward:0, Cumulative Rewards:5 [step 8] Obs: True, the sword is not a formidable weapon, but it is the only one you've got. Oh well.
[step 8] Act: consult salesman [step 8] Reward:0, Cumulative Rewards:5 [step 9] Obs: Although not much of a conversationalist, the salesman leans forward to listen. Two things are immediately obvious: 1) He is not a regular user of toothpaste or mouthwash and 2) He is obviously wearing a mask.
[step 9] Act: dig baby with mask [step 9] Reward:8, Cumulative Rewards:13 [step 10] Obs: (first taking the mask) With the salesman's hot, fetid breath threatening to exfoliate multiple layers of facial epidermis, you quickly pull the mask from his face. In hellish harmony, two terrifying screams fill the air: yours, as the first human to see a grue; and his, as prelude to a graphic demonstration of why it is that they aren't seen by light of day (to wit: spontaneous grueish combustion, leaving only a greasy slick on the ground by your feet.) Although you will spend the rest of your life trying to forget what you have just seen, the sight of that sickly glowing fur, fish mouthed face and razor like fingers remains forever etched in your mind.

Figure 1: Sample gameplay for the classic dungeon game Zork1. The objective is to solve various puzzles and collect the 19 treasures to install in the trophy case. The player receives textual observations describing the current game state and additional reward scalars encoding the game designers' objective of game progress. The player sends textual action commands to control the protagonist.

Figure 2: Overview of our approach to solving the IF games as Multi-Paragraph Reading Comprehension (MPRC) tasks.

Figure 4: Learning curves for ablative studies. (left) Model ablative studies on the game Detective. (middle) Model ablative studies on Zork1. (right) Retrieval strategy study on Zork1. Best seen in color.

Figure 5: Learning curves of our MPRC-DQN and RC-DQN agents on various IF games. The best performing baseline results are also included as horizontal lines.

Table 1: Summary of the main technical differences between our agent and the baselines. All agents use DQN to update the model parameters, except KG-A2C, which uses A2C. All agents use the same handicaps.

Table 2: Average game scores on Jericho benchmark games. The best performing agent score per game is in bold.
Obs: [Type "help" for more information about this version] Detective By Matt Barringer. Ported by Stuart Moore. Stuart_Moore@my deja.com Release 1 / Serial number 000715 / Inform v6.21 Library 6/10 SD Chief's office You are standing in the Chief's office. He is telling you "The Mayor was murdered yeaterday night at 12:03 am. I want you to solve it before we get any bad publicity or the FBI has to come in." Yessir!" You reply. He hands you a sheet of paper. Once you have read it, go north or west. You can see a piece of white paper here. [Your score has just gone up by ten points.]
Chief's office You are standing in the Chief's office. He is telling you "The Mayor was murdered yeaterday night at 12:03 am. I want you to solve it before we get any bad publicity or the FBI has to come in." Yessir!" You reply. He hands you a sheet of paper. Once you have read it, go north
Outside You are outside in the cold. To the east is a dead end. To the west is the rest of the street. Papers are blowing around. It's amazingly cold for this time of year.
Mayor's house You are in the house, at the scene of the crime. You enter and flash your badge before a cop. He admits you. To the north is the upstairs. To the east is the living room and to the west is the dining room. [Your score has just gone up by ten points.]
Upstairs hallway You are in the hallway of the large house of the mayor. It is an amazingly large house. You can go north, south, east or west. [Your score has just gone up by ten points.]
Obs: Hallway You are STILL in the hallway. There is EVEN MORE hallway to the north, and a room to the west and a room to the east of you.
[step 12] Act: enter north [step 12] Reward:0, Cumulative Rewards:50 [step 13] Obs: Hallway You are still in the hallway. You can go north to where there is a police officer who will let you outside, or you can go east or west.
[step 13] Act: north [step 13] Reward:10, Cumulative Rewards:60 [step 14] Obs: Outside You pass the guard. He nods at you. You are now outside standing on the street. You can go north and east, your choice. To the north is more of the street, and to the east is a video store. [Your score has just gone up by ten points.]
Obs: McDonalds You are in a McDonalds. You pay the guy behind the counter. Now there is a hamburger there. When you have picked it up, go north. You can see a food hamburger here.
Music Store You are in a music store. You ask the man behind the counter if he knew any information. "Uhh... nope! But the guy back there might be able ta help." You politly thank him and head to the back. You can only go north. [Your score has just gone up by ten points.]
[step 25] Act: north [step 25] Reward:10, Cumulative Rewards:110 [step 26] Obs: Back of Music Store You are in the back of the music store. You ask the guy who's looking at the cool tapes. He looks up at you. "Duh.. no... don't t'ink so... lemme see..." You decide that he's no help. To the west there is a dazed looking man and to the north there is an exit. [Your score has just gone up by ten points.]
[step 26] Act: enter north [step 26] Reward:10, Cumulative Rewards:120 [step 27] Obs: Alley You are in an alley. A drunken man stagers up to you and says "Boycott FDR! HICKUP .". You just walk away. You can go north, east or west. Your call. [Your score has just
Police Station You are in the 3rd precinct police station. This isn't your station. You get admitance from the guy at the desk and go to the holding cells. You ask each offender if they know anything. You promise a lighter sentence for the ones who help. But one guy really sets you straight. "I got caught wit' t'ree ounces o' crack. I'm supposed to get 20 years but I'll be out in 2. You can't make me talk cuz it don't matter to me. If I squeal, da guys who did it are gonna come lookin' for me. I know but I ain't gonna tell ya. Now git outta my face.". You are surprised but used to it. You can go north to the outside, south to go back to the alley and west or east to talk to more guys. [Your score has just gone up by ten points.]
[step 28] Act: north [step 28] Reward:10, Cumulative Rewards:140 [step 29] Obs: Outside You are outside. it's bitter cold and you pull your jacket around yourself. To the north is a nice, warm Holiday Inn hotel, where the killer is rumoured to be staying. Or you could go to his favourite hang out, the Wall, to the west, or to the east is the place where he is supposed to be working, the Doughnut King. [Your score has just gone up by ten points.]
[step 29] Act: east [step 29] Reward:10, Cumulative Rewards:150 [step 30] Obs: Doughnut King You are in the Doughnut King, where the greasiest doughnuts on earth reside. He isn't here, no one seems to be for that matter, so you should go west. [Your score has just gone up by ten points.]
[step 30] Act: enter west [step 30] Reward:0, Cumulative Rewards:150 [step 31] Obs: Outside You are outside. it's bitter cold and you pull your jacket around yourself.
Obs: Holiday Inn 15th Floor You go up the elevator. When you step out, you see the wallpaper is pink, with little flowers on it. You can go east or west. [Your score has just gone up by ten points.]
[step 33] Act: enter west [step 33] Reward:10, Cumulative Rewards:180 [step 34] Obs: Hallway You are in the hallway. You see many doors... 1... 2... 3... 4... 5... 6... 7... boy, you have a long way to go. You can only go north. [Your score has just gone up by ten points.]
Hallway You are in the hallway. You see numbers flash by as you run through the halls. 19... 20... 21... 22... you are getting close! You can only go north. [Your score has just gone up by ten points.]
[step 36] Act: north [step 36] Reward:10, Cumulative Rewards:210 [step 37] Obs: Hallway You are in the hallway. You feel the heat from the sauna to the west, and to the east is a door marked "Pool A". To the north is more hall. [Your score has just gone up by ten points.]
[step 37] Act: enter north [step 37] Reward:100, Cumulative Rewards:310 [step 38] Obs: Room 30 You enter room 30... after a harrowing gun battle you conk him on the head and take him in. You get promoted and suddenly, with the "Jurrasic Park" theme song playing in your head, you feel proud to be an American. For special info about Exile Games, and to leave this darned game, go up. [Your score has just gone up by one hundred points.]
The little man takes the coin and leaves you the cardboard box. He runs off, laughing... [Your score has just gone up by one point.]
[step 5] Act: east [step 5] Reward:0, Cumulative Rewards:1 [step 6] Obs: Inside the Inn There are a few patches of mould on the walls and the floorboards creak slightly, but the building seems safe enough. As you move you disturb a thin layer of dust. There are some stone steps leading down, and an open doorway to the outside to the west. You see some tables and chairs. There is a dark, intricately carved wooden chair in the corner behind you. You can go west (back outside) or down some steps. You can see a magnificently carved chair here.
Obs: Castle Forge The ruined forge stands just outside the castle. There is no roof, and the remains of thick stone walls are open to the sky. The castle itself is closeby to the north, its stone towers soaring upwards. There is desolate wasteland stretching away to the south. You can go north (towards the castle itself), or south (to some wasteland). You can see a brass lamp here.
Obs: (to the huge Troll with a club) The Troll stops in surprise and looks at you curiously. "You're not another of these sneaky thieves, then...?" he says. "No one has bothered to speak to me for years. Nowadays, people just come to steal and kill. Thank you for finding my bell. My old hunting horn may be useful to you, I'll let you borrow it, if you like. I'll be off, then." The Troll leaves you his huge hunting horn, then walks off. [Your score has just gone up by five points.]
Tree Stump The twisting path leads through an ancient forest. Red and blue butterflies dance above the grass and you can hear the soft calls of wood pigeons in the trees. The leaves rustle gently in the morning breeze. The pathway leads east deeper into the forest and continues north, winding into the trees. There is an old tree stump next to the path where once a great oak tree must have stood. It seems to be hollow in the centre. You can go north (along a path) or east (further into the forest). You can see a hollow tree stump here.
Castle Forge The ruined forge stands just outside the castle. There is no roof, and the remains of thick stone walls are open to the sky. The castle itself is closeby to the north, its stone towers soaring upwards. There is desolate wasteland stretching away to the south. You can go north (towards the castle itself), or south (to some wasteland). You can see a brass lamp here.
Obs: Forbidding Castle The hulk of a massive, stone castle towers into the sky above you. The walls are covered in grey lichen and moss, and seem to have stood here since the beginning of time itself. Set into the wall to the north you can see a large oak door, dark and weathered. There is a keyhole
It is with great hope and expectation I send you on this mission," the Grand Inquisitor intones, his voice booming like distant, rumbling thunderheads. "We have discovered what appears to be a piece of the Great Underground Empire, never before explored. I would have sent Grubald the Bold, but he's busy. As is Matchlick the Mighty. So, 'tis you." "What of Linklaw the Lucky?" you ask shyly. "Laid up. I'm afraid 'tis you and none other. A simple, solo quest, one of fun and mirth, mischief and merriment, a jaunt, an excursion, a simple outing. Merely explore, enjoy yourself, and bring back news
You throw the mask away in horror at what you've just seen. [Your score has just gone up by eight points.]
[step 10] Act: dig baby with forever [step 10] Reward:1, Cumulative Rewards:14 [step 11] Obs: (first taking the Forever Gores) [Your score has just gone up by one point.] Digging would achieve nothing here.