Keep CALM and Explore: Language Models for Action Generation in Text-based Games

Text-based games present a unique challenge for autonomous agents to operate in natural language and handle enormous action spaces. In this paper, we propose the Contextual Action Language Model (CALM) to generate a compact set of action candidates at each game state. Our key insight is to train language models on human gameplay, where people demonstrate linguistic priors and a general game sense for promising actions conditioned on game history. We combine CALM with a reinforcement learning agent which re-ranks the generated action candidates to maximize in-game rewards. We evaluate our approach using the Jericho benchmark, on games unseen by CALM during training. Our method obtains a 69% relative improvement in average game score over the previous state-of-the-art model. Surprisingly, on half of these games, CALM is competitive with or better than other models that have access to ground truth admissible actions. Code and data are available at https://github.com/princeton-nlp/calm-textgame.


Introduction
Text-based games have proven to be useful testbeds for developing agents that operate in language. As interactions in these games (input observations, action commands) are through text, they require solid language understanding for successful gameplay. While several reinforcement learning (RL) models have been proposed recently (Narasimhan et al., 2015;He et al., 2015;Hausknecht et al., 2019a;Ammanabrolu and Riedl, 2019), combinatorially large action spaces continue to make these games challenging for these approaches.
The action space problem is exacerbated by the fact that only a tiny fraction of action commands are admissible in any given game state. An admissible action is one that is parseable by the game engine and changes the underlying game state. For example, in Figure 1, one can observe that randomly sampling actions from the game vocabulary leads to several inadmissible ones like 'north a' or 'eat troll with egg'. Thus, narrowing down the action space to admissible actions requires both syntactic and semantic knowledge, making it challenging for current systems.

Figure 1: Sample gameplay from ZORK1 along with action sets generated by two variants of CALM. The game recognizes a vocabulary of 697 words, resulting in more than 697^4 ≈ 200 billion potential 4-word actions. 'move rug' is the optimal action to take here and is generated by our method as a candidate.
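As a rough check of the combinatorics above, the size of the naive action space can be computed directly (a minimal sketch; the exact count depends on how the game parser tokenizes commands):

```python
# Naive 4-word action space for a 697-word vocabulary: every sequence
# of 4 vocabulary tokens is a potential command.
vocab_size = 697
max_action_len = 4
num_actions = vocab_size ** max_action_len
print(f"{num_actions:,}")  # 236,010,384,481 -> more than 200 billion
```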
Further, even within the space of admissible actions, it is imperative for an autonomous agent to know which actions are most promising to advance the game forward, and explore them first. Human players innately display such game-related common sense. For instance in Figure 1, players might prefer the command "move rug" over "knock on door" since the door is nailed shut. However, even the state-of-the-art game-playing agents do not incorporate such priors, and instead rely on rule-based heuristics (Hausknecht et al., 2019a) or handicaps provided by the learning environment (Hausknecht et al., 2019a;Ammanabrolu and Hausknecht, 2020) to circumvent these issues.
In this work, we propose the Contextual Action Language Model (CALM) to alleviate this challenge. Specifically, at each game step we use CALM to generate action candidates, which are fed into a Deep Reinforcement Relevance Network (DRRN) (He et al., 2015) that uses game rewards to learn a value function over these actions. This allows our model to combine generic linguistic priors for action generation with the ability to adaptively choose actions that are best suited for the game.
To train CALM, we introduce a novel dataset of 426 human gameplay transcripts for 590 different text-based games. While these transcripts are noisy and actions are not always optimal, they contain a substantial amount of linguistic priors and game sense. Using this dataset, we train a single instance of CALM and deploy it to generate actions across many different downstream games. Importantly, in order to demonstrate the generalization of our approach, we do not use any transcripts from our evaluation games to train the language model. We investigate both n-gram and state-of-the-art GPT-2 (Radford et al., 2019) language models and first evaluate the quality of generated actions in isolation by comparing against ground-truth sets of admissible actions. Subsequently, we evaluate the quality of CALM in conjunction with RL over 28 games from the Jericho benchmark (Hausknecht et al., 2019a). Our method outperforms the previous state-of-the-art method by 69% in terms of average normalized score. Surprisingly, on 8 games our method even outperforms competing methods that use the admissible action handicap. For example, in the game of INHUMANE, we achieve a score of 25.7 while the state-of-the-art KG-A2C agent (Ammanabrolu and Hausknecht, 2020) achieved 3.
In summary, our contributions are two-fold. First, we propose a novel learning-based approach for reducing enormous action spaces in text-based games using linguistic knowledge. Second, we introduce a new dataset of human gameplay transcripts, along with an evaluation scheme to measure the quality of action generation in these games.

Related Work
Reinforcement Learning for Text-based Games Early work on text-based games (Narasimhan et al., 2015; He et al., 2015) developed RL agents on synthetic environments with small, pre-defined text action spaces. Even with small action spaces (e.g. < 200 actions), approaches to filter inadmissible actions (Zahavy et al., 2018; Jain et al., 2019) led to faster learning convergence. Recently, Hausknecht et al. (2019a) introduced Jericho, a benchmark of challenging man-made text games. These games contain significantly greater linguistic variation and larger action spaces compared to frameworks like TextWorld (Côté et al., 2018).
To assist RL agents, Jericho provides a handicap that identifies admissible actions at each game state. This has been used by approaches like DRRN (He et al., 2015) as a reduced action space. Other RL agents like TDQN (Hausknecht et al., 2019a) and KGA2C (Ammanabrolu and Hausknecht, 2020) rely on the handicap for an auxiliary training loss. In general, as these RL approaches lack linguistic priors and only learn through in-game rewards, they are reliant on the admissible-action handicap to make the action space tractable to explore.
Linguistic Priors for Text-based Games A different line of work has explored various linguistic priors for generating action commands. Fulda et al. (2017) used Word2vec (Mikolov et al., 2013) embeddings to infer affordance properties (i.e. verbs suitable for an object). Other approaches (Kostka et al., 2017; Hausknecht et al., 2019b) trained simple n-gram language models to learn affordances for action generation. Perhaps most similar to our work is that of Tao et al. (2018), who trained seq2seq (Sutskever et al., 2014) models to produce admissible actions in synthetic TextWorld (Côté et al., 2018) games. In a slightly different setting, Urbanek et al. (2019) trained BERT (Devlin et al., 2018) to generate contextually relevant dialogue utterances and actions in fantasy settings. However, these approaches are game-specific and do not use any reinforcement learning to optimize gameplay. In contrast, we combine strong linguistic priors with reinforcement learning, and use a modern language model that can generate complex actions and flexibly model the dependency between actions and contexts. We also train on multiple games and generalize to unseen games.

Generation in Text-based Games and Interactive Dialog Besides solving games, researchers have also used language models to create text-based games. Ammanabrolu et al. (2019) used Markov chains and neural language models to procedurally generate quests for TextWorld-like games. AI Dungeon 2 (Walton, 2019) used GPT-2 to generate narrative text in response to arbitrary text actions, but lacked temporal consistency over many steps.
More broadly, the concept of generating candidates and re-ranking has been studied in other interactive language tasks such as dialogue (Zhao and Eskenazi, 2016; Williams et al., 2017; Song et al., 2016; Chen et al., 2017) and communication games (Lazaridou et al., 2020). These approaches often focus on improving aspects like fluency and accuracy of the generated utterances, whereas our re-ranking approach only aims to maximize future rewards in the task. Also, our CALM pre-trained model generalizes to new environments without requiring any re-training.

Background
A text-based game can be formally specified as a partially observable Markov decision process (POMDP) (S, T, A, O, R, γ), where a player issues text actions a ∈ A and receives text observations o ∈ O and scalar rewards r = R(s, a) at each step. Different games have different reward designs, but typically provide sparse positive rewards for solving key puzzles and advancing the story, and negative rewards for dying. γ ∈ [0, 1] is the reward discount factor. Latent state s ∈ S contains the current game information (e.g. locations of the player and items, the player's inventory), which is only partially reflected in o. The transition function s' = T(s, a) specifies how action a is applied on state s, and a is admissible at state s if T(s, a) ≠ s (i.e. if it is parseable by the game and changes the state). S, T and R are not provided to the player.

Reinforcement Learning One approach to developing text-based game agents is reinforcement learning (RL). The Deep Reinforcement Relevance Network (DRRN) (He et al., 2015) is an RL algorithm that learns a Q-network Q_φ(o, a) parametrized by φ. The model encodes the observation o and each action candidate a using two separate encoders f_o and f_a (usually recurrent neural networks such as GRU (Cho et al., 2014)), and then aggregates the representations to derive the Q-value through a decoder g:

Q_φ(o, a) = g(f_o(o), f_a(a)).   (1)

For learning φ, tuples (o, a, r, o') of observation, action, reward and the next observation are sampled from an experience replay buffer and the following temporal difference (TD) loss is minimized:

L_TD(φ) = (r + γ max_{a' ∈ A} Q_φ(o', a') − Q_φ(o, a))².   (2)

During gameplay, a softmax exploration policy is used to sample an action:

π(a|o) = exp(Q_φ(o, a)) / Σ_{a' ∈ A} exp(Q_φ(o, a')).   (3)

While the above equation contains only a single observation, this can also be extended to a policy π(a|c) conditioned on a longer context c = (o_1, a_1, ..., o_t) of previous observations and actions till the current time step t. Note that when the action space A is large, (2) and (3) become intractable.
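The DRRN scoring and softmax exploration described above can be sketched in a few lines. This is a toy illustration only: the hash-based bag-of-words encoder and dot-product aggregator below are hypothetical stand-ins for the paper's GRU encoders f_o, f_a and learned decoder g.

```python
import math

def encode(text, dim=16):
    """Toy bag-of-words hash encoder standing in for a GRU encoder."""
    v = [0.0] * dim
    for tok in text.split():
        v[hash(tok) % dim] += 1.0
    return v

def q_value(obs, act):
    """Q(o, a) = g(f_o(o), f_a(a)); here g is just a dot product."""
    fo, fa = encode(obs), encode(act)
    return sum(x * y for x, y in zip(fo, fa))

def softmax_policy(obs, actions):
    """pi(a|o) proportional to exp(Q(o, a)), over the candidate set."""
    qs = [q_value(obs, a) for a in actions]
    m = max(qs)                      # subtract max for numerical stability
    exps = [math.exp(q - m) for q in qs]
    z = sum(exps)
    return [e / z for e in exps]

def td_target(reward, next_obs, next_actions, gamma=0.9):
    """The TD target r + gamma * max_a' Q(o', a') from the loss above."""
    return reward + gamma * max(q_value(next_obs, a) for a in next_actions)
```

Note that both the policy and the TD target iterate over a candidate action set; when that set is the full combinatorial action space A, they become intractable, which motivates CALM.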

Contextual Action Language Model (CALM)
To reduce large action spaces and make learning tractable, we train language models to generate compact sets of action candidates. Consider a dataset D of N trajectories of human gameplay across different games, where each trajectory of length l consists of interleaved observations and actions (o_1, a_1, o_2, a_2, ..., o_l, a_l). The context c_t at timestep t is defined as the history of observations and actions, i.e. c_t = (o_1, a_1, ..., a_{t−1}, o_t). In practice, we find that a window size of 2 works well, i.e. c_t = (o_{t−1}, a_{t−1}, o_t). We train parametrized language models p_θ to generate actions a conditioned on contexts c. Specifically, we use all N trajectories and minimize the following cross-entropy loss:

L(θ) = − Σ_{(c, a)} log p_θ(a|c).   (4)

Since each action a is typically a multi-word phrase consisting of m tokens a^1, a^2, ..., a^m, we can further factorize the right hand side of (4) as:

p_θ(a|c) = Π_{i=1}^{m} p_θ(a^i | c, a^1, ..., a^{i−1}).   (5)

Thus, we can simply use the cross-entropy loss over each token a^i in action a during training. We investigate two types of language models:

1. n-gram: This model simply uses n-gram counts from actions in D to model the following probability:

p(a^i | a^{i−n+1}, ..., a^{i−1}) = (cnt(a^{i−n+1}, ..., a^i) + α) / (cnt(a^{i−n+1}, ..., a^{i−1}) + α|V|),

where cnt(a^i, ..., a^j) counts the number of occurrences of the action sub-sequence (a^i, ..., a^j) in the training set, α is a smoothing constant, and V is the token vocabulary. Note that this model is trained in a context-independent way and only captures basic linguistic structure and common affordance relations observed in human actions. We optimize the parameters (n, α) to minimize the perplexity on a held-out validation set of actions.
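The smoothed n-gram estimate above is straightforward to implement; a minimal sketch for the bigram case (with a tiny toy action set and an illustrative α, not the tuned values from the paper):

```python
from collections import Counter

# Toy training actions; in the paper these come from ClubFloyd transcripts.
actions = [["open", "door"], ["open", "window"], ["take", "egg"]]

unigrams = Counter(tok for a in actions for tok in a)
bigrams = Counter((a[i], a[i + 1]) for a in actions for i in range(len(a) - 1))
vocab = set(unigrams)

def p_bigram(word, prev, alpha=0.1):
    """p(a_i | a_{i-1}) = (cnt(a_{i-1}, a_i) + alpha) / (cnt(a_{i-1}) + alpha * |V|)."""
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * len(vocab))
```

The additive-α smoothing guarantees every token sequence gets nonzero probability, and the denominator keeps the distribution over the vocabulary normalized.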
To generate top actions given context c, we construct a restricted action space A_c = V × B_c, where V is the set of verb phrases (e.g. open, knock on) collected from training actions, and B_c is the set of nouns (e.g. door) detected in c using spaCy's noun-phrase detection. Then we calculate p_(n,α)(a) for each a ∈ A_c and choose the top ones.
2. GPT-2 (Radford et al., 2019): We use a pretrained GPT-2 and train it on D according to (4) and (5). Unlike the previous n-gram model, GPT-2 helps model dependencies between the context and the action in a flexible way, relying on minimal assumptions about the structure of actions. We use beam search to generate most likely actions.

Reinforcement Learning with CALM
Though language models learn to generate useful actions, they are not optimized for gameplay performance. Therefore, we use CALM to generate top-k action candidates A_LM(c, k) ⊂ A given context c, and train a DRRN to learn a Q-function over this action space. This can be done by simply replacing A with A_LM(c, k) in equations (2) and (3). In this way, we combine CALM's generic action priors with the ability of RL to learn policies optimized for gameplay. We choose not to fine-tune CALM in RL so as to avoid overfitting to a specific game and invalidating the general priors present in CALM.
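The resulting interaction between the frozen language model and the RL agent amounts to a generate-then-rerank step. The sketch below shows the interface, with `calm_generate` and `q_value` as hypothetical stand-ins for the trained CALM model and the DRRN Q-network:

```python
def calm_generate(context, k=30):
    """Stand-in for the frozen CALM model's top-k action candidates."""
    pool = ["move rug", "open door", "north", "take egg", "knock on door"]
    return pool[:k]

def q_value(context, action):
    """Toy Q-function: score by word overlap with the context."""
    return len(set(context.split()) & set(action.split()))

def act(context, k=30):
    # CALM reduces the action space; the RL agent re-ranks within it.
    candidates = calm_generate(context, k)
    return max(candidates, key=lambda a: q_value(context, a))
```

Because CALM is never fine-tuned during RL, the same candidate generator can be reused across games; only the Q-function adapts to each game's rewards.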
To summarize, we employ CALM for providing a reduced action space for text adventure agents to explore efficiently. Even though we choose a specific RL agent (DRRN) in our experiments, CALM is simple and generic, and can be combined with any RL agent.

ClubFloyd Dataset We crawl 426 transcripts covering 590 games (in some transcripts people play more than one game), and build a dataset of 223,527 context-action pairs {((o_{t−1}, a_{t−1}, o_t), a_t)}. We pre-process the data by removing samples with meta-actions (e.g. 'save', 'restore') or observations with over 256 tokens. Figure 3 visualizes the action and observation length distributions. We also note that a few common actions (e.g. 'north', 'take all', 'examine') make up a large portion of the data. More details on the dataset are in the supplementary material.
Game Environment To test our RL agents, we use 28 man-made text games from the Jericho framework (Hausknecht et al., 2019a). We augment state observations with location and inventory descriptions by issuing the 'look' and 'inventory' commands, following the standard practice described in Hausknecht et al. (2019a). The Jericho framework implements an admissible action handicap by enumerating all combinations of game verbs and objects at each state, and testing each action's admissibility by accessing the underlying simulator states and load-and-save functions. As a result, the handicap runs no faster than a GPT-2 inference pass, and could in fact be unavailable for games outside Jericho. Jericho also provides an optimal walkthrough trajectory to win each game. Table 1 provides statistics of the ClubFloyd Dataset and the Jericho walkthroughs. We observe that ClubFloyd has a much larger vocabulary and a diverse set of games, which makes it ideal for training CALM. We utilize Jericho walkthroughs in our standalone evaluation of CALM in § 5.1.

CALM Setup
Training For training CALM (n-gram), we condition only on the current observation, i.e. c_t = o_t instead of c_t = (o_{t−1}, a_{t−1}, o_t), since o_{t−1} and a_{t−1} may contain objects irrelevant to the current state. We split the dataset into a 90% training set and a 10% validation set, and choose n and α based on the validation set perplexity. We find a bi-gram model with n = 2, α = 0.00073 works best, achieving a per-action perplexity of 863,808 on the validation set and 17,181 on the training set. For CALM (GPT-2), we start with a 12-layer, 768-hidden, 12-head, 117M parameter GPT-2 model pre-trained on the WebText corpus (Radford et al., 2019). The implementation and pretrained weights of this model are obtained from Wolf et al. (2019). We then train it on the ClubFloyd transcripts for 3 epochs to minimize (4). We split the dataset into a 90% training set and a 10% validation set, and obtain a training loss of 0.25 and a validation loss of 1.98. Importantly, both models are trained only on transcripts that do not overlap with the 28 Jericho games we evaluate on.
Generating Top Actions For every unique state of each game, we generate the top k = 30 actions. For CALM (n-gram), we enumerate all actions in A c plus 13 one-word directional actions (e.g. 'north', 'up', 'exit'). To encourage action diversity, at most 4 actions are generated for each object b ∈ B c . For CALM (GPT-2), we use beam search with a beam size of 40, and then choose the top 30 actions.

RL Agent Setup
Training We use DRRN (He et al., 2015) to estimate Q-values over action candidates generated by CALM. Following Hausknecht et al. (2019a), we use a FastText model (Joulin et al., 2017) to predict the admissibility of an action based on the game's textual response and filter out candidate actions that are found to be inadmissible. We train the DRRN asynchronously on 8 parallel instances of the game environment for 10^6 steps in total. Following Narasimhan et al. (2015), we use a separate experience replay buffer to store trajectories with the best score at any point of time. The final score of a training run is taken to be the average score of the final 100 episodes during training. For each game, we train five independent agents with different random seeds and report the average score. For model variants in § 5.3 we only run one trial.
Baselines We compare with three baselines: 1. NAIL (Hausknecht et al., 2019b): Uses handwritten rules to act and explore, therefore requires no reinforcement learning or oracle access to admissible actions.
2. DRRN (He et al., 2015): This RL agent described in § 3.1 uses ground-truth admissible actions provided by the Jericho handicap.
3. KG-A2C (Ammanabrolu and Hausknecht, 2020): This RL agent constructs a game knowledge graph to augment the state space as well as constrain the types of actions generated. During learning, it requires the admissible action handicap to guide its exploration of the action space.
Of these methods, DRRN and KG-A2C require ground-truth admissible actions, which our model does not use, but we add them as reference comparisons for completeness.

Evaluating CALM on walkthroughs
Metrics like loss or accuracy on the validation set of our ClubFloyd data are not sufficient to evaluate CALM (see supplementary material for details on these metrics). This is because: 1) there can be multiple admissible actions in each state, and 2) the human actions in the trajectories are not guaranteed to be optimal or even admissible. Therefore, we use the walkthroughs provided in Jericho to provide an additional assessment of the quality of actions generated by CALM.
Consider a walkthrough to be an optimal trajectory (o 1 , a 1 , · · · , o l , a l ) leading to the maximum score achievable in the game.
At each step t of the walkthrough, the gold action is a_t and the full set of admissible actions A_t is obtained from the Jericho handicap. Suppose the generated set of top-k actions at step t is A_LM(c_t, k). We then calculate the average precision of admissible actions (prec_a), recall of admissible actions (rec_a), and recall of gold actions (rec_g) as follows:

prec_a(k) = (1/l) Σ_t |A_LM(c_t, k) ∩ A_t| / |A_LM(c_t, k)|,
rec_a(k) = (1/l) Σ_t |A_LM(c_t, k) ∩ A_t| / |A_t|,
rec_g(k) = (1/l) Σ_t 1[a_t ∈ A_LM(c_t, k)].

We calculate these metrics on each of the 28 games and present the averaged metrics as a function of k in Figure 4. The rec_a curve shows that the top k = 15 actions of CALM (GPT-2 and n-gram) are both expected to contain around 30% of all admissible actions in each walkthrough state. However, when k goes from 15 to 30, CALM (GPT-2) can come up with 10% more admissible actions, while the gains are limited for CALM (n-gram). When k is small, CALM (n-gram) benefits from its strong action assumption of one verb plus one object. However, this assumption also restricts CALM (n-gram) from generating more complex actions (e.g. 'open case with key') that CALM (GPT-2) can produce. This can also be seen in the rec_g curve, where the top-30 actions from CALM (GPT-2) contain the gold action in 20% more game states than CALM (n-gram). This gap is larger when it comes to gold actions, because they are more likely to be complex actions that CALM (n-gram) is unable to model. Further, we note that as k increases, the average quality of the actions decreases (prec_a curve), while they contain more admissible actions (rec_a curve). Thus, k plays an important role in balancing exploration (more admissible actions) with exploitation (a larger ratio of admissible actions) for the RL agent, which we demonstrate empirically in § 5.3. We provide several examples of generated actions from both models in the supplementary material.

Table 2: Performance of our models (CALM (GPT-2) and CALM (n-gram)) compared to baselines (NAIL, KG-A2C, DRRN) on Jericho. We report raw scores for individual games as well as average normalized scores (avg. norm). Advent and Deephome's initial scores are 1 and 36, respectively. Underlined games represent those where CALM outperforms the handicap-assisted methods KG-A2C and DRRN.
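The per-step computation behind these averaged metrics is a simple set comparison. A minimal sketch for one walkthrough step (toy actions, not from any real game state):

```python
def step_metrics(generated, admissible, gold):
    """Per-step precision/recall of admissible actions and gold-action recall."""
    gen, adm = set(generated), set(admissible)
    prec_a = len(gen & adm) / len(gen) if gen else 0.0
    rec_a = len(gen & adm) / len(adm) if adm else 0.0
    rec_g = 1.0 if gold in gen else 0.0
    return prec_a, rec_a, rec_g

p, r, g = step_metrics(
    generated=["move rug", "north", "eat egg"],
    admissible=["move rug", "north", "open door", "east"],
    gold="move rug",
)
```

Averaging these per-step values over all walkthrough steps of a game yields the curves plotted as a function of k.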

Evaluating gameplay on Jericho
We provide scores of our CALM-augmented DRRN agent on individual games in Table 2. To account for the different score scales across games, we consider both the raw score and the normalized score (raw score divided by maximum score), and report the average normalized score across games.
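Concretely, the average normalized score is computed as follows (a small sketch with toy numbers, not values from Table 2):

```python
def avg_normalized(results):
    """Average of raw_score / max_score over a list of (raw, max) pairs."""
    return sum(raw / mx for raw, mx in results) / len(results)

# Toy example: three games with different score scales.
score = avg_normalized([(25.7, 90.0), (60.0, 100.0), (0.0, 50.0)])
```

Normalizing first keeps a single high-scale game from dominating the cross-game average.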
Of the handicap-free models, CALM (n-gram) achieves similar performance to NAIL, while CALM (GPT-2) outperforms CALM (n-gram) and NAIL by 4.4% and 3.8% in absolute normalized scores, respectively. Relatively, this represents almost a 69% improvement over NAIL. Figure 5 presents a game-wise comparison between CALM (GPT-2) and NAIL. Surprisingly, even when compared to handicap-assisted models, CALM (GPT-2) performs quite well. On 8 out of 28 games (underlined in Table 2), CALM (GPT-2) outperforms both DRRN and KG-A2C despite the latter having access to ground-truth admissible actions. This improvement is especially impressive on games like DETECTIVE, INHUMANE and SNACKTIME, where our normalized score is higher by more than 20%. We hypothesize CALM excludes some non-useful admissible actions like "throw egg at sword" that humans never issue, which can speed up exploration. Also, it is possible that CALM sometimes discovers admissible actions that even the handicap cannot (due to imperfections in its state change detection).

Analysis
What Factors Contribute to Gameplay? We now analyze various components and design choices made in CALM (GPT-2). First, we investigate how much of the model's performance is due to pre-training on text corpora as opposed to training on our ClubFloyd data. Then, we vary the number of actions (k) generated by the model. We also consider combining CALM with a random agent instead of RL. This leads us to the following variants:
1. CALM (X%): These variants are trained with only X% of the transcripts from ClubFloyd. X = 0 is equivalent to using a pre-trained GPT-2 model off-the-shelf; we find that this fails to produce actions that are even parseable by the game engine and therefore is not reported in the table.
2. CALM (w/ Jericho): This variant is trained on additional ClubFloyd data that includes 8 scripts from games contained in Jericho.
3. CALM (w/o PT): This is a randomly initialized GPT-2 model, instead of a pre-trained one, trained on ClubFloyd data. We train this model for 10 epochs until the validation loss converges, unlike previous models which we train for 3 epochs.
4. CALM (k = Y): This is a model variant that produces action sets of size Y.
5. CALM (random agent): This model variant replaces DRRN with a random agent that samples uniformly from CALM's top-30 actions at each state.
As shown in Table 3, the significant drop in score for CALM without pretraining shows that both pre-training and ClubFloyd training are important for gameplay performance. Pre-training provides general linguistic priors that regularize action generation while the ClubFloyd data conditions the model towards generating actions useful in text-based games.
Adding heldout transcripts from Jericho evaluation games (CALM w/ Jericho) provides additional benefit as expected, even surpassing the handicap-assisted KG-A2C in terms of the average normalized score. Counter-intuitively, we find that the greatest performance gains are not on games featured in the heldout transcripts. See supplementary material for more details.
For the models with different k values, CALM (k = 10) is much worse than the other choices, but similar to CALM (n-gram) in Table 2. Note that in Figure 4 the recall of admissible actions is similar between GPT-2 and n-gram when k ≤ 10. We believe this is because the top-10 GPT-2 actions are usually simple actions that occur frequently in ClubFloyd (e.g. 'east', 'get object'), which the n-gram model can also capture. It is the complex actions captured when k > 10 that make GPT-2 much better than n-gram. On the other hand, though k = 20, 30, 40 achieve similar overall performance, they achieve different results on different games. Potentially, CALM's overall performance could be improved further by choosing a different k for each game. Finally, CALM (random agent) achieves a poor score of 1.8%, which clearly shows the importance of combining CALM with an RL agent to adaptively choose actions.
Is CALM limiting RL? A natural question to ask is whether reducing the action space using CALM results in missing key actions that may have led to higher scores in the games. To answer this, we also plot the maximum scores seen by our CALM (GPT-2) agent during RL in Figure 6. Some games (e.g. 905, ACORNCOURT) are intrinsically hard to achieve any score. However, on other games with non-zero scores, DRRN is unable to stably converge to the maximum score seen in RL exploration. If RL can fully exploit and learn from the trajectories experienced under the CALM action space for each game, the average normalized score would be 14.7%, higher than any model in Table 2, both with and without handicaps.

Conclusion
In this paper, we proposed the Contextual Action Language Model (CALM), a language model approach to generating action candidates for reinforcement learning agents in text-based games. Our key insight is to use language models to capture linguistic priors and game sense from human gameplay on a diverse set of games. We demonstrated that CALM can generate high-quality, contextually relevant actions even for games unseen in its training set, and when paired with a DRRN agent, outperforms previous approaches on the Jericho benchmark (Hausknecht et al., 2019a) by as much as 69% in terms of average normalized score. Remarkably, on many of these games, our approach is competitive even with models that use ground truth admissible actions, implying that CALM is able to generate high-quality actions across diverse games and contexts.
From the results in Table 2, it is safe to conclude that text-based games are still far from being solved. Even with access to ground truth admissible actions, sparse rewards and partial observability pose daunting challenges for current agents. In the future, we believe that strong linguistic priors will continue to be a key ingredient for building next-level learning agents in these games. By releasing our dataset and code we hope to provide a solid foundation to accelerate work in this direction.

A ClubFloyd Dataset
The ClubFloyd transcripts we collected are gameplay logs generated by a group of people that regularly meet to play interactive fiction games. The participants are experienced at playing text-based games; however, they may not be familiar with the game that is being played, and they do make several mistakes. We include a snippet of a transcript in Figure 7. We crawled the ClubFloyd website to acquire 426 transcripts, spanning over 500 games.
To process a transcript, we clean the data and extract observations and actions. The data contains several sources of noise, which we remove: the first is non-game information such as chat logs between the humans playing the games; second are meta-actions that humans use to save and load games and navigate menus; and finally, we remove typos, expand common abbreviations ("n" to "north", "x" to "examine", etc.), and filter out any actions that were not recognized by the game parsers.
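The cleaning steps above can be sketched as a small filter over raw action strings. The abbreviation and meta-action lists below are illustrative subsets, not the exact lists used for the dataset:

```python
# Illustrative subsets of the expansion and meta-action lists described above.
ABBREV = {"n": "north", "s": "south", "e": "east", "w": "west", "x": "examine"}
META = {"save", "restore", "quit", "undo"}

def clean_action(raw):
    """Expand abbreviations; return None for meta-actions so they are dropped."""
    tokens = raw.strip().lower().split()
    if not tokens or tokens[0] in META:
        return None
    return " ".join(ABBREV.get(t, t) for t in tokens)
```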
Once we have our cleaned observations and actions, we group them into the form ((o_{j−1}, a_{j−1}, o_j), a_j). For the very first observation and action, we pad the beginning of the example with the observation "You are at the start of your journey" and the action "begin journey".
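The grouping and padding above can be sketched as:

```python
# Padding values quoted from the text above.
PAD_OBS = "You are at the start of your journey"
PAD_ACT = "begin journey"

def make_examples(observations, actions):
    """Pair each action a_j with context (o_{j-1}, a_{j-1}, o_j)."""
    obs = [PAD_OBS] + observations
    acts = [PAD_ACT] + actions
    return [((obs[j - 1], acts[j - 1], obs[j]), acts[j])
            for j in range(1, len(obs))]
```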
After this entire pre-processing, the dataset contains 223,527 examples.

B CALM Training
In this section, we will provide training details of CALM (GPT-2), CALM (n-gram), and their variants.

B.1 CALM (GPT-2)
We first discuss the CALM (GPT-2) models, beginning with the portion of the ClubFloyd data they are trained on. We start from a 12-layer, 768-hidden, 12-head, 117M parameter pretrained OpenAI GPT-2 model. We note that the number of samples we train on, even in the CALM (GPT-2) + Jericho games variant, is less than the total samples in the dataset. This is because we do not train on incomplete batches of data, and we omit samples that exceed 256 tokens.
CALM (GPT-2) To train CALM (GPT-2), we take transcripts from ClubFloyd (excluding Jericho games) and order the samples based on the transcript number they came from. This yields a dataset of 193,588 samples. We select the first 90% of the samples as train data, and the last 10% of the samples as validation data.
CALM (GPT-2) 50%, 20%, (+) Jericho To train the 50% and 20% variants, we select without replacement 212 transcripts (94,609 samples), and 85 transcripts (38,334 samples) respectively from the ClubFloyd transcripts (excluding Jericho games). We order the samples based on the transcript they come from, choose the first 90% of the data as our training data and last 10% as validation data.
For the CALM (GPT-2) variant including Jericho games, we include every ClubFloyd transcript, we randomly order the transcripts, order the samples based on the order of the transcripts, and then we select the first 90% of the data as our training data, and the last 10% of the data as validation data. This split contains 206,286 samples.

CALM (GPT-2) Random Initialization
For the CALM (GPT-2) variant with random initialization, we begin with a GPT-2 model that has not been pretrained. We only use the transcripts in ClubFloyd that do not correspond to any Jericho game we test on. We randomly order the transcripts, and order the samples based on the order of the transcripts. We select the first 90% of the data as our training data, and the last 10% of the data as validation data.
Parameter Optimization In order to train GPT-2, we minimize the cross-entropy between GPT-2's distribution over actions and the action taken in the ClubFloyd example. We use Adam to optimize the weights of our model with learning rate = 2e-5 and Adam epsilon = 1e-8. For the learning rate we use a linear schedule with warmup. Finally, we clip gradients allowing a max gradient norm of 1.
We include the loss on the train and validation set, as well as the accuracy (defined as the percentage of examples on which the action assigned the  highest probability by GPT-2 was the ClubFloyd action) in Table 4.

B.2 CALM (n-gram)
In order to train the CALM n-gram model, we consider the set of transcripts in ClubFloyd (excluding Jericho games). Next, we take the set of actions that appear in these transcripts, and train an n-gram model with Laplace α smoothing to model these sequences (Jurafsky and Martin, 2009). We order actions by the transcript they appear in and take the first 70% of the actions as train data and leave the remaining 30% as validation data. For each n, we choose the α that minimizes perplexity per word on the validation data. We also tried a linear interpolation of these estimates (Jurafsky and Martin, 2009), although we did not observe an improvement over our bigram model. In this model, we estimate

p(a_i | a_{i−3}, a_{i−2}, a_{i−1}) = w_1 p*(a_i | a_{i−3}, a_{i−2}, a_{i−1}) + w_2 p*(a_i | a_{i−2}, a_{i−1}) + w_3 p*(a_i | a_{i−1}) + w_4 p*(a_i),

where Σ_i w_i = 1, and p* indicates our m-gram estimate for p(a_i | a_{i−m+1}, ..., a_{i−1}).
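The interpolation above is a convex combination of the individual m-gram estimates. A minimal sketch with toy probabilities and weights (not the fitted values):

```python
def interpolate(estimates, weights):
    """Convex combination of m-gram estimates: sum_m w_m * p*_m, sum(w) == 1."""
    assert abs(sum(weights) - 1.0) < 1e-9, "interpolation weights must sum to 1"
    return sum(w * p for w, p in zip(weights, estimates))

# Toy 4-gram, trigram, bigram, unigram estimates for one token.
p = interpolate([0.1, 0.2, 0.4, 0.05], [0.4, 0.3, 0.2, 0.1])
```

Since each p* is a probability distribution and the weights sum to 1, the interpolated estimate remains a valid probability distribution.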

C Walkthrough Evaluation
In Figure 10, we provide a piece of the walkthrough trajectory of Zork1, with GPT-2- and n-gram-generated actions at each state. Note that the n-gram actions are mostly limited to no more than two tokens, while GPT-2 can generate more complex actions such as "put sword in case".
In Figure 9, we provide game-specific metric curves for Zork1 and Detective. On harder games like Zork1, there is a significant gap between GPT-2 and n-gram, while on easy games like Detective the gap is very small.
We provide per-game results for the model variants in Table 5. Interestingly, CALM (w/ Jericho) is significantly better than CALM (GPT-2) on Temple and Deephome (achieving non-trivial scores), even though these are not games with ClubFloyd scripts added. On the other hand, games like 905 and moonlit do have scripts added but do not improve.
Finally, we append an example trajectory piece of DRRN + CALM (GPT-2) on Zork1:

...ove the trophy case hangs an elvish sword of great antiquity. [SEP] get sword [SEP] taken. you are carrying: a sword a nasty knife a rope a brass lantern a clove of garlic a jewel-encrusted egg living room you are in the living room. there is a doorway to the east, a wooden door with strange gothic lettering to the west, which appears to be nailed shut, a trophy case, and a large oriental rug in the center of the room.
[SEP] gpt2 acts: ['throw lantern at egg', 'throw knife at egg', 'throw sword at egg', 'throw sword at lantern', 'put down all', 'put down rope', 'put down egg', 'put down garlic', 'put down lantern', 'put down knife', 'put down sword', 'take on egg', 'turn on lantern', 'east'] gold act: ['turn on lantern'] score: 40

...d up. To the north a narrow path winds through the trees. [SEP] north [SEP] Forest Path This is a path winding through a dimly lit forest. The path heads north south here.
One particularly large tree with some low branches stands at the edge of the path. You are empty handed. Forest Path This is a path winding through a dimly lit forest. The path heads north south here. One particularly large tree with some low branches stands at the edge of the path. [SEP] Actions62235: ['climb tree', 'up', 's', 'n', 'north', 'south', 'east', 'west'] Qvalues62235: [15.38, 15.29, 12.4, 12.34, 11.99, 11.73, 11.13, 10.57] >> Action62235: up Reward62235: 0, Score 0, Done False State62236: [CLS] Forest Path This is a path winding through a dimly lit forest. The path heads north south here. One particularly large tree with some low branches stand Qvalues62237: [12.93, 12.93, 11.49, 11.22, 11.1, 9.49, 9.41, 9. Qvalues62240: [17.9, 17.89, 14.71, 14.5, 13.59, 13.51, 12.97, 12 ['open window', 'take it', 'get egg', 'get encrusted egg', 'eat egg', 'take egg', 'get it', 'take all', 'get all', 'north', 'northwest', 'south', 'east', 'southwest'] Qvalues62241: [19.91, 16.52, 16.44, 16.4, 16.26, 16.25, 14.9, 14.25, 14.03, 13.86, 13.17, 12.49, 12.45, 12 .74, 13.68, 12.38, 11.53, 11.4, 11.25, 11.13, 11.06, 10.36, 10.23, 10.15, 9.63, 9.61, 6.54] >> Action62243: open sack Reward62243: 0, Score 15, Done False