Learning to Generalize for Sequential Decision Making

We consider problems of making sequences of decisions to accomplish tasks, interacting via the medium of language. These problems are often tackled with reinforcement learning approaches. We find that these models do not generalize well when applied to novel task domains. However, the large amount of computation necessary to adequately train and explore the search space of sequential decision making, under a reinforcement learning paradigm, precludes the inclusion of large contextualized language models, which might otherwise enable the desired generalization ability. We introduce a teacher-student imitation learning methodology and a means of converting a reinforcement learning model into a natural language understanding model. Together, these methodologies enable the introduction of contextualized language models into the sequential decision making problem space. We show that models can learn faster and generalize more, leveraging both the imitation learning and the reformulation. Our models exceed teacher performance on various held-out decision problems, by up to 7% on in-domain problems and 24% on out-of-domain problems.


Introduction
We make many decisions as we interact with the world. When we are rewarded (respectively, punished), we learn to modify not only the proximal cause of the stimulus but the chain of decisions leading up to it, to encourage (respectively, discourage) future similar results. This process is naturally the paradigm of Reinforcement Learning (RL). Policy-based learning seeks to find good estimates for Q(s, a), a function that returns the expected cumulative reward (known as a Q-value) if action a is chosen at state s. A desirable property of methodologies to learn Q is their ability to generalize such that an appropriate action can be taken when encountering a previously unseen state.
Recent advances have shown strong evidence of generalization in spatiotemporal modalities such as robotic manipulation (Xu et al., 2018), video games (Tessler et al., 2017), and autonomous navigation (Zhu et al., 2017). However, in the modality of language, there is less work applying generalization approaches to decision making. Useful applications of sequential decision making language models are personal assistants that proactively anticipate client needs; anti-phishing mediation agents that waste a would-be thief's time with relevant but non-helpful responses; and investigative journalist assistants that determine what to read, whom to contact, and what questions to ask to create a revelatory news report.
Neural reinforcement learning (RL) training approaches, such as those used to play action video games (Mnih et al., 2013), have potential applicability in language-based decision making due to their ability to learn to navigate adversarial or exploratory scenarios. Naturally, the generalization and background knowledge capability afforded by large contextualized language models such as BERT (Devlin et al., 2019) may be applicable as well. A useful virtual world proxy in which to explore these approaches' applicability is that of text adventure game playing. In a text adventure game, a player is immersed in an environment by reading textual descriptions of a scene and issuing natural language commands to navigate inside the scene. The player discovers and interacts with entities and accomplishes goals, while receiving explicit rewards for doing so.
Learning to play text games is a useful pursuit because it is a convenient proxy for the real world cases cited above. Unlike these, plentiful data for numerous games exist, an endless supply of games can be constructed, and text games have built-in reward functions, making them suitable for RL. This class of problems is also useful because it is challenging: after exposure to a family of games that explore the same topic and have similar gameplay (e.g., games involving cooking a specified recipe), human players perform nearly perfectly on additional games, but computer models struggle.
Why is this? Humans quickly understand the situation they are placed in and can make rational decisions based on trial-and-error and life experience, which we can call commonsense knowledge. Knowing a priori that, e.g., a closed door should be open or that it is helpful to light a lamp in a dark dungeon allows (human) players to learn faster. Even though these games have the complexity of finite-state machines, computer models cannot learn to play them well. The problem appears to be due to a lack of generalization caused by a lack of commonsense. To a computer model, considering whether to fry using a fridge is no more ludicrous than considering whether to fry using a plate (which, to an untrained human cook, may be plausible, though is certainly not a good idea). Both actions can be discouraged by negative reinforcement, but a human only needs to learn not to do the latter. Furthermore, a computer player learning that one can chop carrots with a knife may not generalize that one can chop celery the same way, but a human surely will.
There is existing work in learning to play text games with RL (Narasimhan et al., 2015; Yuan et al., 2018; Kim, 2014; Zahavy et al., 2018; Yin and May, 2019a; Tessler et al., 2019) but the standard pattern of incorporating large language models such as BERT (Devlin et al., 2019) has not yet been seen in current literature. It turns out that this integration is not trivial. Most models that use BERT and its ilk predominantly apply their results to supervised learning tasks that have training data with ground truth (Zellers et al., 2018; Wang et al., 2018) or at least, in the case of generation-based tasks like dialogue and translation, a corpus of desirable output to mimic (Wolf et al., 2019; Imamura and Sumita, 2019). For tasks suited to RL such as the exploration of and interaction with a world, there is no true target or even, initially, a corpus, and thus learning can only proceed iteratively via, e.g., exploration-exploitation (Mnih et al., 2013), which requires millions of training iterations to converge (Yin and May, 2019a; Narasimhan et al., 2017; Mnih et al., 2013). Integrating this process with the additional overhead of fine-tuning a large model like BERT leads to an impractical slowdown: for the experiments considered in this work, the baseline models that use CNN require a little more than three weeks to train on an Nvidia P100 GPU-equipped machine. Using the same models on the same tasks run for the same number of iterations on the same hardware while fine-tuning a 12-layer BERT model would take more than two years.
Figure 1: Comparison of the training process between DQN (left) and teacher-student DQN (right). In regular DQN training, the agent plays games to collect partial game-playing into memory, for later sampling and training. In the teacher-student training method, the teacher, a well-trained agent, generates a partial play curriculum. Student agents use only that curriculum to train their models.
In this work, we compare different previously used representation models for deep RL through an imitation learning method that first trains a lightweight teacher using exploration-exploitation, and then uses that trained model to train a more heavyweight student model. This dramatically decreases the amount of training time needed to learn. Moreover, we devise a means of casting an RL problem into a supervised learning paradigm, allowing better exploitation of large contextualized language models. In so doing, we show that agents can benefit from both the imitation learning and the reformulation, converging faster than other models, and exceeding teacher performance by 7% on in-domain problems and 24% on out-of-domain problems, despite the limited search space.
The novel contributions of this work are:
• We develop a teacher-student model training method for sequential text-based decision making problems, enabling the efficient incorporation of heavy-weight external information models.
• We develop a method for casting student RL model training in the same form as a supervised Natural Language Understanding task, enabling solutions to those tasks to be applied to sequential decision making.
• We evaluate our methods on in-domain and out-of-domain text game data sets, extrinsically and intrinsically demonstrating the effect of external commonsense knowledge and generalization at improving model abilities.
• We release our data, models, and code for documentation, replicability, and to enable subsequent improvements. 1

1 https://github.com/yinxusen/learning_to_generalize

Background: Reinforcement Learning for Text Games
The core approach of Deep Q-Networks (DQN) as described by Mnih et al. (2015) is to build a replay memory of partial games with associated scores, and use this to learn a function f_DQN : (S, A) → R, where f_DQN(s, a) is the Q-value obtained by choosing action a ∈ A when in state s ∈ S; from s, choosing arg max_{ã ∈ A} f_DQN(s, ã) affords the optimal action policy and this is generally used at inference time. 2 The DQN, which predicts Q given a (state, action) pair, is trained in an exploration-exploitation method known as ε-greedy (Mnih et al., 2015): first, the agent plays the game stochastically, generally guided by f_DQN, but with probability ε choosing a random action instead. The hyperparameter ε usually decays from 1 to 0 during the training process. As the agent plays, it collects partial play samples (s, a, r, s'), denoting taking action a at the game state s, and the immediate reward r plus the next state s' reached for doing so, into a replay memory. The DQN is then improved by sampling from the replay memory, and reducing loss between f_DQN and the true Q, which is estimated (with discount factor \gamma) as

Q(s, a) = r + \gamma \max_{a'} f_{DQN}(s', a').   (1)

Square error loss is minimized to improve f_DQN along the gradient:

\ell_{DQN} = \| f_{DQN}(s, a) - Q(s, a) \|^2.   (2)

2 There are various methods to choose actions according to policies, but for exposition, the greedy method is the most representative and straightforward one. See Section 4.2 for details.
The improved DQN is used to collect more replay data as the process iterates, as depicted in Figure 1 (left).
Equation 1 shows that at every step, for every sampled state, we can only estimate loss for a single state-action pair; we do not have the r or s' for actions other than a. The models eventually converge, but only after millions of training steps (Mnih et al., 2013; Yin and May, 2019a).
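As a minimal sketch of the training signal described in this section, the snippet below shows ε-greedy action selection and the one-step target for a single sampled (s, a, r, s') tuple. The function names, the default discount value, and the handling of terminal states are illustrative assumptions, not details taken from the paper.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


def td_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Estimate the 'true' Q for the single action that was actually taken.

    gamma (the discount factor) and its default value are assumptions.
    """
    if terminal or not next_q_values:
        return reward
    return reward + gamma * max(next_q_values)


def squared_error(predicted_q, target_q):
    """Square error between the network's Q-value and the estimated target."""
    return (predicted_q - target_q) ** 2


# One replay sample only supervises the Q-value of the action taken.
target = td_target(reward=1.0, next_q_values=[0.5, 2.0], gamma=0.5)
loss = squared_error(predicted_q=1.5, target_q=target)
```

Only one (state, action) pair receives a gradient per sample here, which is the limitation the paragraph above points out.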

Teacher-Student DQN Training
After running DQN training as described in Section 2 for some time, our well-trained agent, which we call the teacher, can provide Q-values for a set of actions A at every step. We can then collect many (state s, action set A, Q-table Q) game-play tuples into a curriculum pool by repeatedly playing the game and obtaining f_DQN-teacher(s, a) for all a ∈ A. We now use that data to train a new agent f_DQN-student (the student), using the same DQN approach described in Section 2. However, unlike in the previous DQN scenario, the curriculum pool now contains Q-values for all of A at each state. 3 We can train all actions at one step for each trajectory since we have Q-values for all actions. Thus the loss is

\ell_{student} = \sum_{a \in A} \| f_{DQN-student}(s, a) - f_{DQN-teacher}(s, a) \|^2,   (3)

and the learning procedure is as depicted on the right side of Figure 1.
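The difference from regular DQN training can be sketched as follows: because the curriculum pool stores a teacher Q-value for every admissible action, the student's error can be accumulated over the whole action set at once. This is an illustrative sketch; summing (rather than averaging) the squared errors and the function name are assumptions.

```python
def student_loss(student_q, teacher_q):
    """Squared error summed over every admissible action at one state.

    student_q and teacher_q are equal-length lists of Q-values, one entry
    per action in the action set stored in the curriculum pool.
    """
    if len(student_q) != len(teacher_q):
        raise ValueError("student and teacher must score the same actions")
    return sum((s - t) ** 2 for s, t in zip(student_q, teacher_q))


# All actions contribute to the loss, not just the one that was played.
example = student_loss([1.0, 2.0, 0.0], [1.0, 4.0, 1.0])
```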
The main disadvantage of teacher-student learning is that in the student phase, the search space is bounded by that of the curriculum pool generated by the teacher agent. While a student model can generalize based on the curriculum pool's data, it cannot explore any more of the search space. On the other hand, student learning is much faster than teacher learning. The experience replay pool does not need to be repeatedly generated, and many more loss gradients can be calculated all at once. We will explore several architectures and configurations that take advantage of this speed.

State Representations
A fundamental parameter that must be specified is the input signal used to form the game state s and how it is encoded. For action video games, this generally consists of a sequence of images from the game display. We use a history of system description-player action sequences for text games, which we call a trajectory. We consider the following representation architectures for the trajectory, some of which are only possible to use in the significantly faster student learning scenario:
CNN. While much work applied to text games uses LSTMs (Hochreiter and Schmidhuber, 1997) to represent the trajectory (Narasimhan et al., 2015; Ammanabrolu and Riedl, 2019; Yuan et al., 2018; Kostka et al., 2017; Ansari et al., 2018), we favor CNN encoders with position embeddings, which are faster to train than LSTMs (Zahavy et al., 2018; Yin and May, 2019a; Kim, 2014). This encoder is the only representation that is fast enough for training the teacher model, given the fact that the trajectory length is usually much longer than a single sentence or paragraph. We also experiment with it as a student model trajectory representation. This baseline CNN encoder uses randomly initialized word embeddings that are fine-tuned during training. This encoder has one layer, with 32 convolutional filters for each filter size from 3 to 5 (Kim, 2014).
CNN-GloVe. The CNN-GloVe encoder is identical to the CNN encoder except for the use of GloVe (Pennington et al., 2014) for word embeddings; these are not fine-tuned.
Transformer. We use the Transformer (Vaswani et al., 2017) architecture configured in the same way as the BERT-base uncased model with 12 layers (Devlin et al., 2019), but with all weights randomly initialized. This model will serve as a comparison with the following model.
BERT. We use the BERT-base uncased model with 12 layers. This model has the same architecture as Transformer but is initialized with BERT weights (Devlin et al., 2019).
We use a max-pooling layer over the output of CNN as the encoded state in the same way that we do with CNN-GloVe, while for Transformer and BERT, we use the pooling output from the CLS token as the encoded state. All encoded states from different encoders are passed through a dense linear layer of 32 dimensions to ensure the encoded state is of equal size across models.
We use BERT's provided Byte-Pair Encoding (Sennrich et al., 2016) sub-word tokenizer and vocabulary with 30,522 tokens for CNN, Transformer, and BERT. For CNN-GloVe, we use the GloVe 6B model with 400,000 tokens and the TreeBank word tokenizer from NLTK (Loper and Bird, 2002) since GloVe embeddings are pre-determined and not compatible with BPE. We use a zero vector as the padding token and the average of all word embeddings as the unknown token for CNN-GloVe. CNN uses a word embedding size of 64, while for CNN-GloVe and BERT, we use the pre-trained word embedding size, i.e., 50 dimensions for CNN-GloVe (we choose this dimension because it is close to our CNN) and 768 for BERT (and likewise for Transformer).
Figure 2: [...] and an LSTM into state and action representations, respectively, followed by a dense layer to compute the Q-values. On the bottom, we show a truncated example of dialogue from a text game in the cooking genre, with S1 and S2 representing the system's descriptions, and P1 showing the player's first actions in response to S1. S1 + P1 + S2 is an example of a trajectory. P2 shows a set of admissible actions.
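The padding and unknown-token convention described for CNN-GloVe can be illustrated with a small sketch. The token names "<pad>" and "<unk>", the helper names, and the toy two-dimensional vectors are hypothetical; only the zero-vector padding and the averaged unknown vector come from the text.

```python
def build_embedding_table(pretrained):
    """Add padding and unknown entries to a token -> vector mapping.

    pretrained: dict mapping tokens to equal-length lists of floats.
    """
    dim = len(next(iter(pretrained.values())))
    count = len(pretrained)
    # Unknown token: element-wise average of all pre-trained vectors.
    unk = [sum(vec[i] for vec in pretrained.values()) / count for i in range(dim)]
    table = dict(pretrained)
    table["<pad>"] = [0.0] * dim  # padding token: zero vector
    table["<unk>"] = unk
    return table


def embed(tokens, table, max_len):
    """Truncate or pad a token list to max_len and look up each vector."""
    padded = tokens[:max_len] + ["<pad>"] * (max_len - len(tokens))
    return [table.get(tok, table["<unk>"]) for tok in padded]


toy = build_embedding_table({"take": [1.0, 3.0], "drop": [3.0, 5.0]})
vectors = embed(["take", "fry"], toy, max_len=3)
```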

Action Representations
A consequence of learning to play different games is that actions differ from one game to another. Vanilla DQNs, introduced by Mnih et al. (2015), are incompatible with this modification since they presume a predefined finite and consistent action space, such as the directions and push buttons of a joystick. Additionally, vanilla DQNs presume no semantic relatedness among action spaces. In text games, by contrast, it would make sense for, e.g., open the door to be semantically closer to shut the door than to dice the carrot.
In our experiments, we assume the action set for a test game may be unknown at training time, and that actions may have some interrelatedness. 4 We [...] (Hochreiter and Schmidhuber, 1997) and scored against state representations by inner products and an extra dense layer. In preliminary experiments, we found that LSTMs worked better than CNN on the small and similar actions in our space, such as take yellow potato from fridge and dice purple potato. We limit these actions to a maximum length of 10 tokens. We use DRRNs in both teacher and student scenarios.

Game-Playing as Sequential Natural Language Understanding Tasks
Large contextualized language models built on the Transformer architecture such as BERT and GPT (Radford et al., 2018) have been used in supervised multiple-choice Natural Language Understanding (NLU) tasks. While we have so far encoded trajectories and actions separately in the DRRN formulation of DQN (Section 3.2), NLU task architectures commonly encode context and hypothesis together, using a dense final layer to obtain scalar confidence in the hypothesis being the correct result of the context. This is then trained (with a cross-entropy loss) across all hypotheses for that context. By considering trajectories as context, actions as hypotheses, and arg max_{ã ∈ A} f_DQN-teacher(t, ã) 5 as a label for trajectory t from the curriculum pool, we may easily switch to this framework, now minimizing a standard cross-entropy loss in place of DQN student learning. We call this model BERT-NLU-CE.
At evaluation time, the model chooses an action to take given a trajectory, but we are no longer explicitly learning a new Q-function; we are simply learning a preferred choice. Of course, having an existing Q-table from the teacher model, we may instead replace the cross-entropy loss with the familiar mean squared error loss (Section 3). This model, which we call BERT-NLU-SE, operates the same way as BERT-NLU-CE, but the values associated with each (trajectory, action) pair are once again regarded as Q-values. Figure 3 depicts the architecture of BERT-NLU-SE; BERT-NLU-CE is identical except the output is not explicitly intended to be a Q-value.
While most NLU tasks like SWAG (Zellers et al., 2018) or ARC (Clark et al., 2018) have no more than five hypotheses to choose from, even artificially constrained text-based games may have hundreds of potential choices. To make training feasible for text games, given each trajectory, we randomly sample three possible actions, along with the teacher model's most favored one. At evaluation time, the model can choose from all admissible actions.
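A minimal sketch of how one multiple-choice training instance might be assembled from a curriculum-pool entry, together with the two losses discussed in this section. Drawing the three sampled actions from those other than the teacher's favorite, shuffling the candidates, and all function names are assumptions made for illustration.

```python
import math
import random


def build_nlu_instance(actions, teacher_q, num_sampled=3, rng=random):
    """Pair the teacher's favorite action with a few randomly sampled ones."""
    best = max(range(len(actions)), key=lambda i: teacher_q[i])
    others = [i for i in range(len(actions)) if i != best]
    chosen = [best] + rng.sample(others, min(num_sampled, len(others)))
    rng.shuffle(chosen)  # avoid always placing the label first (assumption)
    return {
        "actions": [actions[i] for i in chosen],
        "q_values": [teacher_q[i] for i in chosen],
        "label": chosen.index(best),
    }


def cross_entropy(scores, label):
    """CE-style loss: negative log-softmax of the labeled hypothesis."""
    top = max(scores)
    log_z = top + math.log(sum(math.exp(x - top) for x in scores))
    return log_z - scores[label]


def squared_error(scores, teacher_q):
    """SE-style loss: regress every score onto the teacher's Q-value."""
    return sum((s - q) ** 2 for s, q in zip(scores, teacher_q))


instance = build_nlu_instance(
    ["open fridge", "take knife", "go east", "dice carrot", "cook potato"],
    [0.1, 0.9, 0.2, 0.3, 0.4],
    rng=random.Random(0),
)
```

At evaluation time no sampling is needed: every admissible action is scored and one is chosen from the full set.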

Games and Evaluation Methodology
Unlike most video- or text-game-playing work (Mnih et al., 2013; Zahavy et al., 2018; Yin and May, 2019a; Narasimhan et al., 2015) which incrementally learns to play games through RL approaches and reports results on those same games, we evaluate on games that are not seen during learning. Our games are generated from the TextWorld platform (Côté et al., 2019), which procedurally generates a wide variety of game variants with different maps, objectives, actions, threats, and background text, given user-supplied inputs. The platform provides the set of admissible actions, i.e., legal actions available at each state of each game. There are between 10 and 100 of these actions depending on the context.

Training and Evaluation Data
We use the games released by Microsoft for the First TextWorld Problems 6 competition for our training set and an evaluation set of unseen but in-domain games. The competition provides 4,440 cooking games generated by the TextWorld framework. The goal of each game is to prepare a recipe. The action space is simple, yet expressive, and has a moderately large, though domain-limited, vocabulary. One example presents a recipe with directions such as fry the pork chop, chop the yellow bell pepper, fry the yellow bell pepper, prepare meal. To succeed, the player must find and take the items in various rooms and containers, use the correct cooking implements, and not spoil the ingredients, e.g., by frying an apple that has already been fried. We hold out 444 of these games as an in-domain test set.
To evaluate our models' ability to generalize beyond their training domain, we also evaluate on an out-of-domain set, comprising 208 newly generated games 7 in a treasure-hunting genre. These have quite different actions, objects, and goals from cooking games. They generally require the player to navigate around rooms, find a specific object, and take a specified action with the entity, e.g., picking up a key and inserting it into a gate's lock in a different room to unlock it. These games have little vocabulary overlap with any cooking games apart from basic commands like take and drop.

Evaluation
We report scores on each test set as a percentage of the possible total score. Each game has 1-6 points available. At evaluation time, we play each game twice, stopping after the sooner of 100 steps, game completion, or game failure, and consider each play independently. Scores can vary because each gameplay uses an initial knowledge graph map construction built via random walks (Ammanabrolu and Riedl, 2019) and because a confidence bound is learned per action (Yin and May, 2019b), such that at evaluation time, lower-confidence actions are chosen with more stochasticity. An agent taking purely random walks (a low-bar baseline) scores 14% on the in-domain test and 16% on out-of-domain.
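To make the reporting convention concrete, the sketch below computes a test-set score from individual plays. It assumes the percentage is pooled over all plays (total points earned divided by total points available) rather than averaged per game; the text does not say which, and the function name is hypothetical.

```python
def percent_of_total(plays):
    """Pooled score: total points earned over total points available.

    plays: list of (points_earned, points_available) tuples, one per play;
    each game contributes two independent plays.
    """
    earned = sum(e for e, _ in plays)
    available = sum(a for _, a in plays)
    if available == 0:
        raise ValueError("no points available in the given plays")
    return 100.0 * earned / available


# Two plays each of a 6-point game and a 2-point game.
score = percent_of_total([(3, 6), (4, 6), (1, 2), (0, 2)])
```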
We train the teacher agent for 10 million steps on the 3,960 training games in the cooking domain, using deep Q-learning described in Section 3.2. We use a curriculum training schema (Yin and May, 2019b) to train our teacher model. During training, each 5,000-step checkpoint takes 25 minutes on a single P100 GPU. We decay ε from 1 to 0 during training. The teacher agent scores 70% on in-domain test and 33% on out-of-domain test.
7 https://github.com/microsoft/TextWorld#usage

Student Experiment Results
Having trained a teacher model using DQN and allowing unlimited exploration of the game space, we now experiment with several student learning approaches. Relative to the teacher model, the students are constrained to explore using data generated from the trained teacher model. This restriction limits their ability to search but enables much faster training and, consequently, richer models. All student models are trained for 500,000 steps of 32 (trajectory, action) pairs per batch, saving checkpoints every 5,000 steps and generating results for in- and out-of-domain test sets. Running on a single P100 GPU, all Transformer-based models take 75-80 minutes per 5,000 steps, while CNN-based models take 13 minutes.

Data Generation from Teacher Models
We generate student curriculum pools from the trained teacher model by playing all Cooking-Train games in random order. Specifically, we play games with the teacher agent using ε-greedy search (Section 2). We uniformly sample ε ∈ [0, 1] among different game playing episodes to increase the variety of trajectories exhibited to student learners. We collect the trajectory, all actions, and Q-values assigned to each action by the teacher model for each game playing step. In total, we collect 10 million instances of such tuples from the 3,960 Cooking-Train games.
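The collection procedure above can be sketched as a short loop. The game interface (`reset`, `step`), the `ToyGame` stand-in, the way the trajectory string is extended, and the `teacher_q` callable are all hypothetical; only the per-episode uniform ε, the random game order, and the stored (trajectory, actions, Q-values) tuples follow the text.

```python
import random


class ToyGame:
    """Hypothetical stand-in for a text game that ends after `length` steps."""

    def __init__(self, length=3):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return "You are in a kitchen.", ["go north", "take key"]

    def step(self, action):
        self.t += 1
        done = self.t >= self.length
        return f"Result of '{action}'.", ["go north", "take key"], done


def collect_curriculum(games, teacher_q, rng, max_steps=100):
    """Play each game once with the teacher and record one tuple per step."""
    pool = []
    order = list(games)
    rng.shuffle(order)  # play games in random order
    for game in order:
        epsilon = rng.random()  # one epsilon per episode, uniform in [0, 1)
        trajectory, actions = game.reset()
        for _ in range(max_steps):
            q_values = [teacher_q(trajectory, a) for a in actions]
            pool.append((trajectory, list(actions), q_values))
            if rng.random() < epsilon:
                choice = rng.randrange(len(actions))
            else:
                choice = max(range(len(actions)), key=lambda i: q_values[i])
            chosen = actions[choice]
            observation, actions, done = game.step(chosen)
            trajectory = f"{trajectory} {chosen} {observation}"
            if done:
                break
    return pool


pool = collect_curriculum(
    [ToyGame(3), ToyGame(2)],
    teacher_q=lambda trajectory, action: float(len(action)),
    rng=random.Random(0),
)
```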

In-Domain DRRN Results
In Figure 4, we compare student model learning to the teacher model's final position (horizontal line). We see that for many of the models, the trade-off to a potentially more sophisticated architecture is not worth the damage caused by limited exploration. As expected, our baseline model, CNN, which is the same model used for teacher training, converges to 67% of the total possible score at around 300,000 steps of training; the teacher model is at 70%. CNN-GloVe, compared to CNN, is even worse and converges more slowly. Even though CNN-GloVe is equipped with pre-trained word embeddings, the student agent cannot benefit from it.
Transformer performs comparably to CNN, but BERT learns much more quickly than all other models, reaching 72% of the total score on test games, 5 points higher than any other student model and somewhat better than the teacher model, which is an encouraging preliminary result.

In-Domain NLU Results
In Figure 5 we explore the performance of the NLU-inspired architecture (Section 3.3). The cross-entropy-based approach, BERT-NLU-CE, is the most similar to standard supervised NLU tasks and performs comparably to the DRRN teacher-student framework. However, BERT-NLU-SE, which directly regresses to the Q-function's value, quickly converges to around 77% of optimal scores, 7 points higher than the teacher model.
Independent of the method for learning Q-values, we can choose between multiple methods to apply the policy at inference time. In Table 1 we compare three frequently used methods: ε-greedy, sampling, and LinUCB, a bandit feedback method (Auer, 2003; Abe et al., 2003; Abbasi-yadkori et al., 2011). Following Yin and May (2019b), we use ε = 0 for the ε-greedy method. For the sampling method, we choose different temperatures over the Q-values. We follow Yin and May (2019b) for the LinUCB method. Table 1 crosses BERT-NLU-CE and BERT-NLU-SE training with five different inference approaches. The same Q-values are used for each setting using BERT-NLU-CE and for each setting using BERT-NLU-SE. We find that models trained with square error loss and evaluated using sampling are highly sensitive to temperature; cross-entropy-trained models are fairly insensitive. However, both the ε-greedy and the sampling methods perform worse than the LinUCB method.

      GRD   UCB   Sampling
                  t=0.01   t=0.1   t=1.0
SE    72    77    72       66      27
CE    67    71    65       65      66

Table 1: Comparing different inference methods for BERT-NLU agents over Cooking-Test (percentage of total possible score). We compare two BERT-NLU agents, one trained with square error loss (SE) and one with cross-entropy loss (CE). We use three different methods of choosing actions from policies: greedy (GRD), LinUCB (UCB), and sampling at three different temperatures.

                         SE    CE
BERT                     77    71
no-init                  69    60
freeze-all-but-pooler    26    25
freeze-to-penultimate    61    54

Table 2: Ablation study of the NLU method. We compare the best test-set evaluation scores (percentage of total) over the 100 epochs of training for each model trained with either square error loss (SE) or cross-entropy loss (CE).
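Two of the inference methods compared in Table 1 can be sketched briefly. Interpreting "sampling at a temperature over the Q-values" as drawing from a softmax of Q/t is an assumption, the function names are illustrative, and LinUCB is not shown.

```python
import math
import random


def greedy(q_values):
    """GRD: always take the highest-valued action (epsilon = 0)."""
    return max(range(len(q_values)), key=lambda i: q_values[i])


def softmax(q_values, temperature):
    """Turn Q-values into a distribution; lower temperature is peakier."""
    scaled = [q / temperature for q in q_values]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(q_values, temperature, rng=random):
    """Sampling: draw an action index from the tempered softmax."""
    probs = softmax(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


q = [0.2, 1.5, 0.7]
sharp = softmax(q, 0.01)  # nearly all mass on the best action
flat = softmax(q, 1.0)    # noticeably more spread out
```

Under this reading, a low temperature behaves almost like the greedy choice while a high temperature spreads probability across actions, which is one way to see why a model whose outputs are real-valued Q-estimates can be sensitive to the temperature setting.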
We ablate the impact of fine-tuning BERT in Table 2, showing what happens if we fine-tune nothing except the pooler (freeze-all-but-pooler), fine-tune only the last layer and the pooler (freeze-to-penultimate), or fine-tune all layers (BERT). We also show the fine-tuned equivalent Transformer that is not pre-trained (no-init) for comparison. All settings fine-tune the 768-parameter last dense layer on top of Transformer to compute Q-values.
Figure 6: Comparing performance on out-of-domain test games. Only two agents exceed teacher results (by 24% and 13%, respectively). Analysis in Figure 7 shows that the superior performing BERT-NLU-SE benefits from external knowledge, while CNN-DRRN mostly benefits from stochasticity.
The freeze-to-penultimate setting allows the final Q-value layer, the pooler, and the last layer of BERT to train. In total, more than seven million parameters are trainable in freeze-to-penultimate. However, the performance still has a 16% gap compared to the fully fine-tuned 110-million-parameter BERT models. This ablation study shows that the benefits coming from BERT cannot be reproduced by simply using out-of-the-box BERT weights, which would speed up the training process, and underscores the importance of imitation learning.
Figure 6 shows the result of evaluating with out-of-domain games. These games have different goals and action sets from the training games, so it is possible during training to observe performance curve drops, an indication that the model is overfitting on the cooking game genre and not properly generalizing. Most of the DRRN student models exhibit some overfitting; only the CNN model learns somewhat well, scoring 46% against the teacher model's 33%. BERT-NLU-SE, the NLU-style architecture that fine-tunes BERT and is trained to directly estimate Q-values, greatly exceeds the teacher model's performance, scoring 57% on these games from an unseen genre.
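As a rough check on the "more than seven million" trainable parameters quoted for freeze-to-penultimate, the arithmetic below counts the weights of one BERT-base encoder layer, the pooler, and the Q-value layer. It assumes the standard BERT-base sizes (hidden size 768, feed-forward size 3072), that "last layer" means the final Transformer encoder layer, and that the Q-value layer has 768 weights as stated in the ablation description; the variable names are illustrative.

```python
HIDDEN = 768         # BERT-base hidden size
INTERMEDIATE = 3072  # BERT-base feed-forward size


def dense(n_in, n_out):
    """Parameters of a dense layer with bias."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Scale and shift vectors of a layer normalization."""
    return 2 * n


# Self-attention: query, key, value, and output projections.
attention = 4 * dense(HIDDEN, HIDDEN)
# Feed-forward block: expand to INTERMEDIATE, then project back.
feed_forward = dense(HIDDEN, INTERMEDIATE) + dense(INTERMEDIATE, HIDDEN)
# One layer norm after attention and one after the feed-forward block.
norms = 2 * layer_norm(HIDDEN)

encoder_layer = attention + feed_forward + norms
pooler = dense(HIDDEN, HIDDEN)
q_head = HIDDEN  # the 768-parameter dense layer mentioned in the ablation

trainable = encoder_layer + pooler + q_head
print(f"{trainable:,}")  # 7,679,232
```

Under these assumptions the count comes to roughly 7.7 million, consistent with the figure in the text and a small fraction of the full 110-million-parameter model.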

Discussion
In this section we seek to understand the following:
• What extra information BERT-NLU-SE leverages compared to other DRRN models (Figures 4 and 5);
• What generalization and extra information BERT-NLU-SE leverages on out-of-domain games, and why the CNN student model performs better than expected on out-of-domain games (Figure 6).
A qualitative investigation of model performance on in-domain test sets shows that game failure arises when a model decides to prepare an ingredient improperly (e.g., to use the BBQ instead of the stove to fry). Models initialized with BERT have fewer such failures, indicating that BERT provides background cooking knowledge, beyond what can be learned from the curriculum pool. Example gameplays and complete statistics on test games are provided in the Appendix.
A similar pattern is observed for out-of-domain tests. One test requires the player to use four different kinds of keys with matched locked containers. As the curriculum pool does not have any information relevant for this task, models without general background knowledge suffer. In the key/lock test game (a readout is in the Appendix), the teacher model repeatedly unlocks and locks a single box, and puts and takes the same key without making progress. The BERT-NLU-SE model, however, can correctly open the sequence of containers.
Figure 7 provides more insight into model performance, including an explanation for the surprising success of the CNN model. That figure shows the KL-divergence (Kullback and Leibler, 1951) between a uniform distribution and the distribution formed from the Q-values (the categorical choice distribution for BERT-NLU-CE) at every step during every out-of-domain test, computed from the final point of each model. The CNN model's distribution is closer to uniform than the others. As stochastic choices are made at test time when the action distribution is uncertain (see Section 4.2), the CNN model performs more exploration during the evaluation of treasure-hunting games. These games do not have failure cases like the in-domain test games, so there can be some benefit to stochasticity. The other models are more confident 8 and, except for BERT-NLU-SE, are generally wrong. This result indicates that equipped with the ability to generalize from BERT pre-training, BERT-NLU-SE has learned the skill of decision making, rather than the ability to memorize patterns.
[...] (Zahavy et al., 2018), the introduction of a knowledge graph (Ammanabrolu and Riedl, 2019), text understanding with dependency parsing (Yin and May, 2019a), and the bandit feedback method for agent evaluation (Yin and May, 2019b). However, previous work uses different games to evaluate, making it difficult to compare results comprehensively.
With the TextWorld framework's availability, there is more and more work concentrating on the generalization ability of agents, which seldom appears in the video game playing domain. Yuan et al. (2018) work on generalization of agents on variants of a very simple coin-collecting game. The simplicity of their games enables them to use an LSTM-DQN method with a counting-based reward. Ammanabrolu and Riedl (2019) use a knowledge graph as a persistent memory to encode states, while we use a knowledge graph later on to make actions more informative.
The TextWorld competition has yielded a variety of works that use different approaches and methods: Yuan et al. (2019) co-train a DQN with a question answering system for building new interactive machine reading comprehension tasks while creating agents to solve games. Madotto et al. (2020) describe a non-RL method to learn agents, by first randomly playing on training games, then collecting all winning trajectories. By using these trajectories as training data, they manage to transform an RL problem into supervised learning. Adolphs and Hofmann (2020) use an actor-critic framework and prune the action-space by using hierarchical RL and a specialized module trained on a recipe database to build better agents. Jain et al. (2020) apply the action elimination method proposed by Zahavy et al. (2018) on Zork to the cooking games.
For teacher-student training, Rusu et al. (2015) design a policy distillation method that trains different agents as teacher agents. Each of these teacher agents learns to play a single, separate game. Then they build one student learner that can be trained with a supervised learning method to distill the policy knowledge for multi-game playing. Ansari et al. (2018) also use teacher-student training for text-based games. However, our teacher-student training method is different: we use one teacher that can play multiple games to guide multiple student agents' learning processes.

Conclusion
We provide a recipe for integrating large contextualized language models and deep reinforcement learning, applied to sequential decision making, and demonstrate it on the proxy task of text games, showing dramatic improvements over the standard practice, particularly in out-of-domain held-out tests. We expect to apply this approach to various challenging real-world sequential decision scenarios, such as goal-directed dialogue and active information-gathering.

Acknowledgments
This research is based upon work supported in part by the DARPA Machine Common Sense program, Contract N660011924032 through NIWC. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

A Appendix
This appendix contains comprehensive results of the models explored in this work on the two test sets. Table 3 shows the number of in-domain games won, lost due to incorrect handling of materials, and lost due to taking too many steps; the last category is helpful, along with Q-table KL-divergence from a uniform distribution, in explaining the poor performance of the overly deterministic BERT-NLU-CE, which fails very infrequently but often gets stuck in a step loop instead.
Table 3: Tally of in-domain test game outcomes for the models explored in this work, indicating which games are won, which games lead to a failure (by spoiling ingredients), and which end when the step limit (100) is reached.
BERT-NLU-CE and BERT-NLU-SE fail on in-domain tests the least, indicating a better intuitive understanding of how to prepare recipes. Because BERT-NLU-CE is so confident (Figure 7) it can get stuck in a deterministic loop; this is reflected in a larger number of 'too many steps' outcomes relative to BERT-NLU-SE.

insert key into cabinet You put the key into the cabinet.
take key from cabinet You take the key from the cabinet.
close cabinet You close the cabinet.
open cabinet You open the cabinet.
drop key You drop the key on the ground.
take key You pick up the key from the ground.
examine chest The chest looks strong, and impossible to open. You can't see inside it because the lid's in your way.
drop keycard You drop the keycard on the ground.
take keycard You pick up the keycard from the ground.
open chest You open the chest, revealing a passkey.
take passkey from chest You take the passkey from the chest. Your score has just gone up by one point. *** The End *** You scored 1 out of a possible 1, in 26 turn(s).

C Teacher on the Treasure Hunting Game
Who's got a virtual machine and is about to play through an life changing round of TextWorld? You do! Here is how to play! Your first objective is to pick up the keycard in the cubicle.
take worm You pick up the worm from the ground.
drop worm You drop the worm on the ground.
take keycard You pick up the keycard from the ground.
drop keycard You drop the keycard on the ground.
insert latchkey into cabinet You put the latchkey into the cabinet.
take latchkey from cabinet You take the latchkey from the cabinet.
examine box The box looks strong, and impossible to crack open. It is closed.
look -= Cubicle =- I never took you for the sort of person who would show up in a cubicle, but I guess I was wrong. You make out a locked chest. You see a locked type A locker here. You can see a box. You make out an open cabinet. The cabinet is empty! What a waste of a day! There is a closed door leading south. There is a closed passageway leading west. There is a keycard and a worm on the floor.
insert type A latchkey into cabinet You put the type A latchkey into the cabinet.
take type A latchkey from cabinet You take the type A latchkey from the cabinet.
take mouse from box You take the mouse from the box.
examine mouse The mouse would seem to be to fit in here
insert latchkey into box (the latchkey into the box) You put the latchkey into the box.
take latchkey from box You take the latchkey from the box.
insert type A latchkey into box You put the type A latchkey into the box.
drop mouse You drop the mouse on the ground.
insert keycard into box You put the keycard into the box.
take keycard from box You take the keycard from the box.
close box You close the box.
take mouse You pick up the mouse from the ground.
insert mouse into cabinet You put the mouse into the cabinet.
take mouse from cabinet You take the mouse from the cabinet.
lock box with keycard You lock the box.
insert keycard into cabinet You put the keycard into the cabinet.
take keycard from cabinet You take the keycard from the cabinet.
examine mouse The mouse would seem to be to fit in here
drop mouse You drop the mouse on the ground.
take mouse You pick up the mouse from the ground.
insert mouse into cabinet You put the mouse into the cabinet.
take mouse from cabinet You take the mouse from the cabinet.
drop keycard You drop the keycard on the ground.
insert latchkey into cabinet You put the latchkey into the cabinet.
take latchkey from cabinet You take the latchkey from the cabinet.
examine box The box looks strong, and impossible to crack open. It is closed.
look -= Cubicle =- I never took you for the sort of person who would show up in a cubicle, but I guess I was wrong. You make out a locked chest. You see a locked type A locker here. You can see a box. You make out an open cabinet. The cabinet is empty! What a waste of a day! There is a closed door leading south. There is a closed passageway leading west. There is a keycard and a worm on the floor.
===== won: False lost: False total steps: 100

D BERT-NLU-SE on one Cooking Game
You are hungry! Let's cook a delicious meal. Check the cookbook in the kitchen for the recipe. Once done, enjoy your meal! -= Pantry =- You are in a pantry. An usual one. You can see a shelf. But the thing hasn't got anything on it. Oh! Why couldn't there just be stuff on it?
There is an open frosted-glass door leading north.
go north -= Kitchen =- You're now in the kitchen. Look out! It's a-oh, never mind, it's just a fridge. Suddenly, you bump your head on the ceiling, but it's not such a bad bump that it's going to prevent you from looking at objects and even things. The fridge contains some water, a diced block of cheese, a diced yellow bell pepper, a raw yellow potato, an orange bell pepper, a raw pork chop and a cilantro. You see a closed oven nearby. You can make out a
take block of cheese from fridge You take the block of cheese from the fridge. Your score has just gone up by one point.
examine cookbook You open the copy of "Cooking: A Modern Approach (3rd Ed.)" and start reading: Recipe #1 --- Gather all following ingredients and follow the directions to prepare this tasty meal. Ingredients: block of cheese, water, yellow bell pepper. Directions: dice the block of cheese; grill the block of cheese; dice the yellow bell pepper; roast the yellow bell pepper; prepare meal.
inventory You are carrying: a diced block of cheese
take water from fridge You take the water from the fridge. Your score has just gone up by one point.
examine cookbook You open the copy of "Cooking: A Modern Approach (3rd Ed.)" and start reading: Recipe #1 --- Gather all following ingredients and follow the directions to prepare this tasty meal. Ingredients: block of cheese, water, yellow bell pepper. Directions: dice the block of cheese; grill the block of cheese; dice the yellow bell pepper; roast the yellow bell pepper; prepare meal.
inventory You are carrying: some water, a diced block of cheese
take yellow bell pepper from fridge You take the yellow bell pepper from the fridge. Your score has just gone up by one point.
examine cookbook You open the copy of "Cooking: A Modern Approach (3rd Ed.)" and start reading: Recipe #1 --- Gather all following ingredients and follow the directions to prepare this tasty meal. Ingredients: block of cheese, water, yellow bell pepper. Directions: dice the block of cheese; grill the block of cheese; dice the yellow bell pepper; roast the yellow bell pepper; prepare meal.
inventory You are carrying: a diced yellow bell pepper, some water, a diced block of cheese
cook yellow bell pepper with oven You roasted the yellow bell pepper. Your score has just gone up by one point.
examine cookbook You open the copy of "Cooking: A Modern Approach (3rd Ed.)" and start reading: Recipe #1 --- Gather all following ingredients and follow the directions to prepare this tasty meal. Ingredients: block of cheese, water, yellow bell pepper. Directions: dice the block of cheese; grill the block of cheese; dice the yellow bell pepper; roast the yellow bell pepper; prepare meal.
inventory You are carrying: a diced roasted yellow bell pepper, some water, a diced block of cheese
drop block of cheese You drop the block of cheese on the ground.
take block of cheese You pick up the block of cheese from the ground.
inventory You are carrying: a diced block of cheese, a diced roasted yellow bell pepper, some water
go east -= Backyard =- You arrive in a backyard. A typical kind of place. You start to take note of what's in the room. You can make out a patio chair. But the thing is empty, unfortunately. Hopefully, this discovery doesn't ruin your TextWorld experience! You rest your hand against a wall, but you miss the wall and fall onto a patio
cook block of cheese with BBQ You grilled the block of cheese. Your score has just gone up by one point.
go west -= Kitchen =- You're now in the kitchen. Look out! It's a-oh, never mind, it's just a fridge. Suddenly, you bump your head on the ceiling, but it's not such a bad bump that it's going to prevent you from looking at objects and even things. The fridge contains a raw yellow potato, an orange bell pepper, a raw pork chop and a cilantro. You see a closed oven nearby. You can make out a table. The table is massive. But the thing hasn't got anything on it. You can make out a counter. The counter is vast. On the counter you see a cookbook. Look over there! a stove. Now why would someone leave that there? The stove is conventional. But the thing is empty.
There is an open patio door leading east. There is an open frosted-glass door leading south. You need an exit without a door? You should try going west.
examine cookbook You open the copy of "Cooking: A Modern Approach (3rd Ed.)" and start reading: Recipe #1 --- Gather all following ingredients and follow the directions to prepare this tasty meal. Ingredients: block of cheese, water, yellow bell pepper. Directions: dice the block of cheese; grill the block of cheese; dice the yellow bell pepper; roast the yellow bell pepper; prepare meal.
inventory You are carrying: a diced grilled block of cheese, a diced roasted yellow bell pepper, some water
prepare meal Adding the meal to your inventory. Your score has just gone up by one point.
eat meal You eat the meal. Not bad. Your score has just gone up by one point. *** The End *** You scored 7 out of a possible 7, in 27 turn(s).

E Teacher on the Cooking Game
You are hungry! Let's cook a delicious meal. Check the cookbook in the kitchen for the recipe. Once done, enjoy your meal! -= Pantry =- You are in a pantry. An usual one. You can see a shelf. But the thing hasn't got anything on it. Oh! Why couldn't there just be stuff on it?
There is an open frosted-glass door leading north.
go north -= Kitchen =- You're now in the kitchen. Look out! It's a-oh, never mind, it's just a fridge. Suddenly, you bump your head on the ceiling, but it's not such a bad bump that it's going to prevent you from looking at objects and even things. The fridge contains some water, a diced block of cheese, a diced yellow bell pepper, a raw yellow potato, an orange bell pepper, a raw pork chop and a cilantro. You see a closed oven nearby. You can make out a table. The table is massive. But the thing hasn't got anything on it. You can make out a counter. The counter is vast. On the counter you see a cookbook. Look over there! a stove. Now why would someone leave that there? The stove is conventional. But the thing is empty.
There is an open patio door leading east. There is an open frosted-glass door leading south. You need an exit without a door? You should try going west.
examine cookbook You open the copy of "Cooking: A Modern Approach (3rd Ed.)" and start reading: Recipe #1 --- Gather all following ingredients and follow the directions to prepare this tasty meal. Ingredients: block of cheese, water, yellow bell pepper. Directions: dice the block of cheese; grill the block of cheese; dice the yellow bell pepper; roast the yellow bell pepper; prepare meal.
inventory You are carrying nothing.
take block of cheese from fridge You take the block of cheese from the fridge. Your score has just gone up by one point.
examine cookbook You open the copy of "Cooking: A Modern Approach (3rd Ed.)" and start reading: Recipe #1 --- Gather all following ingredients and follow the directions to prepare this tasty meal. Ingredients: block of cheese, water, yellow bell pepper. Directions: dice the block of cheese; grill the block of cheese; dice the yellow bell pepper; roast the yellow bell pepper; prepare meal.
inventory You are carrying: a diced block of cheese
take yellow bell pepper from fridge You take the yellow bell pepper from the fridge. Your score has just gone up by one point.
examine cookbook You open the copy of "Cooking: A Modern Approach (3rd Ed.)" and start reading: Recipe #1 --- Gather all following ingredients and follow the directions to prepare this tasty meal. Ingredients: block of cheese, water, yellow bell pepper. Directions: dice the block of cheese; grill the block of cheese; dice the yellow bell pepper; roast the yellow bell pepper; prepare meal.
inventory You are carrying: a diced yellow bell pepper, a diced block of cheese
take water from fridge You take the water from the fridge. Your score has just gone up by one point.
examine cookbook You open the copy of "Cooking: A Modern Approach (3rd Ed.)" and start reading: Recipe #1 --- Gather all following ingredients and follow the directions to prepare this tasty meal. Ingredients: block of cheese, water, yellow bell pepper. Directions: dice the block of cheese; grill the block of cheese; dice the yellow bell pepper; roast the yellow bell pepper; prepare meal.
inventory You are carrying: some water, a diced yellow bell pepper, a diced block of cheese
cook yellow bell pepper with oven You roasted the yellow bell pepper. Your score has just gone up by one point.
examine cookbook You open the copy of "Cooking: A Modern Approach (3rd Ed.)" and start reading: Recipe #1 --- Gather all following ingredients and follow the directions to prepare this tasty meal. Ingredients: block of cheese, water, yellow bell pepper. Directions: dice the block of cheese; grill the block of cheese; dice the yellow bell pepper; roast the yellow bell pepper; prepare meal.
inventory You are carrying: some water, a diced roasted yellow bell pepper, a diced block of cheese
drop block of cheese You drop the block of cheese on the ground.
take block of cheese You pick up the block of cheese from the ground.
inventory You are carrying: a diced block of cheese, some water, a diced roasted yellow bell pepper
cook block of cheese with stove You fried the block of cheese. *** You lost! *** You scored 4 out of a possible 7, in 20 turn(s).