How to Motivate Your Dragon: Teaching Goal-Driven Agents to Speak and Act in Fantasy Worlds

We seek to create agents that both act and communicate with other agents in pursuit of a goal. Towards this end, we extend LIGHT (Urbanek et al. 2019)—a large-scale crowd-sourced fantasy text-game—with a dataset of quests. These contain natural language motivations paired with in-game goals and human demonstrations; completing a quest might require dialogue or actions (or both). We introduce a reinforcement learning system that (1) incorporates large-scale language modeling-based and commonsense reasoning-based pre-training to imbue the agent with relevant priors; and (2) leverages a factorized action space of action commands and dialogue, balancing between the two. We conduct zero-shot evaluations using held-out human expert demonstrations, showing that our agents are able to act consistently and talk naturally with respect to their motivations.


Introduction
There has been a recent improvement in the quality of natural language processing (NLP) and generation (NLG) by machine learning (ML) (Vaswani et al., 2017; Devlin et al., 2018) and, in parallel, improvement in goal-oriented ML-driven agents in the context of games (Vinyals et al., 2019; Schrittwieser et al., 2019). However, agents that can communicate with humans (and other agents) through natural language in pursuit of their goals are still primitive. One possible reason for this is that many datasets and tasks used for NLP are static, not supporting interaction and language grounding (Brooks, 1991; Feldman and Narayanan, 2004; Barsalou, 2008; Mikolov et al., 2016; Gauthier and Mordatch, 2016; Lake et al., 2017). Text-based games, where players see, act upon, and communicate within a dynamic world using natural language, provide a platform on which to develop such goal-driven agents.
1 Data can be found here: https://parl.ai/projects/light/
LIGHT (Urbanek et al., 2019), a large-scale crowdsourced fantasy text-adventure game consisting of a set of locations, characters, and objects, possesses rich textual worlds but has no notion of goals with which to train goal-driven agents. We present a dataset of quests for LIGHT and demonstrations of humans playing these quests (as seen in Figures 2 and 3), providing natural language descriptions, in varying levels of abstraction, of motivations for a given character in a particular setting.
To complete these quests, an agent must reason about potential actions and utterances based on incomplete descriptions of the locations, objects, and other characters. When a human is placed in a fantasy setting such as LIGHT, they already know that kings are royalty and must be treated respectfully, that swords are weapons, and so on. This is commonsense knowledge that a learning agent must acquire to ensure successful interactions. To equip agents with relevant priors in such worlds, we domain-adapt the large-scale commonsense knowledge graph ATOMIC (Sap et al., 2019) to the LIGHT fantasy world to build ATOMIC-LIGHT.
We then introduce a reinforcement learning (RL) system that incorporates large-scale language modeling and the above commonsense-based pretraining. We show that RL is superior to behavior cloning or other supervised training on our data; and that carefully combining pre-training with RL is superior to either.
However, we find that although pre-training can be an effective tool in this setting, it requires more finesse than in the standard supervised setting. In particular, we find that simply pre-training a model on a large "generic" corpus (Sap et al., 2019; Baumgartner et al., 2020) of commonsense/language data, or pre-training on the domain-specific LIGHT corpus, and then fine-tuning via RL is less effective than training RL from scratch. Furthermore, by carefully combining general and domain-specific pre-training, we observe large improvements over RL from scratch.

[Figure: example quest motivations and timeline]
Motivations:
Short: I need to recover the dragon egg that was stolen and punish the knight.
Mid: I need to return the golden dragon egg to my treasure hoard.
Long: I need to build the largest hoard ever attained by any one dragon.
Timeline:
-4 hours: go to dangerous precipice
-15 min: get knights armor from knight
-10 min: get golden dragon egg
Now: hit knight
+5 min: put dragon egg on back
+15 min: eat the knight
+2 hours: go to the mountains

In short, the contributions of this paper are threefold: (1) a dataset of quests, LIGHT-Quests, and a companion fantasy-themed commonsense knowledge graph, ATOMIC-LIGHT; (2) a reinforcement learning architecture and training methodology that use these datasets to create goal-driven agents that act and speak in the LIGHT environment; and (3) empirical zero-shot evaluations based on human quest demonstrations and an analysis of large-scale transformer-based pre-training trends in static vs. interactive settings, showing that we have trained agents that act consistently and speak naturally with respect to their motivations.

Related Work
We focus on four major areas of related work: text-based game-playing, goal-oriented dialogue, commonsense reasoning in language, and general language-informed RL.
Text-based game-playing. Côté et al. (2018) introduce TextWorld, a framework for procedurally generating text-based games via grammars, and Yuan et al. (2018), Yin and May (2019), Adolphs and Hofmann (2019), and Adhikari et al. (2020) build agents that operate in this environment, focusing on aspects such as efficient exploration and zero-shot generalization to new, procedurally generated environments. Similarly, Hausknecht et al. (2020) introduce Jericho, a framework and series of baseline agents for interacting with human-made text-games such as Zork (Anderson et al., 1979). This resulted in agents developed by works such as Zahavy et al. (2018) and Ammanabrolu and Hausknecht (2020), aiming to learn to execute contextually relevant actions. Other works such as Narasimhan et al. (2015) and He et al. (2016) explore how to best factorize such text-game action spaces. None of these works consider agents with motivations and personas, nor do they require any dialogue.
Goal-oriented dialogue. This form of dialogue has traditionally been closely related to specific tasks useful in the context of personal assistants with dialogue interfaces (Henderson et al., 2014;El Asri et al., 2017). RL has been studied for such tasks, usually to improve dialogue state management (Singh et al., 2000;Pietquin et al., 2011;Fatemi et al., 2016) and to improve response quality (Li et al., 2016). In particular, the negotiation tasks of (Yarats and Lewis, 2017;Lewis et al., 2017), where two agents are trying to convince each other to perform certain actions, are related to the tasks in LIGHT-Quests. These works all lack environment grounding and the notion of diverse agent motivations.
Commonsense reasoning in language. Works such as Bosselut et al. (2019) and Guan et al. (2020) focus on pre-training transformer-based language learning systems with large-scale commonsense knowledge graphs such as ATOMIC (Sap et al., 2019) and ConceptNet (Speer and Havasi, 2012) for use in knowledge graph completion and story ending generation respectively. Fulda et al. (2017), Ammanabrolu and Riedl (2019), Ammanabrolu et al. (2020), and Murugesan et al. (2020) look at commonsense reasoning in interactive environments, with the first focusing on affordance extraction using word embeddings and the latter three on transferring text-game playing skills via pre-training using question-answering and large-scale knowledge graphs.
Language-informed reinforcement learning. Luketina et al. (2019) provide an overview of RL informed by natural language. Of these works, the ones most related to ours are those falling into the category of instruction following, where an agent's tasks are defined by high-level instructions describing desired policies and goals (MacMahon et al., 2006; Kollar et al., 2010). Visual and embodied agents using natural language instructions (Bisk et al., 2016; Kolve et al., 2017; Anderson et al., 2018) or operating in language-based action spaces (Das et al., 2017) utilize interactivity and environment grounding but have no notion of agent motivations, nor do they make any attempt to explicitly model commonsense reasoning. Perhaps closest in spirit to this work is Prabhumoye et al. (2020), who use artificially selected goals in LIGHT and train RL agents to achieve them. Similarly to the others, this work does not contain the motivations provided by LIGHT-Quests nor any modeling of commonsense reasoning. Further, they limit their RL problem to 1- and 3-step trajectories that involve only speech and no actions, whereas the human demonstrations in LIGHT-Quests contain both actions and speech, with sequences of average length 12.92.

LIGHT-Quests and ATOMIC-LIGHT
This section first provides a brief overview of the LIGHT game environment, followed by descriptions of the LIGHT-Quests and ATOMIC-LIGHT datasets used in this paper.
Background. The LIGHT game environment is a multi-user fantasy text-adventure game consisting of a rich, diverse set of characters, locations, and objects (1775 characters, 663 locations, and 3462 objects). Characters are able to perform templated actions to interact with both objects and characters, and can speak to other characters through free-form text. Actions in text games generally consist of verb phrases (VP) followed optionally by prepositional phrases (VP PP), for example get OBJ, put OBJ, give OBJ to CHAR, etc. There are 13 types of allowed verbs in LIGHT. These actions change the state of the world, which is expressed to the player in the form of text descriptions.
Figures 1, 2, and 3 summarize the data that we collected for LIGHT-Quests. Data is collected via crowdsourcing in two phases: first the quests, then demonstrations of humans playing them. During the first phase, crowdworkers were given a setting, i.e. situated in a world, in addition to a character and its corresponding persona, and asked to describe in free-form text what potential motivations or goals could be for that character in the given world. The kind of information given to the crowdworkers is seen in Figure 1. Simultaneously, they were also asked to provide a sequence of seven timeline actions (one action that needs to be completed now, plus three before and three after it at various user-defined intervals) for how the character might go about achieving these motivations.

LIGHT-Quests
Given the information in Figure 1, the crowdworkers completed the tasks outlined above and produced data as seen in Figure 2. Motivations come in three levels of abstraction (short, mid, and long), corresponding to differing amounts of the timeline. For example, the short motivation is always guaranteed to correspond most closely to the now position on the timeline. Action annotation is pre-constrained based on the classes of verbs available within LIGHT. The rest of the action is completed as free-form text, as it may contain novel entities introduced in the motivations. There are 5982 training, 756 validation, and 748 test quests. Further details regarding the exact data collection process and details of LIGHT-Quests are found in Appendix A.1.1.
After collecting motivations and timelines for the quests, we deployed a two-player version of the LIGHT game, letting players attempt the quests for themselves in order to collect human demonstrations. Figure 3 shows an example human expert demonstration of a quest. Players were given a character, setting, motivation, and a partner agent and left to freely act in the world and talk to the partner in pursuit of their motivations. The partner agent is a fixed poly-encoder transformer model (Humeau et al., 2020) trained on the original LIGHT data as well as other human interactions derived via the deployed game, using 111k utterances in total. Players first receive a role-playing score on a scale of 1-5 through a Dungeon Master (DM), a learned model that ranks how likely their utterances are given the current context. Once they have accumulated a score reaching a certain threshold, they are allowed to perform actions. We employ this gamification mechanism to encourage players to role-play their character persona and its motivations, leading to improved user experience and data quality (Horsfall and Oikonomou, 2011). They are then given further reward if the actions they perform sequentially match those on the timeline for the given quest. The game ends after a maximum of six turns of dialogue per agent, i.e. twelve in total. The average sequence length of a human demonstration is 12.92, with an average action sequence length of 2.18 and an average dialogue length of 10.74. There are 1800 training, 100 validation, and 211 test human expert demonstrations after the data was filtered. Additional details and examples are found in Appendix A.2.

ATOMIC-LIGHT
Commonsense reasoning is a critical cornerstone when building learning agents that navigate spaces such as LIGHT-Quests. To this end, we domain-adapt the large-scale commonsense knowledge base ATOMIC (Sap et al., 2019) to LIGHT. ATOMIC contains information relevant for everyday commonsense reasoning in the form of typed if-then relations with variables. ATOMIC is organized into a set of events, e.g. "X puts X's trust in Y", and annotated relation types such as "needs", "wants", "attributes", and "effects" that label the effects. It is designed to be a general atlas of commonsense data and so is dependent on neither a specific environment nor a character's persona and motivations.
To construct ATOMIC-LIGHT, we specifically use the relations for "intents", "effects", "wants" and "needs" and expand the subject, relation, object triples found in the graph into templated natural language sentences. These sentences are then rewritten to better reflect the fantasy LIGHT domain. Named entities and other noun phrases in ATOMIC are masked out and filled in using BERT (Devlin et al., 2018) fine-tuned using a masked language model loss on the entire LIGHT and LIGHT-Quests data. We investigate the benefits of such domain adaptation on downstream tasks in Section 4.3. An example of a clause using the wants relation in ATOMIC is as follows, "PersonX puts PersonX trust in PersonY, wants, rely on PersonY." In ATOMIC-LIGHT, this is rewritten to: "The merchant puts the merchant's trust in the guard, as a result the merchant wants to rely on the guard." Similarly, an example of an effect using the needs relation is, "Before, the merchant puts the merchant's trust in the guard, the merchant needs to be friends with the guard." ATOMIC-LIGHT contains 216686 training, 35340 validation, and 38565 test samples. Further details of the construction of this dataset are found in Appendix A.4.
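The expansion of a triple into a templated sentence can be illustrated with a minimal sketch. The two templates below are inferred from the "wants" and "needs" examples above; the templates for "intents" and "effects" are not given here and are omitted. The function names are hypothetical, and the character names are passed in by hand, whereas the text describes filling masked entities with a BERT model fine-tuned on LIGHT.

```python
# Hypothetical sketch: expand an ATOMIC-style triple into a templated sentence.
# Only the two templates shown in the text are included (assumption: exact wording).
TEMPLATES = {
    "wants": "{event}, as a result {x} wants to {tail}.",
    "needs": "Before, {event}, {x} needs to {tail}.",
}


def fill_entities(text, x, y):
    """Replace the ATOMIC placeholders with fantasy characters."""
    return text.replace("PersonX", x).replace("PersonY", y)


def triple_to_sentence(head, relation, tail, x, y):
    """Render (head, relation, tail) as one natural language sentence."""
    sentence = TEMPLATES[relation].format(
        event=fill_entities(head, x, y),
        x=x,
        tail=fill_entities(tail, x, y),
    )
    # Capitalize the first character without touching the rest.
    return sentence[0].upper() + sentence[1:]


wants = triple_to_sentence(
    "PersonX puts PersonX's trust in PersonY", "wants", "rely on PersonY",
    x="the merchant", y="the guard",
)
needs = triple_to_sentence(
    "PersonX puts PersonX's trust in PersonY", "needs", "be friends with PersonY",
    x="the merchant", y="the guard",
)
print(wants)
print(needs)
```

With these inputs the sketch reproduces the two rewritten sentences quoted above.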

Agents that Act and Speak
This section describes the creation of the agents that learn to act and speak conditioned on their motivations in the LIGHT environment. The overall architecture and training are first outlined, followed by a detailed discussion on types of encoder pretraining.

LIGHT RL Environment
The environment as seen in Figure 4 consists of three components. The first is a partner agent, which is a model trained to play other agents in the game, as in (Prabhumoye et al., 2020). Next is the game engine, which determines the effects of actions on the underlying game graph (Urbanek et al., 2019). Finally, there is the Dungeon Master (DM), which is trained to score the naturalness of dialogue.
Partner Agent. The partner agent is a poly-encoder transformer model (Humeau et al., 2020).
Action Rewards via the Game Engine. All actions, either those of the agent-in-training or the partner agent, are processed by the engine, checking for goal state completion; these goals are henceforth known as act goals. For example, if the LIGHT agent had the motivation to acquire a sword, the goal could be completed via:
1. self act completion: where the agent acquires a sword itself by picking it up, stealing it, convincing the partner to drop theirs so that the agent can pick it up, etc.
2. partner act completion: where the agent uses speech to convince their partner to achieve the goal for them (e.g., by persuading the partner to give them the sword).
Reaching an act goal provides a reward r_a of 1; otherwise r_a is 0. At each step, the engine also provides us with the set of valid actions. These are the subset of the action space A that are guaranteed to be a valid change to the world from the current state s_t, e.g. an action to give your partner a sword cannot be valid unless you possess the sword.
Speech Rewards via the Dungeon Master. Following prior works on using transformers for automatic evaluation of natural language generation (Sellam et al., 2020), we utilize a learned model, the Dungeon Master (DM), to score the agent's ability to speak. The DM used here is a poly-encoder model trained on collected human quest demonstrations as well as the original conversations in LIGHT. It is conditioned on quests and motivations and is thus able to provide a (noisy) indication of how natural the agent's dialogue utterances are given its immediate context, similarly to the function of the DM during the data collection process. Given the dialogue portion of a human quest demonstration of length n, the DM returns a reward r_u of 1/(2n) if an utterance was in the demonstration (for a maximum of one time per episode for each utterance from the demonstration). A further 1/(2n) is given each time the utterance is scored as being within the top-k most likely utterances by the DM. This naturalness objective will henceforth be referred to as a speech goal. These rewards are thus also denser than act goals, helping the agent learn overall. Further, similarly to the game engine, the DM also provides a set of M valid utterances, which are the M most likely dialogue candidates from the candidate set for the current context.
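The speech reward described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name, the `already_rewarded` set used to enforce the once-per-episode rule, and the representation of the DM's top-k candidates as a plain list are all assumptions.

```python
# Hypothetical sketch of the speech reward r_u for one utterance.
def speech_reward(utterance, demo_utterances, already_rewarded, dm_top_k):
    """Return the speech reward for a single utterance.

    demo_utterances: dialogue portion of the human demonstration (length n).
    already_rewarded: set of demonstration utterances rewarded this episode
                      (mutated in place; assumption about bookkeeping).
    dm_top_k: the k utterances the DM ranks as most likely in this context.
    """
    n = len(demo_utterances)
    share = 1.0 / (2 * n)
    reward = 0.0
    # 1/(2n) for matching the demonstration, at most once per utterance.
    if utterance in demo_utterances and utterance not in already_rewarded:
        already_rewarded.add(utterance)
        reward += share
    # A further 1/(2n) each time the DM ranks the utterance in its top-k.
    if utterance in dm_top_k:
        reward += share
    return reward


demo = ["Good day, Ser Wizard.", "May I clean the tapestry?"]
seen = set()
first = speech_reward(demo[0], demo, seen, dm_top_k=[demo[0], "Hello."])
repeat = speech_reward(demo[0], demo, seen, dm_top_k=[demo[0], "Hello."])
miss = speech_reward("Nice weather.", demo, seen, dm_top_k=[demo[0]])
```

Here `first` receives both shares, `repeat` receives only the top-k share because the demonstration match was already counted, and `miss` receives nothing.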

Training a LIGHT agent with Switch Reinforcement Learning
The overall architecture of our agent is shown in Figure 4. It consists of an encoder, a switch, an action network, and a dialogue network. First, we construct the action spaces, factorized into actions and utterances. The possible actions are the set of all actions taken in the demonstrations (4710 total) and the possible utterances are all utterances from the demonstrations (22672 total). The encoder network processes the setting, persona, motivation, as well as the full history of actions and dialogues performed by the agent and the partner, input as a text sequence. The features from the encoder, which here are the hidden states at the final layer of a transformer, are used as input by all following components of the agent. In Section 5 we show how different encoder training data affects the model. Next, a switch module makes the decision regarding whether the agent should act or talk in the current context and activates the corresponding policy network. In this work, the switch is simple: it outputs an action every k dialogue utterances, where during training k is chosen to match the ratio of utterances to actions on that particular quest from the human demonstrations, and during testing k is chosen to match the average utterance-to-action ratio. Both the action and dialogue policies consist of a single GRU layer followed by an n-layer feed-forward network given input features from the encoder. Once the LIGHT agent has output an utterance or action, it is processed by the environment: the partner agent, the game engine, and the DM.
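A fixed-interval switch of this kind can be sketched in a few lines. The class and function names are hypothetical, and rounding the ratio to the nearest integer (with a floor of 1) is an assumption, since the text does not say how a fractional ratio is turned into k.

```python
# Hypothetical sketch of the switch: speak k times, then act once, and repeat.
def k_from_ratio(num_utterances, num_actions):
    """Pick k from an utterance-to-action ratio (rounding is an assumption)."""
    if num_actions <= 0:
        raise ValueError("need at least one action to define the ratio")
    return max(1, round(num_utterances / num_actions))


class Switch:
    def __init__(self, k):
        self.k = k
        self.utterances_since_action = 0

    def next_mode(self):
        """Return "act" after every k utterances, otherwise "speak"."""
        if self.utterances_since_action >= self.k:
            self.utterances_since_action = 0
            return "act"
        self.utterances_since_action += 1
        return "speak"


# Using the average demonstration lengths reported earlier (10.74 and 2.18).
test_time_k = k_from_ratio(10.74, 2.18)
switch = Switch(k=2)
modes = [switch.next_mode() for _ in range(6)]
```

In a full agent, the returned mode would select which policy network (action or dialogue) consumes the encoder features at that step.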
We use A2C (Mnih et al., 2016) to train the LIGHT agent, treating the two policy networks as two separate actors with a shared critic. The shared critic is motivated by the concepts of self act completion and partner act completion seen in Section 4.1 where the LIGHT agent can speak to convince the partner to achieve an act goal. Each agent in a batch is initialized via priority sampling (Graves et al., 2017) with a different quest, i.e. quests that the agent has historically successfully completed less often are given a greater weight when sampling from the pool of all possible training quests. In addition to a normal entropy regularization term, we also add a regularization term that encourages the models to produce "valid" outputs as judged by the game engine and the DM for actions and utterances respectively. Additional training details are found in Appendix B.2.
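The quest sampling step can be illustrated with a small sketch. The weighting formula (one minus the historical success rate, plus a small constant so that solved quests are still occasionally revisited) is an assumption chosen for illustration; the text only states that less frequently completed quests receive greater weight. Function names are hypothetical.

```python
import random


# Hypothetical sketch: weight quests inversely to historical success.
def quest_weights(successes, attempts, eps=0.05):
    """Map each quest to a sampling weight; never-tried quests get the maximum."""
    weights = {}
    for quest, tried in attempts.items():
        rate = successes.get(quest, 0) / tried if tried else 0.0
        weights[quest] = (1.0 - rate) + eps
    return weights


def sample_quests(weights, batch_size, rng):
    """Draw distinct quests, one per agent in the batch, by weighted sampling."""
    pool = dict(weights)
    chosen = []
    for _ in range(min(batch_size, len(pool))):
        quests = list(pool)
        pick = rng.choices(quests, weights=[pool[q] for q in quests], k=1)[0]
        chosen.append(pick)
        del pool[pick]  # without replacement, so each agent gets a different quest
    return chosen


attempts = {"recover_egg": 10, "clean_tapestry": 10, "new_quest": 0}
successes = {"recover_egg": 9, "clean_tapestry": 2}
weights = quest_weights(successes, attempts)
batch = sample_quests(weights, batch_size=2, rng=random.Random(0))
```

Here the rarely solved and never attempted quests carry larger weights than the one the agent usually completes, so they are drawn more often over many batches.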

Encoder Pre-training Tasks
Prior work on commonsense reasoning in supervised natural language learning (Bosselut et al., 2019) suggests that the encoder is key to overcoming the challenges posed by the LIGHT-Quests dataset even in an RL setting. We describe a series of encoder pre-training tasks, designed to help the LIGHT agent either act more consistently or speak more naturally.
ATOMIC-LIGHT. As seen in Section 3, ATOMIC-LIGHT is a (domain-adapted) fantasy commonsense knowledge graph, and as such provides priors for an agent on how to act consistently in the world. For example, given a clause such as "The knight wishes to slay the dragon, as a result the knight needs to acquire a sword," the task would be to predict the underlined text, a form of knowledge graph completion (Wang et al., 2017).
Reddit. We use a previously existing Reddit dataset extracted and obtained by a third party and made available on pushshift.io (Baumgartner et al., 2020), as used in Roller et al. (2020). This dataset has been used in several existing dialogue-based studies and has been shown to result in more natural conversations (Yang et al., 2018; Mazaré et al., 2018).

LIGHT-Original
The original LIGHT dataset (Urbanek et al., 2019) is organized similarly to the human demonstrations found in LIGHT-Quests, i.e. an interspersed sequence of dialogue and actions collected from humans role-playing a character. The task itself is to predict the next action or utterance given the prior dialogue history as well as the current setting and persona for a character. They are collected in a chit-chat fashion, with no notion of objectives, and so provide priors on how to generally act consistently and speak in a fantasy world, but not directly how to complete quests.

LIGHT-Quests
Pre-training with this newly introduced dataset consists of three tasks. (1) Bag-of-action timeline prediction in which, given a quest consisting of setting, persona, and motivations, any one of the actions in the timeline must be predicted. (2) Sequential timeline prediction in which, given a quest consisting of setting, persona, motivations, and the first n actions in the timeline, the (n+1)-th action must be predicted. (3) Predict the next dialogue utterance given a human demonstration in a manner similar to the LIGHT-original tasks. The first two tasks are designed to help the agent act consistently and the third to help it speak naturally with respect to its motivations.

Table 1: Encoder Type RL Zero-Shot Evaluations averaged over 3 independent runs. Act goals and speech goals are as described in Section 4.1. Standard deviations for all experiments are less than 0.01. The "Act & Speech Goals" column refers to quests where the agent has simultaneously achieved both types of goals within the episode. Human act goal completion = 0.6 as measured during the second phase of the LIGHT-Quests data collection.
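The difference between the two timeline prediction tasks can be made concrete with a sketch of how training pairs might be built from one quest. The function names and the (input, target) tuple layout are assumptions for illustration, as is starting the sequential task from an empty prefix (n = 0).

```python
# Hypothetical sketch: build (input, target) pairs for tasks (1) and (2).
def bag_of_action_examples(quest, timeline):
    """Task (1): the quest alone is the input; any timeline action is a target."""
    return [(quest, action) for action in timeline]


def sequential_examples(quest, timeline):
    """Task (2): the quest plus the first n actions predicts action n + 1."""
    return [
        ((quest, tuple(timeline[:n])), timeline[n])
        for n in range(len(timeline))
    ]


quest = "setting | persona | motivations"  # placeholder for the quest text
timeline = ["get hired from wizard", "go to secret magician's workshop", "get tapestry"]

bag = bag_of_action_examples(quest, timeline)
seq = sequential_examples(quest, timeline)
```

Both tasks yield one pair per timeline action; only the sequential variant exposes the ordering of the timeline to the model.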

Encoder Pre-training Type Ablation Study
Pre-training is done on the tasks described in Section 4.3 by training a 12-layer transformer with 256 million parameters using a cross-entropy loss as seen in (Humeau et al., 2020). These weights are then transferred to the blue shaded portion of the encoder as seen in Figure 4 and frozen. A further three randomly initialized layers are appended to the end, indicated by the red portions, into which gradients flow. This is done because optimizing all the parameters of such a model via RL over a long horizon is both data inefficient and computationally infeasible. Additional hyperparameter details are found in Appendix B.1. We investigate the following five pre-training models to see how they compare on act and speech goal completions when trained with RL and in a supervised manner with behavior cloning:
Scratch. No pre-training is done; the encoder is a 3-layer randomly initialized transformer and is trained along with the policy networks.
General. Multi-task trained using both pushshift.io Reddit and the commonsense dataset ATOMIC-LIGHT, giving the agent general priors on how to act and speak.
Light. Multi-task trained on all tasks in LIGHT-original and LIGHT-Quests, giving the agent priors on how to act and speak with motivations in the LIGHT fantasy domain.
General+Light. Multi-task trained on all tasks used in the General and Light models.
Adaptive. Here we adaptively train a General+Light model that is itself first initialized from a General model, providing additional regularization to help balance between Light and General tasks.
Table 1 describes the results for this ablation. Models were each zero-shot evaluated on 211 human demonstrations from the LIGHT-Quests test set for a single episode per quest across three independent runs. Figure 5 shows learning curves during training for each encoder type. We first see that training with RL, i.e. with interactivity and environment grounding during training, results in higher performance than behavior cloning for all the models. In both RL and behavior cloning settings the Adaptive model outperforms all others in all the metrics.
When trained supervised (behavioral cloning), we see trends mirroring standard pre-training in static text corpora. Transfer is easy and the Scratch model performs significantly worse than all others; and each new task added improves the agent's ability to speak and act. In particular, we see that Light outperforms General, showing that the more similar the pre-training tasks are to the downstream tasks, the better the supervised performance.
However, these trends do not hold in the RL setting. The Scratch model outperforms everything except the Adaptive model, and General outperforms Light. In part, this may be due to specification gaming (Krakovna et al.); however, Adaptive does strongly outperform Scratch in goals with dialogue. This suggests that transfer (and fine-tuning) is not as simple in the RL setting as in the supervised setting, but can still be useful if carefully done. We note that domain-adaptive pre-training (intermediate task transfer) has previously been shown to give modest gains in supervised learning (Phang et al., 2018; Gururangan et al., 2020), but not with the large effects seen here for RL. Figure 5 further shows that with the right combination of tasks, not only is the generalization performance better, but training itself is more sample efficient, requiring fewer steps before reaching asymptotic performance.

Ability Type Ablation Study
To better understand the interplay between acts and speech resulting in self and partner act goal completions, we perform an ablation study selectively dropping either the agent's ability to talk or act. We train the agent to only act, to only speak, or to only speak with only action rewards. In the scenarios where the agent can only speak, the agent has to convince the partner to help achieve the agent's goal.
The results are outlined in Table 2. Unsurprisingly, when trained to only act, the act goal completion rate increases over when it can both act and speak. Similarly, when trained to only speak, the speech goal completion rates also increase. We can draw two conclusions from these results: (1) it is much easier to do an action yourself than to convince the partner to do it; and (2) removing speech goals increases the act goal completion rates, corresponding to higher partner act completions. Thus, the sequences of dialogue utterances required to convince the partner to achieve the agent's goal are likely often at odds with those sequences required to maximize speech goals.

Conclusion
Operating on the hypothesis that interactivity is key to language learning, we introduce two datasets (a set of quests based on character motivations in fantasy worlds, LIGHT-Quests, and a large-scale commonsense knowledge graph, ATOMIC-LIGHT) and a reinforcement learning system that leverages transformer-based pre-training to facilitate development of goal-driven agents that can act and speak in situated environments. Zero-shot evaluations on a set of novel human demonstrations show that we have trained agents that act consistently and speak naturally with respect to their motivations. A key insight from our ablation study testing for zero-shot generalization on novel quests is that large-scale pre-training in interactive settings requires careful selection of pre-training tasks, balancing between giving the agent "general" open-domain priors and those more "specific" to the downstream task, whereas static methodologies require only domain-specific pre-training for effective transfer but are ultimately less effective than interactive methods.

Broader Impacts
The ability to speak and act in these textual fantasy worlds has implications for domains beyond text-games. We view text-games as a platform on which to teach agents how to communicate effectively using natural language and to plan via sequential decision making in situations that may not be anticipated. Given that our methods rely on deep learning and reinforcement learning techniques operating on language, they are prone to the same pitfalls as other contemporary dialogue and text-game systems. We mitigate, though do not entirely eliminate, the two main pitfalls that our particular system is prone to: (1) non-normative language usage (describing situations that fictional characters may engage in that are inappropriate for the real world), by restricting our system to a retrieval rather than a generative system, enabling us to filter the possible outputs of the agent; and (2) dataset bias, via curation through controlled crowdsourcing in the case of LIGHT-Quests. The methods used to debias the original LIGHT dataset can be found in Dinan et al. (2020).

A Appendix - Datasets
A.1 LIGHT-Quests

A.1.1 Mechanical Turk Data Collection
Crowdworkers are required to first pass an onboarding test before they are allowed to perform the actual task. Figures 6, 7, 8, 9, and 10 describe first the instructions given to the crowdworkers and then 4 phases of the on-boarding test. We paid workers $2.75 per task. This amount was determined by first running the task ourselves to estimate a completion time of 10-12 minutes per task, and then running pilot tasks that confirmed the average task duration for workers was close to 10 minutes. Figure 11 shows the example of the actual task given to the crowdworkers and Figure 12 shows the user interface for the first phase of the LIGHT-Quests data collection task described in Section 3.1.

A.2 Human Demonstration Collection
In order to collect the human completions of quests in the LIGHT environment, we created a game setup where humans could interact with models while playing LIGHT characters in LIGHT settings. We trained a ranking dialogue model on the utterances in the LIGHT dataset.
Using this, players could now assume the role of a LIGHT character and interact with the model. In order to try to control for quality of the quest completions, we used the same ranking model to rank the scores of the player in the dialogues. Players who gave responses that the model ranked as likely candidates would receive more points.
Only after scoring enough cumulative points were players allowed to try completing quests. The quest setup was a slight variation of the conversation setup. First, the player was given one of the collected quest scenarios rather than just a chat setup. Players receiving a quest would be provided with one of the motivations alongside their persona.
In the dialogue that followed, players were given the chance to take action after enough in-character dialogue turns. If the player took the correct action, they were awarded with more points to confirm they completed their given quest.

A.3 Examples
We present 3 randomly selected examples of quests and corresponding human demonstrations.

Setting
You are in the swamp. The swamp is glowing with wonder and color. There are parts that range from dark red to bright yellow. People often visit here to speak with the gods and claim it can be both harmful to those it dislikes and healing to those who it deems worthy. There's a pit of quicksand and a swamp flower here. A witch is here.

Partner:
Witch. Persona I grew up in a nearby village, and was exiled when it was found that I had special abilities. My parents were ostracized as well.

Setting
This is the hidden workshop of the most powerful wizard in the land. There are ornate tapestries on the walls depicting wizards using their powers and potions in battle. Mordak, the wizard, constructed this powerful workshop after the death of the most famous king, Henry of Silverton. Any who enter here immediately become enchanted with the wizard's power, giving them advanced healing powers. There's a tapestry, a potion, and a tome here. The wizard is here.

Partner:
Wizard. Persona I am a wizard who develops my own spells. Most of them aren't particularly effective spells, but I'm curious about all the magical possibilities. People are afraid to participate in my experiments. Carrying Nothing.

Self:
Apprentice. Persona I am your apprentice. Please tell me what I can help you with. I will cook and serve your meals. I will clean the castle. I can do anything you ask. You have hired me to make your life easier. Carrying Nothing.

Motivations:
Short: I need to get the tapestry to clean it.
Mid: I need to make this workshop suitable for the wizard.
Long: I was hired to keep this place cleaned and in perfect condition for the wizard.

Timeline:
-2 hours: get hired from wizard
-15 min: go to secret magician's workshop
Now: get tapestry
+5 min: wield tool
+10 min: hit tapestry
+30 min: put tapestry in wall
+4 hours: drop tool

Good day Ser Wizard. Your tower is decorated with beautiful tapestries, though their colors appear to be dulled due to dust. May I take it and clean it?
Why not, it is infused isn't it. Just don't be waving it around this room, it might get dangrous Of course, I will handle it with the utmost care.
How long have you been an apprentice?
get tapestry 3 years Ser. I'm hoping to learn to be a wizard or to become a knight. Or both! Wouldn't that be grand?
How wonderful. What encouraged you to pursue it? Curiosity mostly. I hope to make the world a better place, and one of the best ways to do that is vanquishing evil What got you into that occupation then? I was born with affinity for magic so it was my calling.
hug wizard As I said, curiosity. I am a high born boy, the third son, so I cannot inherit my father's lands. So I must make my mark on the world another way You are well suited to it and I am sure your parents are proud of you.

Setting
You are in the The Queen's Chamber. This is a beautiful room inside of the palace that is decorated with the finest silk and velvet. The color scheme used represents royalty, royal blue, red, green and purple. The walls are covered in gold and in each corner of the room are golden statues of Greek art. The floors are covered in marble, and despite the patterns, shine so brightly you can even see your own reflection in them! There's also a bed big enough to fit five people on! There's two statues, an a bed big, a the finest silk and velvet, an a bed, and a finest silk and velvet here. The butler is here.

Partner:
Butler. Persona I serve my masters quietly. I know all the secrets of the elite but will never tell a soul. I have lived in this home since I was 12. Carrying Nothing.

Self:
Jester. Persona I am the fun guy. I like to entertain others in the village. I am the local jester. Carrying Nothing.
Why so down with the life feels huh I can't complain (because the king will punish me) everyone wishes they could be the king.

Motivations
hug butler I appreciate the kind words, dear jester.
I'm here for ya. To cheer you up That is kind of you, not everyone has liked me here, I am the queen's least favorite person.
Well I like you much more than the queen.

A.4 ATOMIC-LIGHT
ATOMIC-LIGHT is constructed by first fine-tuning a BERT-large model (Devlin et al., 2018) on all setting, object, and character descriptions in LIGHT, in addition to all the human demonstrations found in LIGHT and LIGHT-Quests. As seen in Section 3.2, all nouns (e.g. PersonX or PersonY) and noun phrases are masked out, and we use the tuned BERT model to fill them in, in a manner similar to Lawrence et al. (2019). When filling in tokens, the BERT model is restricted to a vocabulary consisting of all nouns (N or NN) in LIGHT, and to a vocabulary constructed from all of LIGHT for the rest of the noun phrase (NP).
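The vocabulary restriction described above can be illustrated with a minimal sketch. It assumes the masked language model has already produced a score for each candidate token at the masked position; the toy scores, the token names, and the helper `fill_mask` are hypothetical stand-ins and not the authors' implementation.

```python
def fill_mask(scores, allowed_vocab):
    """Pick the highest-scoring token that belongs to the allowed vocabulary.

    scores: dict mapping token -> unnormalized score (a hypothetical
            stand-in for masked-LM logits at one masked position).
    allowed_vocab: set of tokens the fill is restricted to.
    """
    candidates = {tok: s for tok, s in scores.items() if tok in allowed_vocab}
    if not candidates:
        raise ValueError("no allowed token has a score")
    return max(candidates, key=candidates.get)


# Toy scores for the masked slot in "[MASK] wants to find the treasure".
toy_scores = {"PersonX": 3.1, "the": 2.7, "knight": 2.2, "runs": 1.9, "dragon": 1.5}

# Hypothetical noun vocabulary drawn from the target domain.
light_nouns = {"knight", "dragon", "wizard"}

filled = fill_mask(toy_scores, light_nouns)
```

Tokens outside the allowed set are ignored even when they score higher, which is what keeps the filled-in text inside the fantasy domain.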
Here we present 3 examples from ATOMIC-LIGHT as seen in Section 3.2 for each of the 4 relation types used: "wants", "needs", "intents", and "effects".

Model types are the same as those used in the encoders in Section 5 of the main paper. All retrieval results reported are Hits@X/100. Results are reported for three settings: all timeline actions; all actions except the easiest action (the action at the "now" position in the timeline, which corresponds most closely to the short motivation as a result of the framing of the Mechanical Turk task in Figure 12); and only the easiest action. Table 3 gives details on the hyperparameters used to train the poly-encoders. Encoders were trained until validation accuracy across all the tasks did not improve for 5 epochs, or until 24 wall-clock hours had elapsed, on a machine with 8 V100 GPUs.

Some notable common trends across these tasks are:
1. Removing motivations from the input context results in significantly lower performance: on average ≈ 7 points lower accuracy for Bag of Actions Timeline prediction and on average ≈ 18 percentage points lower for Sequential Timeline prediction when averaged across the Scratch and Adaptive models. Further, the short motivation proves to be the most useful for timeline prediction tasks.
2. Pre-training on ATOMIC-LIGHT produces an average gain of ≈ 4 percentage points in accuracy on both tasks compared with training on ATOMIC alone without domain adaptation.
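For readers unfamiliar with the Hits@X/100 retrieval metric mentioned above, the following is a minimal sketch of how it is commonly computed: the model ranks 100 candidates, and an example counts as a hit if the gold candidate appears in the top X. The function names and toy candidates are hypothetical.

```python
def hits_at_k(ranked_candidates, gold, k):
    """Return 1.0 if the gold candidate is among the top-k ranked candidates."""
    return 1.0 if gold in ranked_candidates[:k] else 0.0


def mean_hits_at_k(examples, k):
    """Average Hits@k over (ranked_candidates, gold) pairs."""
    return sum(hits_at_k(ranked, gold, k) for ranked, gold in examples) / len(examples)


# Toy ranking over 100 candidates, best first (names are illustrative only).
ranking = [f"action_{i}" for i in range(100)]
examples = [
    (ranking, "action_0"),   # gold ranked 1st
    (ranking, "action_7"),   # gold ranked 8th
    (ranking, "action_42"),  # gold ranked 43rd
]

hits_1 = mean_hits_at_k(examples, 1)
hits_10 = mean_hits_at_k(examples, 10)
```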

B.2 Reinforcement Learning
This section first gives the equations referenced and the hyperparameters used, followed by additional results for the reinforcement learning tasks described in Section 4.
The additional entropy loss terms over the valid actions are designed to speed up exploration, as seen in Ammanabrolu and Hausknecht (2020). (1) Each of these loss terms is only applied to the relevant policy network, i.e. $L_A$ to the action network and $L_U$ to the dialogue network. These terms provide an additional training signal to the policy networks regarding which actions and dialogue are contextually relevant via additional entropy regularization over the valid actions. Similarly to the results found in Ammanabrolu and Hausknecht (2020), preliminary experiments in our domain suggest that these terms reduce the number of environment steps required to reach asymptotic performance by a couple of orders of magnitude.
Overall training is done via A2C (Mnih et al., 2016), a policy gradient algorithm that maximizes long-term expected reward by comparing the advantage $A(s_t, a^*_t)$ of taking an action in a state to the average value of taking a valid action as predicted by the critic $V(s_t)$. The reward at each step is $r_t = r_{A_t} + r_{U_t}$. Here, $a^*_t$ is either an action or an utterance output by the respective policy network. It is also worth noting that on steps where an action is performed, $r_{U_t}$ is always 0, but on steps where a dialogue utterance is spoken, $r_{A_t}$ may not be 0. This corresponds to the concepts of self act completion and partner act completion seen in Section 4.1, where the LIGHT agent can speak to convince the partner to achieve an act goal.

Both policies are then updated according to the policy gradient, where $\pi_S : O \to \{\pi_A, \pi_U\}$ is the switch policy that selects whether the agent acts according to $\pi_A$ or speaks according to $\pi_U$ based on the encoded state $s_t$. The additional terms in the gradient are an overall entropy loss over the entire action $A$ or utterance $U$ spaces, designed to prevent premature, sub-optimal policy convergence. Boltzmann exploration (Sutton et al., 1998) is used to sample actions from both actor networks during training.

Table 6 has the hyperparameters used in the RL experiments. Loss coefficients are separated by action and speech types; note that the ratio between the loss coefficients matches the ratio between the sizes of the action spaces. RL experiments were performed on a machine with 8 V100 GPUs for 1 million environment interactions for each actor in a batch of 32.
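As an illustration of the quantities discussed in this subsection, the sketch below combines the act and utterance rewards, computes a one-step advantage against a critic estimate, and samples from a policy with Boltzmann (softmax) exploration. The one-step form of the advantage, the discount `gamma`, and the `temperature` parameter are assumptions chosen for illustration; they are standard textbook choices and are not taken from the paper's equations or from Table 6.

```python
import math
import random


def combined_reward(r_act, r_utt):
    """Step reward as the sum of the act reward and the utterance reward."""
    return r_act + r_utt


def one_step_advantage(reward, value_s, value_next, gamma=0.99, done=False):
    """Standard one-step advantage: r + gamma * V(s') - V(s).

    gamma is an assumed discount factor, not a value from the paper.
    """
    target = reward + (0.0 if done else gamma * value_next)
    return target - value_s


def boltzmann_probs(logits, temperature=1.0):
    """Softmax over logits with a temperature (numerically stabilized)."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def boltzmann_sample(logits, temperature=1.0, rng=random):
    """Sample an index in proportion to its Boltzmann probability."""
    probs = boltzmann_probs(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# A step where an utterance is spoken and the partner completes the act goal:
# the utterance reward is 0 here, yet the act reward is non-zero.
r_t = combined_reward(r_act=1.0, r_utt=0.0)
adv = one_step_advantage(r_t, value_s=0.5, value_next=0.0, done=True)
choice = boltzmann_sample([0.1, 2.0, -1.0], temperature=1.0)
```

Higher temperatures flatten the sampling distribution and encourage exploration, while lower temperatures concentrate probability on the highest-scoring action or utterance.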

B.2.2 Learning Curves
The first set of results, seen in Figure 15, shows that both Scratch and Adaptive models gain performance across the board in terms of their ability to act and speak given more training quests. Unlike in the supervised tasks, the Scratch model generally benefits less than the Adaptive model from having more data.

B.2.3 Switch Type Ablations
The second set of results involves an ablation comparing a learned switch that uses the input training data with a hardcoded switch. The learned switch is as described in Section 4: it outputs an action every k dialogue utterances, where during training k is chosen to match the ratio of utterances to actions on that particular quest from the human demonstrations, and during testing k is chosen to match the average action to utterance ratio. The hardcoded switch is where the agent outputs an action every N steps across all quests; here N = 3 is the chosen hyperparameter. Table 7 shows that having a learned switch increases zero-shot generalization performance, and Figures 16 and 17 show that having a learned switch improves sample efficiency by enabling the LIGHT agent to reach asymptotic performance in fewer steps in both the Scratch and Adaptive models.

Table 7: Encoder Type RL Zero-Shot Evaluations averaged over 3 independent runs. Act goals and speech goals are as described in Section 4.1. Standard deviations for all experiments are less than 0.01. The "Act & Speech Goals" column refers to quests where the agent has simultaneously achieved both types of goals within the allotted one episode.
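The two schedules compared in this ablation can be sketched as simple step-based rules. This is only an illustration of the timing described in the text: the function names are hypothetical, steps are assumed to be 1-indexed, and the rounding used to derive k from a demonstration is an assumption.

```python
def hardcoded_switch(step, n=3):
    """Act on every n-th step (1-indexed), otherwise speak."""
    return "act" if step % n == 0 else "speak"


def utterances_per_action(num_utterances, num_actions):
    """Derive k from a demonstration's utterance-to-action ratio.

    Rounding to the nearest integer (minimum 1) is an assumed choice.
    """
    if num_actions == 0:
        return None  # no actions in the demonstration, so no ratio exists
    return max(1, round(num_utterances / num_actions))


def periodic_switch(step, k):
    """Speak k times, then act once: a cycle of length k + 1."""
    return "act" if step % (k + 1) == 0 else "speak"


hardcoded = [hardcoded_switch(t) for t in range(1, 7)]

# A hypothetical demonstration with 6 utterances and 2 actions gives k = 3.
k = utterances_per_action(6, 2)
per_quest = [periodic_switch(t, k) for t in range(1, 9)]
```

With the hardcoded rule the period is the same for every quest, whereas deriving k per quest lets the balance between speaking and acting follow the human demonstrations.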

B.2.4 Self Act Completion Transcripts
We pick 3 transcripts of the LIGHT agent playing the quests. Each of these transcripts is from the Adaptive model on the test set of human demonstrations during zero-shot evaluations. We pick samples where the agent achieves the act goal itself and simultaneously achieves the speech goal. The blue, right-aligned text is the LIGHT agent trained with RL, and the gray, left-aligned text is the partner agent.

Setting
The fishing store is a small one room stone building with wares laid about on tables. One can see fishing poles, wooden buckets with dirt and bait inside, along with some mounted trophies, and a skeleton.

Setting
The king's bedroom. The walls are tall and stone. They are coated with colorful tapestries showing the kings of years past. A large stone fireplace across from the bed that is large enough to keep the king warm even on the coldest nights. A double thick wooden door with a large lock on each side of the room.

Partner:
Royal dog. Persona I am the royal dog, fat, incontinent, and lazy. I eat off my own porcelain plate, when I am not hand fed by the queen. I pee where I like. I stole food from the poor when I was young, but I cannot waddle up to them fast enough these days. I sleep between the royal couple. Carrying Nothing.

Self:
Queen. Persona I was the daughter of a high ranking nobleman overseas. To make a trade alliance with the King, my parents offered me in marriage. It wasn't my idea, but it has turned out very well. I've produced two living sons as heirs, and the king treats me kindly. I spend my time doing embroidery and talking with my ladies in waiting. Carrying Nothing.

Motivation
I want to get to the large stone fireplace.
I am having turbulent thoughts regarding my faith and own morality.
I want to jump into the fireplace.
Don't, lets go to the fireplace and sit.
I go to fireplace.
get large stone fireplace