Executing Instructions in Situated Collaborative Interactions

We study a collaborative scenario where a user not only instructs a system to complete tasks, but also acts alongside it. This allows the user to adapt to the system abilities by changing their language or deciding to simply accomplish some tasks themselves, and requires the system to effectively recover from errors as the user strategically assigns it new goals. We build a game environment to study this scenario, and learn to map user instructions to system actions. We introduce a learning approach focused on recovery from cascading errors between instructions, and modeling methods to explicitly reason about instructions with multiple goals. We evaluate with a new evaluation protocol using recorded interactions and online games with human users, and observe how users adapt to the system abilities.


Introduction
Sequential instruction scenarios commonly assume only the system performs actions, and therefore only its behavior influences the world state. This ignores the collaborative potential of such interactive scenarios and the challenges it introduces. When the user acts in the world as well, they can adapt to the system abilities not only by adopting simpler language, but also by deciding to accomplish tasks themselves. The system must then recover from errors as new instructions arrive and be robust to changes in the environment that are not a result of its own actions.
In this paper, we introduce CEREALBAR, a collaborative game with natural language instruction, and design modeling, learning, and evaluation methods for the problem of sequential instruction following in collaborative interactions. In CEREALBAR, two agents, a leader and a follower, move in a 3D environment and collect valid sets of cards to earn points. A valid set is a set of three cards with distinct color, shape, and count. The game is turn-based, and only one player can act in each turn. In addition to collecting cards, the leader sends natural language instructions to the follower. The follower's role is to execute these instructions. Figure 1 shows a snapshot from the game where the leader plans to pick up a nearby card (red square) and delegates to the follower two cards, one close and the other much further away. Before that, the leader planned ahead and asked the follower to move in preparation for the next set. The agents have different skills to incentivize collaboration. The follower has more moves per turn, but can only see from first-person view, while the leader observes the entire environment but has fewer moves. This makes natural language interaction key to success.
*, **: Equal contribution. All work done at Cornell.
We address the problem of mapping the leader instructions to follower actions. In addition to the collaborative challenges, this requires grounding natural language to resolve spatial relations and references to objects, reason about dependencies on the interaction history, react to the changing environment as cards appear and disappear, and generate actions. CEREALBAR requires reasoning about the changing environment (e.g., when selecting cards) and instructions with multiple goals (e.g., selecting multiple cards). We build on the Visitation Prediction Network model (VPN; Blukis et al., 2018b), which casts planning as mapping instructions to the probability of visiting positions in the environment.
Our new model generalizes the planning space of VPN to reason about intermediate goals and obstacles, and includes recurrent action generation for trajectories with multiple goals.
We collect 1,202 human-to-human games for training and evaluation. While our model could be trained from these recorded games only, it would often fail when an instruction would start at the wrong position because of an error in following the previous one. We design a learning algorithm that dynamically augments the data with examples that require recovering from such errors, and train our model to distinguish such recovery reasoning from regular instruction execution.
Evaluation with recorded games poses additional challenges. As agent errors lead to unexpected states, later instructions become invalid. Because measuring task completion from such states is meaningless, we propose cascaded evaluation, a new evaluation protocol that starts the agent at different points in the interaction and measures how much of the remaining instructions it can complete. In contrast to executing complete sequences or single instructions, this method allows evaluating all instructions while still measuring the effects of error propagation.
We evaluate using both static recorded games and live interaction with human players. Our human evaluation shows users adapt to the system and use the agent effectively, scoring on average 6.2 points, compared to 12.7 for human players. Our data, code, and demo videos are available at lil.nlp.cornell.edu/cerealbar/.

Setup and Technical Overview
We consider a setup where two agents, a leader and a follower, collaborate. Both execute actions in a shared environment. The leader, additionally, instructs the follower using natural language. The leader goal is to maximize the task reward, and the follower goal is to execute leader instructions. We consider a turn-based version, where at each turn only one agent acts. We instantiate this scenario in CEREALBAR, a navigation card game (Figure 1), where a leader and follower move in an environment selecting cards to complete sets.
CEREALBAR Overview. The objective of CEREALBAR is to earn points by selecting valid sets of cards. A valid set has three cards with distinct color, shape, and count. When the only cards selected in the world form a valid set, the players receive a point, the selected cards disappear, three new cards are added randomly, and the number of remaining turns increases. The increase in turns decays for each set completed. An agent stepping on a card flips its selection status. The players form sets together. The follower has more steps per turn than the leader. This makes using the follower critical for success. The follower only sees a first-person view of the environment, preventing them from planning themselves, and requiring instructions to be sensible from the follower's perspective. The leader chooses the next target set, plans which of the two players should get which card, and instructs the follower. The follower cannot respond to the leader, and should not plan themselves, or risk sabotaging the leader's plan, wasting moves and lowering their potential score. Followers mark an instruction as finished before observing the next one. This provides alignment between instructions and follower actions. In contrast to the original setup that we use for data collection, in our model (Section 4), we assume the follower has full observability, leaving the challenge of partial observability for future work.
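The valid-set rule and the selection toggle described above can be sketched in a few lines. This is a minimal illustration, not the game's implementation: the `Card` type, its string and integer property values, and the function names are assumptions made for the example.

```python
from typing import Iterable, NamedTuple


class Card(NamedTuple):
    """A card described by the three properties that define a valid set (hypothetical type)."""
    color: str
    shape: str
    count: int


def is_valid_set(selected: Iterable[Card]) -> bool:
    """Return True if exactly three cards are selected and they are
    pairwise distinct in color, shape, and count."""
    cards = list(selected)
    if len(cards) != 3:
        return False
    return all(
        len({getattr(card, prop) for card in cards}) == 3
        for prop in ("color", "shape", "count")
    )


def step_on_card(selected: frozenset, card: Card) -> frozenset:
    """Stepping on a card flips its selection status."""
    return selected ^ {card}
```

Because a point is awarded only when the selected cards in the world form a valid set, a check like `is_valid_set` would be applied to the full set of currently selected cards after every toggle.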
Appendix A provides further game design details.
Problem Setup. We distinguish between the world state and the interaction state. Let $\mathcal{S}$ be the set of all world states, $\Gamma$ be the set of all interaction states, and $\mathcal{X}$ be the set of all natural language instructions. A world state $s \in \mathcal{S}$ describes the current environment. In CEREALBAR, the world state describes the spatial environment, the location of cards, whether they are selected or not, and the location of the agents. An interaction state $\gamma \in \Gamma$ is a tuple $\langle \bar{Q}, \alpha, \psi \rangle$. The first-in-first-out queue $\bar{Q} = [\bar{x}_q, \dots, \bar{x}_{q'}]$ contains the instructions $\bar{x}_i \in \mathcal{X}$ available to execute. The current instruction is the left-most instruction $\bar{x}_q$. The current turn-taker $\alpha \in \{\text{Leader}, \text{Follower}\}$ indicates the agent currently executing actions, and $\psi \in \mathbb{N}_{\geq 0}$ is the number of steps remaining in the current turn.
At each time step, the current turn-taker agent takes an action. An action may be the leader issuing an instruction, or either agent performing an action in the environment. Let $\mathcal{A} = \mathcal{A}_w \cup \{\text{DONE}\} \cup \mathcal{X}$ be the set of all actions. The set $\mathcal{A}_w$ includes the actions available to the agents in the environment. In CEREALBAR, this includes moving forward or backward, and turning left or right. Moving onto a card flips its selection status. DONE indicates completing the current instruction for the follower or ending the turn for the leader. An instruction action $a = \bar{x} \in \mathcal{X}$ can only be taken by the leader and adds the instruction $\bar{x}$ to the queue $\bar{Q}$. The effect of each action is determined by the transition function $T : \mathcal{S} \times \Gamma \times \mathcal{A} \rightarrow \mathcal{S} \times \Gamma$, which is formally defined in Appendix B. Only world actions $a \in \mathcal{A}_w$ decrease the remaining steps $\psi$.
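As a rough illustration of how the three action types affect the interaction state, the sketch below updates only the queue and the step counter. It is not the transition function of Appendix B: the class and function names are hypothetical, removing the head of the queue on a follower DONE is an assumption, and turn switching and world-state changes are left out.

```python
from collections import deque
from dataclasses import dataclass, field

WORLD_ACTIONS = {"FORWARD", "BACKWARD", "LEFT", "RIGHT"}  # the set A_w
DONE = "DONE"


@dataclass(frozen=True)
class Instruction:
    """An instruction action: natural language text issued by the leader."""
    text: str


@dataclass
class InteractionState:
    queue: deque = field(default_factory=deque)  # FIFO queue of instruction texts
    turn_taker: str = "Leader"                   # "Leader" or "Follower"
    steps_remaining: int = 0                     # psi


def update_interaction_state(state: InteractionState, action) -> InteractionState:
    """Return a new interaction state after one action (simplified sketch)."""
    new = InteractionState(deque(state.queue), state.turn_taker, state.steps_remaining)
    if isinstance(action, Instruction):
        if new.turn_taker != "Leader":
            raise ValueError("only the leader can issue instructions")
        new.queue.append(action.text)  # instructions do not consume steps
    elif action in WORLD_ACTIONS:
        if new.steps_remaining <= 0:
            raise ValueError("no steps remaining in this turn")
        new.steps_remaining -= 1       # only world actions decrease psi
    elif action == DONE:
        # Assumption: a follower DONE removes the completed instruction.
        # A leader DONE ends the turn; turn switching is omitted here.
        if new.turn_taker == "Follower" and new.queue:
            new.queue.popleft()
    else:
        raise ValueError(f"unknown action: {action!r}")
    return new
```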
The goal of the leader is to maximize the total reward of the interaction. An interaction $\bar{I} = \langle (s_1, \gamma_1, a_1), \dots, (s_{|\bar{I}|}, \gamma_{|\bar{I}|}, a_{|\bar{I}|}) \rangle$ is a sequence of state-action tuples, where $T(s_i, \gamma_i, a_i) = (s_{i+1}, \gamma_{i+1})$. The reward function $R : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ assigns a numerical reward to a world state and an action. The total reward of an interaction $\bar{I}$ is $\sum_{i=1}^{|\bar{I}|} R(s_i, a_i)$. In CEREALBAR, the agents receive a reward of 1 when a valid set is selected.
Task. Our goal is to learn a follower policy to execute the leader instructions. At time $t$, given the current world and interaction states $s_t$ and $\gamma_t$, and the interaction so far $\bar{I}_{<t}$, the follower policy $\pi(s_t, \gamma_t, \bar{I}_{<t})$ predicts the next action $a_t$.
Model. We decompose the follower policy $\pi(s_t, \gamma_t, \bar{I}_{<t})$ into two stages. The first stage predicts a set of distributions over positions in the environment, including positions to visit, intermediate goals (e.g., cards to select), positions to avoid (e.g., cards not to touch), and positions that are not passable. These distributions are used in a second stage to generate a sequence of actions. Section 4 describes the model.
Learning. We assume access to a set of $N$ recorded interactions $\{\bar{I}^{(i)}\}_{i=1}^{N}$, and create examples where each instruction is paired with a sequence of state-action tuples. We minimize the action-level cross-entropy objective, and use two auxiliary objectives (Section 5). We first train each stage of the model separately, and then fine-tune them jointly. During fine-tuning, we continuously generate additional examples using model failures. These examples help the agent to learn how to recover from errors in prior instructions.
Evaluation. We measure correct execution of instructions and the overall game reward. We assume access to a test set of $M$ recorded interactions. We measure instruction-level and interaction-level performance, and develop cascaded evaluation, an evaluation protocol that provides a more graded measure than treating each interaction as a single example, while still accounting for error propagation (Section 6). Finally, we conduct online evaluation with human leaders.

Related Work
Goal-driven natural language interactions have been studied in various scenarios, including dialogue where only one side acts in the world (Anderson et al., 1991; Williams et al., 2013; Vlachos and Clark, 2014; de Vries et al., 2018; Kim et al., 2019; Hu et al., 2019), coordination for agreed selection of an object (He et al., 2017; Udagawa and Aizawa, 2019), and negotiation (Lewis et al., 2017; He et al., 2018). We focus on collaborative interactions where both the user and the system perform sequences of actions in the same environment. This allows the user to adapt to the language understanding ability of the system and balance between delegating goals to it and accomplishing them themselves. For example, a user may decide to complete a short but hard-to-describe task and delegate to the system a long but easy-to-describe one. In prior work, in contrast, recovery is limited to users paraphrasing their requests. The Cards corpus (Djalali et al., 2011, 2012; Potts, 2012) was used for linguistic analysis of collaborative bi-directional language interaction. The structure of collaborative interactions was also studied using Wizard-of-Oz studies (Lochbaum, 1998; Sidner et al., 2000; Koulouri and Lauria, 2009). In contrast, we focus on building agents that follow instructions. Ilinykh et al. (2019) present a corpus for the related task of natural language coordination in navigation. Collaboration has also been studied for emergent communication (e.g., Andreas et al., 2017; Evtimova et al., 2017).
Understanding sequences of natural language utterances has been addressed using semantic parsing (e.g., Miller et al., 1996; MacMahon et al., 2006; Chen and Mooney, 2011; Artzi and Zettlemoyer, 2013; Artzi et al., 2014; Long et al., 2016; Iyyer et al., 2017; Suhr et al., 2018; Arkin et al., 2017; Broad et al., 2017). Interactions were also used for semantic parser induction (Artzi and Zettlemoyer, 2011; Thomason et al., 2015; Wang et al., 2016). These methods require hand-crafted symbolic meaning representation, while we use low-level actions (Suhr and Artzi, 2018). The interactions in our environment interleave actions of both agents with leader utterances, an aspect not addressed by these methods. Executing single instructions has been widely studied (e.g., Tellex et al., 2011; Duvallet et al., 2013; Misra et al., 2017, 2018; Anderson et al., 2018; Blukis et al., 2018a,b; Chen et al., 2019). The distinction we make between actions specified in the instruction and implicit recovery actions is similar to how Artzi and Zettlemoyer (2013) use implicit actions for single instructions. Finally, our model is based on the VPN model of Blukis et al. (2018b). While we assume full observability, their original work did not. This indicates that our model is likely to generalize well to partially observable scenarios.

Model
We use a two-stage model for the follower policy $\pi(s_t, \gamma_t, \bar{I}_{<t})$, where $s_t$ is a world state, $\gamma_t$ is an interaction state, and $\bar{I}_{<t}$ is the interaction history. The instruction $\bar{x}$ that is first in the queue $\bar{Q}_t$, which is part of $\gamma_t$, is the currently executed instruction. In our model, we assume the follower observes the entire environment. First, we map $\bar{x}$ and $s_t$ to distributions over locations in the environment, including what locations to visit and what are the goals. These distributions are considered as an execution plan, and are used to generate a sequence of actions in the second stage. The distributions can also be used to easily visualize the agent plan. The first stage is used when starting a new instruction, and the predicted distributions are re-used for all actions for that instruction. Figure 2 illustrates the architecture and the distributions visualization. The two-stage approach was introduced by Blukis et al. (2018b). We generalize its planning space and add a recurrent action generator for execution.
Input Representation. The inputs to the first stage are the instruction $\bar{x}$ and the world state $s_t$. We generate feature maps for both. We use a learned embedding function $\phi^X$ and a bi-directional recurrent neural network (RNN; Elman, 1990) with a long short-term memory cell (LSTM; Hochreiter and Schmidhuber, 1997) $\mathrm{RNN}^X$ to map $\bar{x}$ to a vector $\mathbf{x}$. The world state $s_t$ is a 3D tensor that encodes the properties of each position. The dimensions of $s_t$ are $P \times W \times H$, where $P$ is the number of properties, and $W$ and $H$ are the environment width and height. Each of the $W \times H$ positions is represented in $s_t$ as a binary vector of length $P$. For example, a position with a red hut will have 1's for the red and hut dimensions and 0's for all other dimensions. We map the world state to a tensor feature map $F_0$ by embedding $s_t$ and processing it using the text representation $\mathbf{x}$. We use a learned embedding function $\phi^S$ to map each position vector to a dense embedding of size $N_s$ by summing embeddings of each of the position's properties. The embeddings are combined to a tensor $\mathbf{S}$ of dimension $N_s \times W \times H$ representing a featurized global view of the environment. We create a text-conditioned state representation by creating a kernel $\mathbf{K}_s$ and convolving with it over $\mathbf{S}$. We use a linear transformation to create $\mathbf{K}_s = W_s \mathbf{x} + b_s$, where $W_s$ and $b_s$ are learned weights. We reshape $\mathbf{K}_s$ to a $1 \times 1$ convolution kernel with $N_s$ output channels, and compute $\mathbf{S}' = \mathbf{S} * \mathbf{K}_s$. We concatenate $\mathbf{S}$ and $\mathbf{S}'$ along the channel dimension, and rotate and center the result so the follower position is at the center pixel to generate $F_0$.
Stage 1: Plan Prediction. We treat plan generation as predicting distributions over positions $\rho$ in the environment. There are $W \times H$ possible positions. We predict four distributions: (a) $p(\rho \mid s_t, \bar{x})$, the probability of visiting $\rho$ while executing the instruction $\bar{x}$; (b) $p(\text{GOAL} = 1 \mid \rho, s_t, \bar{x})$, the binary probability that $\rho$ is a goal (i.e., GOAL = 1 when containing a card to select); (c) $p(\text{AVOID} = 1 \mid \rho, s_t, \bar{x})$, the binary probability that the agent must not pass in $\rho$ (i.e., AVOID = 1 when it contains a card that should not be touched); and (d) $p(\text{NOPASS} = 1 \mid \rho, s_t, \bar{x})$, the binary probability the agent cannot pass in $\rho$ (i.e., NOPASS = 1 when it contains another object).
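The text-conditioned state representation from the input representation paragraph (a kernel created from the instruction vector, a 1 × 1 convolution over the embedded environment, and channel-wise concatenation) can be sketched with nested lists. This is only an illustration under assumptions: the function names are hypothetical, the kernel is reshaped in row-major (output channel, input channel) order, and the rotation and centering around the follower are omitted.

```python
from typing import List

Tensor3 = List[List[List[float]]]  # channels x width x height


def text_kernel(W_s: List[List[float]], b_s: List[float],
                x: List[float], n_s: int) -> List[List[float]]:
    """Compute K_s = W_s x + b_s and reshape it to an n_s x n_s 1x1 kernel."""
    flat = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W_s, b_s)]
    if len(flat) != n_s * n_s:
        raise ValueError("W_s must have n_s * n_s rows")
    return [flat[o * n_s:(o + 1) * n_s] for o in range(n_s)]


def conv1x1(S: Tensor3, K: List[List[float]]) -> Tensor3:
    """Apply a 1x1 convolution: mix channels independently at each position."""
    width, height = len(S[0]), len(S[0][0])
    return [
        [[sum(K[o][c] * S[c][w][h] for c in range(len(S))) for h in range(height)]
         for w in range(width)]
        for o in range(len(K))
    ]


def text_conditioned_state(S: Tensor3, W_s, b_s, x) -> Tensor3:
    """Concatenate S and S' = S * K_s along the channel dimension."""
    K_s = text_kernel(W_s, b_s, x, len(S))
    return S + conv1x1(S, K_s)  # 2 * n_s channels
```

A 1 × 1 kernel mixes channels without looking at neighboring positions, so the instruction can re-weight which properties of each position matter before LINGUNET reasons spatially.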
Figure 2: Illustration of the model architecture. Given the instruction $\bar{x}$ and the world state $s$, we compute $F_0$ from the embeddings of the instruction $\mathbf{x}$ and environment $\mathbf{S}$. We use LINGUNET to predict four distributions (trajectory, goal, locations to avoid, and impassable locations), which are visualized over the map (grayscaled to emphasize the distributions). We show three action generation steps. Each step receives the map cropped around the agent and the previous action, and outputs the next action. The example instruction in the figure is "Okay, pick up yellow hearts and run past me toward the bush sticking out, on the opposite side is 3 green stars".
We use LINGUNET (Misra et al., 2018) to predict the distributions. The inputs to LINGUNET are the instruction embedding $\mathbf{x}$ and featurized world state $F_0$, which is relative to the agent's frame of reference. The outputs are four matrices, each of dimension $W \times H$ corresponding to the environment. LINGUNET is formally defined in Misra et al. (2018) and Appendix D. Roughly speaking, LINGUNET reasons about the environment representation $F_0$ at $L$ levels. First, $F_0$ is used to generate feature maps of decreasing size $F_j$, $j = 1 \dots L$, using a series of convolutions. We create convolution kernels from the instruction representation $\mathbf{x}$, and apply them to the feature maps $F_j$ to generate text-conditioned feature maps $G_j$. Finally, feature maps of increasing size $H_j$ are generated using a series of $L$ deconvolutions. The last deconvolution generates a tensor of size $4 \times W \times H$ with a channel for each of the four distributions. We use a softmax over one channel to compute $p(\rho \mid s_t, \bar{x})$. Because the other distributions are binary, we use a sigmoid on each value independently for the other channels. When computing $p(\text{GOAL} = 1 \mid \rho, s_t, \bar{x})$ and $p(\text{AVOID} = 1 \mid \rho, s_t, \bar{x})$, we mask positions without objects that can be changed (i.e., positions without cards) to assign them zero probability.
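The output activations described above (a softmax over positions for the visitation channel, independent sigmoids for the binary channels, and masking of positions without cards for the goal and avoid channels) can be sketched as follows. The channel order, the function names, and the `has_card` mask format are assumptions for illustration; the sketch starts from the final 4 × W × H tensor and does not implement LINGUNET itself.

```python
import math


def _sigmoid(v: float) -> float:
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def plan_distributions(logits, has_card):
    """Turn a 4 x W x H tensor of scores into the four plan distributions.

    logits: nested lists, channels assumed ordered (visit, goal, avoid, nopass).
    has_card: W x H booleans, True where a position contains a card.
    """
    visit, goal, avoid, nopass = logits

    # Softmax over all W * H positions of the visitation channel.
    peak = max(v for row in visit for v in row)
    exps = [[math.exp(v - peak) for v in row] for row in visit]
    total = sum(e for row in exps for e in row)
    p_visit = [[e / total for e in row] for row in exps]

    def masked_sigmoid(channel):
        # Positions without cards get zero probability.
        return [
            [_sigmoid(v) if card else 0.0 for v, card in zip(row, cards)]
            for row, cards in zip(channel, has_card)
        ]

    return {
        "visit": p_visit,
        "goal": masked_sigmoid(goal),
        "avoid": masked_sigmoid(avoid),
        "nopass": [[_sigmoid(v) for v in row] for row in nopass],
    }
```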
Stage 2: Action Generation. We use the four distributions to generate a sequence of actions. We concatenate the distributions channel-wise to a tensor $\mathbf{P} \in \mathbb{R}^{4 \times W \times H}$. We use a forward LSTM RNN to predict a sequence of actions. At each prediction step $t$, we rotate, transform, and crop $\mathbf{P}$ to generate the egocentric tensor $\mathbf{P}_t \in \mathbb{R}^{N \times C \times C}$, where the agent is always at the center and facing in the same direction, such that $\mathbf{P}_t$ is relative to the agent's current frame of reference. The input to the action generation RNN at time $t$ is computed from $\mathbf{P}_t$ using $\mathrm{CNN}^P$, a convolutional layer; RELU, a non-linearity; NORM, instance normalization (Ulyanov et al., 2017); and the learned weights $W^P_1$, $W^P_2$, $b^P_1$, $b^P_2$. The action probability (Equation 1) is computed using $\mathrm{RNN}^A$, an LSTM RNN; $\phi^A$, a learned action embedding function; $a_0$, a special START action; and the learned weights $W^A$ and $b^A$. During inference, we assign zero probabilities to actions $a$ when $T_w(s_t, a)$ is invalid (Appendix B), for example when an agent would move into an obstacle.
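The inference-time masking of invalid actions can be illustrated with a small helper. The function name and the `is_valid` callback are hypothetical stand-ins for the validity check of Appendix B, and renormalizing the remaining probabilities is an assumption of this sketch rather than something stated in the text.

```python
from typing import Callable, Dict


def mask_invalid_actions(probs: Dict[str, float],
                         is_valid: Callable[[str], bool]) -> Dict[str, float]:
    """Assign zero probability to invalid actions and renormalize the rest."""
    masked = {a: (p if is_valid(a) else 0.0) for a, p in probs.items()}
    total = sum(masked.values())
    if total == 0.0:
        raise ValueError("no valid action has non-zero probability")
    return {a: p / total for a, p in masked.items()}
```

For example, if moving forward would enter an obstacle, its probability mass is redistributed over the remaining actions before the next action is chosen.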

Learning
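We assume access to a set of $N$ recorded interactions $\{\bar{I}^{(i)}\}_{i=1}^{N}$. We create an example $\bar{I}^{(i,j)}$ for each instruction by pairing it with a sequence of state-action tuples: the first action is the first action the follower takes after observing the $j$-th instruction in $\bar{I}^{(i)}$, and the last action $a^{(i,j)}_k$ is the DONE action completing that instruction. We first estimate the parameters for plan prediction $\theta_1$ and action generation $\theta_2$ separately (Section 5.1), and then fine-tune jointly with data augmentation (Section 5.2).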

Pretraining
Stage 1: Plan Prediction. The input of Stage 1 is the world state $s_1$ and the instruction $\bar{x}$ at the head of the queue $\bar{Q}$ (footnote 3). We generate labels for the four output distributions using $\bar{I}^{(i,j)}$. The visitation distribution $p(\rho \mid s_1, \bar{x})$ label is proportional to the number of states $s_t \in \bar{I}^{(i,j)}$ where the follower is in position $\rho$. The goal and avoidance distributions model how the agent plans to manipulate parts of its environment to achieve the specified goals, but avoid manipulating other parts. In CEREALBAR, this translates to changing the status of cards, or avoiding doing so. For $p(\text{GOAL} = 1 \mid \rho, s_1, \bar{x})$, we set the label to 1 for all $\rho$ that contain a card whose selection status the follower changed in $\bar{I}^{(i,j)}$, and 0 for all other positions. Similarly, for the avoidance distribution $p(\text{AVOID} = 1 \mid \rho, s_1, \bar{x})$, the label is 1 for all $\rho$ that have cards that the follower does not change during the interaction $\bar{I}^{(i,j)}$. Finally, for $p(\text{NOPASS} = 1 \mid \rho, s_1, \bar{x})$, the label is 1 for all positions the agent cannot move onto, and zero otherwise. We define four cross-entropy losses: visitation $L_V$, goal $L_G$, avoidance $L_A$, and no passing $L_P$. We also use an auxiliary cross-entropy goal-prediction loss $L'_G$ using a probability $p'_G(\text{GOAL} = 1 \mid \rho, s_1, \bar{x})$ we predict from the pre-LINGUNET representation $\mathbf{S}$ by classifying each position. The complete loss is a weighted sum of these five losses (footnote 4).
Stage 2: Action Generation. We use the gold distributions to create the input $\mathbf{P}$, and optimize towards the annotated set of actions using teacher forcing (Williams and Zipser, 1989). We compute the loss only over actions taken by the follower, where $p(a_t)$ is computed by Equation 1.
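The Stage 1 label construction can be sketched from a recorded follower trajectory. The function and argument names are hypothetical, positions are assumed to be `(x, y)` tuples on a `width` × `height` grid, and the visitation label is normalized to sum to one so that it is a distribution.

```python
from collections import Counter


def plan_labels(follower_positions, card_positions, changed_cards,
                impassable, width, height):
    """Build the four W x H label maps for one instruction example.

    follower_positions: follower position at each state of the example.
    card_positions: positions that contain a card.
    changed_cards: card positions whose selection status the follower changed.
    impassable: positions the agent cannot move onto.
    """
    counts = Counter(follower_positions)
    total = sum(counts.values())

    def grid(fn):
        return [[fn((x, y)) for y in range(height)] for x in range(width)]

    # Visitation: proportional to the number of states spent at each position.
    visit = grid(lambda pos: counts[pos] / total)
    # Goal: cards whose selection status the follower changed.
    goal = grid(lambda pos: 1 if pos in changed_cards else 0)
    # Avoid: cards the follower did not change.
    avoid = grid(lambda pos: 1 if pos in card_positions and pos not in changed_cards else 0)
    # No-pass: positions the agent cannot move onto.
    nopass = grid(lambda pos: 1 if pos in impassable else 0)
    return visit, goal, avoid, nopass
```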

Fine-tuning with Example Aggregation
Simply combining the separately-trained networks together results in low performance. We perform additional fine-tuning with the two stages combined, and introduce a data augmentation method to learn to recover from error propagation.
Error Propagation. Executing a sequence of instructions is susceptible to error propagation, where an agent fails to correctly complete an instruction, and because of it also fails on the following ones. While the collaborative, turn-switching setup allows the leader to adjust their plan following a follower mistake, leaders often strategically issue multiple instructions to use the available follower steps optimally. Given an agent failure, subsequent instructions may not align with the state of the world resulting from the follower's error. In supervised learning, we do not have the opportunity to learn to recover from such errors, even when it is relatively simple. This usually requires exploration. However, conventional frameworks like reinforcement learning (RL) or imitation learning (IL) are poorly suited. In a live interaction, when an agent makes a mistake (e.g., selecting the wrong card), the leader is likely to adjust their actions. Because of this, in a recorded interaction, which contains the leader actions following a correct execution, it is not possible to reliably compute an RL reward for states following erroneous executions. For similar reasons, we cannot compute an IL oracle. We identify two classes of erroneous states in CEREALBAR: (a) not selecting the correct set of cards; and (b) finishing with the right card selection, but stopping at the wrong position. Case (a) requires modifying the model, for example to know when to skip instructions that refer to a state that is no longer possible. We leave this case for future work. We address case (b) by augmenting the data with new examples that are aggregated during learning. Our process is similar to DAGGER (Ross et al., 2011). We alternate between: (a) collecting new training examples using a heuristic oracle, and (b) performing model updates. We generate training examples that demonstrate recovery by starting in an incorrect initial position for an instruction, having arrived there by executing the previous instruction. We train our model to distinguish between the reasoning required for generating implicit actions to correct errors, and explicit actions directly mentioned in the instruction.
Footnote 3: We omit example indices for succinctness.
Footnote 4: Additional details are in Appendix E.1.
Learning with Example Aggregation. We alternate between aggregating a new set of recovery examples $D'$ and updating our parameters. At each epoch, we first use the current policy to create new training examples. We run inference for each example $\bar{I}^{(i,j)}$ in $D$, the original training set, using the current policy. We compare the state $s'$ at the end of execution to the final state in $\bar{I}^{(i,j)}$ to generate an error-recovery example $\bar{I}'^{(i,j+1)}$ for the subsequent example $\bar{I}^{(i,j+1)}$. We only generate such examples if the position or rotation of the agent are different, and there are no other differences between the states. Starting from $s'$, we generate the shortest-path sequence of actions that: (a) changes the cards as specified in $\bar{I}^{(i,j+1)}$, and (b) executes DONE in the same position as in $\bar{I}^{(i,j+1)}$. We then create $\bar{I}'^{(i,j+1)}$ using $\bar{I}^{(i,j+1)}$.
Footnote 7: Appendix E.2 describes this process.
The discriminator classifies each of the $L$ layers in LINGUNET for implicit reasoning. The goal is to encourage implicit reasoning at all levels of reasoning in the first stage. The probability of implicit reasoning for each LINGUNET layer $l$ is computed using $K^{\mathrm{IMP}}_l$, which are $1 \times 1$ learned kernels, and AVGPOOL, which does average pooling. We define a cross-entropy loss $L_{\mathrm{IMP}}$ that averages across the $L$ layers. The complete fine-tuning loss includes $L_{\mathrm{IMP}}$.
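One aggregation pass over the recorded examples can be sketched as below. All names are hypothetical: `run_policy` stands in for running inference with the current policy, `oracle` for the shortest-path heuristic that builds the recovery example, and the world state is reduced to the follower pose plus a frozen set of card statuses. The check mirrors the stated condition that only the agent's position or rotation may differ.

```python
from typing import Callable, FrozenSet, List, NamedTuple, Tuple


class WorldState(NamedTuple):
    """Simplified world state: follower pose plus the status of all cards."""
    position: Tuple[int, int]
    rotation: int
    cards: FrozenSet  # e.g. {((x, y), is_selected), ...}


class Example(NamedTuple):
    instruction: str
    final_state: WorldState  # state at the end of the recorded execution


def needs_recovery_example(predicted: WorldState, gold: WorldState) -> bool:
    """True if only the follower position or rotation differs from the gold state."""
    pose_differs = (predicted.position, predicted.rotation) != (gold.position, gold.rotation)
    return pose_differs and predicted.cards == gold.cards


def aggregate_recovery_examples(games: List[List[Example]],
                                run_policy: Callable[[Example], WorldState],
                                oracle: Callable[[WorldState, Example], object]) -> list:
    """Create recovery examples for the instruction that follows each near-miss."""
    recovery = []
    for game in games:
        for j in range(len(game) - 1):
            end_state = run_policy(game[j])
            if needs_recovery_example(end_state, game[j].final_state):
                # Start the next instruction from the erroneous end state.
                recovery.append(oracle(end_state, game[j + 1]))
    return recovery
```

In training, a pass like this would alternate with parameter updates on the original and aggregated examples at each epoch.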

Cascaded Evaluation
Sequential instruction scenarios are commonly evaluated using recorded interactions by executing individual instructions or executing complete interactions starting from their beginning (e.g., Chen and Mooney, 2011; Long et al., 2016). Both have limitations. Instruction-level metrics ignore error propagation, and do not accurately reflect the system's performance. In contrast, interaction-level metrics do consider error propagation and capture overall system performance well. However, they poorly utilize the test data, especially when performance is relatively low. When early failures lead to unexpected world states, later instructions become impossible to follow, and measuring performance on them is meaningless. For example, with our best-performing model, 82% of development instructions become impossible due to cascading errors when executing complete interactions.
The two measures may also fail to distinguish models. For example, consider an interaction with three instructions. Two models, A and B, successfully execute the third instruction in isolation, but fail on the two others. They also both fail when executing the entire interaction starting from the beginning. According to common measures, the models are equal. However, if model B can actually recover from failing on the second instruction to successfully execute the third, it means it is better than model A. Both metrics fail to reflect this.
We propose cascaded evaluation, an evaluation protocol for sequential instruction using static corpora. Our method utilizes all instructions during testing, while still accounting for the effect of error propagation. Unlike instruction-level evaluation, cascaded evaluation executes the instructions in sequence. However, instead of starting only from the start state of the first instruction, we create separate examples for starting from the starting state of each instruction in the interaction and continuing until the end of the interaction. For example, given a sequence of three instructions $\langle 1, 2, 3 \rangle$, we will create three examples: $\langle 1, 2, 3 \rangle$, $\langle 2, 3 \rangle$, and $\langle 3 \rangle$. To evaluate performance in CEREALBAR, we compute two statistics using cascaded evaluation: the proportion of the remaining instructions followed successfully, and the proportion of potential points scored. We only consider the remaining instructions and points left to achieve in the example. For example, for the sequence $\langle 2, 3 \rangle$, we will subtract any points achieved before the second instruction to compute the proportion of potential points scored. Appendix F describes cascaded evaluation formally.
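The construction of cascaded examples and the instruction-level statistic can be sketched as follows. This is an informal illustration, not the formal definition of Appendix F: the function names are hypothetical, and `run_from(start)` is an assumed callback that executes the interaction from the recorded start state of instruction `start` and reports which of the remaining instructions were followed successfully.

```python
from typing import Callable, List, Sequence


def cascaded_examples(instructions: Sequence) -> List[list]:
    """One example per instruction: start there and continue to the end."""
    return [list(instructions[start:]) for start in range(len(instructions))]


def cascaded_instruction_score(instructions: Sequence,
                               run_from: Callable[[int], List[bool]]) -> float:
    """Mean proportion of remaining instructions followed, over all start points."""
    proportions = []
    for start, remaining in enumerate(cascaded_examples(instructions)):
        successes = run_from(start)  # one flag per executed remaining instruction
        proportions.append(sum(successes) / len(remaining))
    return sum(proportions) / len(proportions)
```

Under this sketch, the two models from the three-instruction example receive different scores: both succeed only on the third instruction in isolation, but a model that also recovers to complete the third instruction when starting from the second is credited for it.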

Experimental Setup
Data. We collect 1,202 human-human interactions using Mechanical Turk, split into train (960 games), development (120), and test (122). Appendix C details data collection and statistics.
Recorded Interactions Metrics. We evaluate instruction-level, interaction-level, and cascaded (Section 6) performance. We allow the follower ten steps per turn, and interleave the actions taken by the leader during each turn in the recorded interaction. Instruction execution often crosses turns. At the instruction-level, we evaluate the mean card state accuracy comparing the state of the cards after inference with the correct card state, environment state accuracy comparing both cards and the agent's final position, and action sequence accuracy comparing the generated action sequence with the correct action sequence. For complete interactions, we measure mean full game points. Finally, for cascaded evaluation, we measure the mean proportion of instructions correctly executed and of possible points scored.
Human Evaluation. We perform evaluation with human leaders, comparing our model and human followers. Workers are told they will work with a human or an automated follower, but are not told which in each game. We evaluate human followers (105 games) and the automated agent (109 games) at the same time. We evaluate the game scores, and also elicit free-form feedback.
Systems. We evaluate three systems: (a) the full model; (b) SEQ2SEQ+ATTN: sequence-to-sequence with attention; and (c) a static oracle that executes the gold sequence of actions in the recorded interaction. We report mean and standard deviation across three trials for development results. We ablate model and learning components, and additionally evaluate the action generator with access to gold plans. On the test set and for human evaluation, we use the model with the highest proportion of points scored. We provide implementation and learning details in Appendix G.

Results
Table 1 shows development and test results, including ablations. We consider the proportion of points scored computed with cascaded evaluation as the main metric. Our complete approach significantly outperforms SEQ2SEQ+ATTN. Key to this difference is the added structure within the model and the direct supervision on it. The results also show the large remaining gap to the static oracle. Our results show how considering error propagation for all available instructions in cascaded evaluation guides different design choices. For example, example aggregation and the implicit discriminator lower performance according to instruction-level metrics, which do not consider error propagation. We see a similar trend for the implicit discriminator when looking at full game points, an interaction-level metric that does not account for performance on over 80% of the data because of error propagation. In contrast, the proportion of points scored computed using cascaded evaluation shows the benefit of both mechanisms.
Our ablations demonstrate the benefit of each model component. All four distributions help. Without the trajectory distribution (-Trajectory distribution), performance drops almost to the level of SEQ2SEQ+ATTN. This indicates the action predictor is not robust enough to construct a path given only the three other disjoint distributions. While the predicted trajectory distribution contains all information necessary to reach the correct cards and goal location, the other three distributions further improve performance. This is likely because redundancy with the trajectory distribution makes the model more robust to noisy predictions in the trajectory distribution. For example, the GOAL distribution guides the agent to move towards goal cards even if the predicted trajectory is discontinuous. The action generation recurrence is also critical (-Action recurrence), allowing the agent to keep track of which locations it already passed when navigating complex paths that branch, loop, or overlap with themselves.
While we observe that each stage separately performs well after pretraining, combining them without fine-tuning (-Fine-tuning) leads to low performance because of the shift in the second stage input. Providing the gold distributions to the action generator illustrates this (+ Gold plan). Removing the early goal auxiliary loss L_G (Section 5.1) leads to a slight drop in performance on all metrics (-Early goal auxiliary). Learning with aggregated recovery examples helps the model learn to recover from errors in previous instructions and increases the proportion of points scored (-Example aggregation). However, without the implicit reasoning discriminator (-Implicit discriminator), the additional examples make learning too difficult, and do not help. Finally, removing the language input (-Instructions) significantly decreases performance, showing that the data is relatively robust to observational biases and language is necessary for the task.
In the human evaluation, we observe a mean of 6.2 points (max of 14) with our follower model, compared to 12.7 (max of 20) with human followers. While this shows there is much room for improvement, it illustrates how human leaders adapt and use the agent effectively. One key strategy of adaptation is to use simplified language that fits the model better. This includes shorter instructions, with 8.5 tokens on average with automated followers compared to 12.3 with humans, and a smaller vocabulary, 578 word types with automated followers and 1037 with humans. In general, human leaders commented that they are able to easily distinguish between automated and human followers, and find working with the automated agent frustrating.
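The two language statistics compared above (mean instruction length in tokens and number of word types) can be computed along the following lines. This is a sketch only: the regular-expression tokenizer is a simple stand-in chosen to keep the example self-contained, not the tokenizer used for the reported numbers, and the function names are hypothetical.

```python
import re
from typing import Iterable, List, Tuple


def tokenize(instruction: str) -> List[str]:
    """Stand-in tokenizer (an assumption): lowercase, then split into
    word-character runs and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", instruction.lower())


def language_stats(instructions: Iterable[str]) -> Tuple[float, int]:
    """Return (mean tokens per instruction, number of distinct word types)."""
    token_lists = [tokenize(text) for text in instructions]
    if not token_lists:
        raise ValueError("need at least one instruction")
    total_tokens = sum(len(tokens) for tokens in token_lists)
    vocabulary = {token for tokens in token_lists for token in tokens}
    return total_tokens / len(token_lists), len(vocabulary)
```

Computing the pair separately for instructions given to automated followers and to human followers yields the kind of comparison reported in the human evaluation.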

Discussion
Our human evaluation highlights several directions for future work. While human leaders adapt to the agent, scoring up to 14 points, there remains a significant gap to collaborations with human followers. Reported errors include getting stuck behind objects, selecting unmentioned cards, going in the wrong direction, and ignoring instructions. At least one worker developed a strategy that took advantage of the agent's full observability, writing instructions with only simple card references. An important direction for future work is to remove our full observability assumption. Other future directions include experimenting with using the interaction history, expanding the learning example aggregation to error cases beyond incorrect start positions, and making agent reasoning interpretable to reduce user frustration. CEREALBAR also provides opportunities to study pragmatic reasoning for language understanding (Andreas and Klein, 2016; Fried et al., 2018; Liang et al., 2019). While we currently focus on language understanding by limiting the communication to be unidirectional, bidirectional communication would allow for more natural and efficient collaborations (Potts, 2012; Ilinykh et al., 2019). CEREALBAR could be easily adapted to allow bidirectional communication, and provide a platform to study challenges in language generation.
This appendix supplements Section 2 with further game design details and discussion of the reasoning behind them.

World View
Figure 3 shows the leader's point of view, and Figure 4 shows the follower's. The leader observes the entire environment, while the follower only has access to a restricted first-person view. The leader can also toggle to an overhead view to see obstructed cards using the camera button, and has access to the follower's current view to aid them in writing instructions that make sense to the follower. Selected cards are outlined in blue for both players. Invalid selections appear in red for the leader only.
This setup makes the follower dependent on the leader, limits the follower's ability to plan the card collection strategy, and encourages collaboration.

Game Progression
The two players switch control of the game by taking turns. During each turn, the follower can take ten ($\Psi_f = 10$) steps while the leader can take five ($\Psi_l = 5$). Allowing the follower more steps than the leader incentivizes delegating lengthier tasks to the follower, such as grabbing multiple cards per turn or moving further away. We do not count actions which do not change the player's location or rotation, such as moving forward into an obstacle, against this limit. We additionally limit the amount of time each player has per turn. This requires players to move quickly without frustrating their partner by taking a long time, and additionally limits the maximum time per game. Both players begin with six turns each. The game ends when the players run out of turns. The leader turn ends once they press the end turn button or after 45 seconds. The end turn button is disabled as long as there are no instructions in the follower queue, to nudge the leader to use the follower if time allows it. The allotted 45 seconds allow the leader sufficient time to move, plan, and write instructions. During the leader's turn, they can add any number of new instructions to the queue.
The follower only receives control if there are instructions in the queue. If the queue is empty when the leader finishes their turn, the follower's turn is skipped, but the number of turns remaining still decreases. The follower's turn ends automatically when they run out of steps, after 15 seconds, or when they complete all instructions in the queue. During the follower's turn, they can mark any number of instructions as complete using the DONE action. The follower sees the current and previous instructions, even if there are more instructions in the queue. They must mark the current instruction as complete before seeing the next. This is done to simplify the reasoning available to the follower, for example, to avoid cases where the follower skips a command based on future ones. Because there may be more future instructions in the queue, this incentivizes the follower to not waste moves in the current instruction and be as efficient as possible. During data collection, this provides alignment of actions to instructions because it prohibits a follower from taking actions aligning with a future instruction without marking the current instruction as complete. Without instruction completion annotation, the problem of alignment between instructions and actions becomes much more difficult when processing the recorded interactions.

Scoring Points
When a valid set is made, the selected cards disappear, and three cards are randomly generated and placed on the grid such that the new grid contains at least one valid set. The two players earn a point, and are given extra turns. The number of added turns decays as they complete more sets, eventually reaching zero added turns. The maximum possible number of turns in a game is 65. In the training data, 454 games reached this number of turns. Adding extra turns when a set is made allows us to collect more data from games that are going well.
It also allows us to pay players based on the number of sets completed, and incentivizes them to play as well as possible. If a game is going poorly, e.g., if the pair fails to earn a point in the first six turns, the game will end early. However, if the game is going well, implying the pair is collaborating well, the game will continue for longer, and will contain a longer sequence of instructions.
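The validity check that triggers scoring follows directly from the definition of a valid set given earlier (three cards with distinct color, shape, and count). The sketch below is illustrative only; the `Card` type, its field names, and the example attribute values are assumptions and are not taken from the game implementation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Card:
    """Hypothetical card with the three attributes that define a set."""
    color: str
    shape: str
    count: int


def is_valid_set(selected: Sequence[Card]) -> bool:
    """True if exactly three cards are selected and they are pairwise
    distinct in color, in shape, and in count."""
    if len(selected) != 3:
        return False
    return (
        len({card.color for card in selected}) == 3
        and len({card.shape for card in selected}) == 3
        and len({card.count for card in selected}) == 3
    )
```

For example, `is_valid_set([Card("red", "square", 1), Card("blue", "star", 2), Card("green", "heart", 3)])` holds, while replacing the last card with `Card("red", "heart", 3)` breaks the set because two cards share a color.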

B CEREALBAR Transition Function
The transition function in CEREALBAR, $T : S \times \Gamma \times A \to S \times \Gamma$, is formally defined in Table 2. Each of the rules in the table is additionally associated with a domain over which it is not defined, for example when $\alpha = \text{Follower}$ and $a \in X$ (i.e., the follower cannot give instructions). The rules are:
Rule 1: When an instruction is issued, it is added to the end of the queue. This action does not use a step.
Rule 2: When the leader ends their turn, and the queue is not empty, control switches to the follower, and the number of steps remaining in the turn is the maximum number for the follower, $\Psi_f$.
Rule 3: When the leader ends their turn, and the queue is empty, control does not switch to the follower; instead, a new leader turn begins with $\Psi_l$ available steps.
Rule 4: When the leader runs out of remaining steps, control does not immediately switch to the follower. This allows the leader to issue more instructions before manually ending their turn or before their time runs out.
Rule 5: When the follower marks an instruction as finished, and more instructions remain in the queue, the current instruction at the head of the queue is removed. This action does not use a step.
Rule 6: When the follower marks an instruction as finished, if the finished instruction was the last in the queue, control automatically switches to the leader with $\Psi_l$ remaining steps.
Rule 7: When the follower runs out of steps in their turn, control immediately switches to the leader with $\Psi_l$ remaining steps.
Rule 8: Both agents can take actions which modify the world state $s$. Each such action $a \in A_w$ costs a step. We assume access to a domain-specific transition function, $T_w : S \times A_w \to S$, that describes how an environment action modifies the environment.
There may exist combinations of states and actions for which $T_w$ is not defined; for example, an agent moving forward onto an obstacle. Additionally, for all $s \in S$ and $a \in A_w$, $T(s, Q, \text{Leader}, 0, a)$ results in an invalid state because, while the leader can still issue instructions after running out of steps, they cannot move.
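The eight rules can be read as a small state machine over the instruction queue, the acting player, and the remaining steps. The sketch below is one possible rendering under stated assumptions, not the game's implementation: the names `GameState`, `Instruct`, `END_TURN`, `DONE`, and `world_step` are hypothetical, issuing an instruction is assumed to cost no step, undefined cases raise `ValueError`, and turn counts and time limits are left out.

```python
from dataclasses import dataclass
from typing import Callable, Tuple, TypeVar, Union

W = TypeVar("W")  # world state type; its contents are left abstract here

LEADER, FOLLOWER = "Leader", "Follower"
PSI_L, PSI_F = 5, 10  # steps per turn for the leader and the follower
END_TURN, DONE = "END_TURN", "DONE"  # hypothetical action names


@dataclass(frozen=True)
class Instruct:
    """Hypothetical action wrapper: the leader issues an instruction."""
    text: str


@dataclass(frozen=True)
class GameState:
    queue: Tuple[str, ...]  # pending instructions, head first
    turn: str               # LEADER or FOLLOWER
    steps: int              # steps remaining in the current turn


Action = Union[Instruct, str]  # any other string is treated as a world action


def transition(world: W, game: GameState, action: Action,
               world_step: Callable[[W, str], W]) -> Tuple[W, GameState]:
    queue, turn, steps = game.queue, game.turn, game.steps
    if isinstance(action, Instruct):
        if turn != LEADER:
            raise ValueError("the follower cannot give instructions")
        return world, GameState(queue + (action.text,), turn, steps)  # Rule 1
    if action == END_TURN:
        if turn != LEADER:
            raise ValueError("END_TURN is modeled for the leader only")
        if queue:
            return world, GameState(queue, FOLLOWER, PSI_F)  # Rule 2
        return world, GameState(queue, LEADER, PSI_L)  # Rule 3
    if action == DONE:
        if turn != FOLLOWER or not queue:
            raise ValueError("DONE needs a follower turn and a queued instruction")
        rest = queue[1:]
        if rest:
            return world, GameState(rest, FOLLOWER, steps)  # Rule 5
        return world, GameState(rest, LEADER, PSI_L)  # Rule 6
    # Rule 8: a world action costs one step and is applied through world_step.
    if steps <= 0:
        raise ValueError("no steps remaining, so the player cannot move")
    new_world = world_step(world, action)
    steps -= 1
    if turn == FOLLOWER and steps == 0:
        return new_world, GameState(queue, LEADER, PSI_L)  # Rule 7
    return new_world, GameState(queue, turn, steps)  # Rule 4: leader keeps control
```

Passing `world_step` as an argument keeps the game-level bookkeeping separate from the domain-specific environment dynamics, mirroring the split between $T$ and $T_w$ in the text.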

Figures 3 and 4 show the leader's and follower's interfaces.

Crowdsourcing Management
We use a qualification task to both teach workers how to play the game and to mark workers as qualified for our main task. We restrict those who can qualify to workers located in majority English-speaking countries with at least 90% approved HITs and at least 100 completed HITs. The qualification task has three components: an interactive tutorial for the leader role, an interactive tutorial for the follower role, and a short quiz about the gameplay. In both tutorials, turn-switching is disabled and workers have an unlimited number of moves to use to complete the tutorial. Each tutorial uses the same map. This allows us to pre-program instructions for the tutorials.
In the leader tutorial, the worker has access to the full game board. They are asked to send a command to the follower, and are instructed via in-game prompts to collect a specific set of cards. Finally, they are asked to collect two more sets in the environment that are valid. Workers who send a command and collect a total of three sets successfully complete this tutorial.
In the follower tutorial, the worker has access only to the follower view. Pre-written commands are issued to the worker, and they must follow them one-by-one to complete a set. The commands include an example of the leader correcting a set-planning mistake. If the worker marks all commands as finished and successfully collects one set, the follower tutorial is complete.
Finally, workers are asked to read the game instructions and complete a short quiz. They are asked questions regarding the validity of card sets, the responsibilities of both players, and how each game ends.
We maintain two groups of workers split by experience with the game, and use separate pools of HITs for each. A worker can join the expert pool if they have shown they understand how to play as a leader and as a follower through at least one game each. This allows new players to learn the game rules without frustrating expert players. At the end of data collection, 95 workers were in the expert pool while 169 were in the non-expert pool, for a total of 264 participating workers.
We pay workers a bonus per point they earn, increasing the bonus as more points are earned, in addition to a base pay of per game. We do not pay leaders and followers differently. The median game cost was $5.80.

The CEREALBAR Dataset
In total, we collect 1,526 games played by both experts and nonexperts. Of these, we keep 1,202 (78.8%) games, comprising 23,979 total instructions, discarding those where no instructions were completed, or where the alignment between instructions and actions was suspected to be low quality. For example, we removed interactions with a low proportion of instructions marked as complete, or with very long action sequences from the follower, both of which indicate the follower did not properly complete instructions.
When splitting the data, we ensured the mean score between the three splits was roughly the same. Table 3 shows basic statistics of the data we collected after pruning. 82.6% of post-pruning games are from the expert pool. In the training set, the mean number of completed instructions is 19.9 and the median is 24.0. 83.3% of games have a score greater than zero. We include games with a score of zero if the alignment between instructions and actions is high-quality according to our pruning heuristics. The vocabulary size is computed by lowercasing all word types and tokenizing using the NLTK word tokenizer. Our dataset contains longer interactions than several existing datasets for sequential instruction following and interaction (e.g., Chen and Mooney, 2011; Long et al., 2016; He et al., 2017; de Vries et al., 2018).
The inputs to LINGUNET are the environment representation $F_0$ and the instruction representation $x$. LINGUNET consists of three major stages: a series of convolutions on $F_0$, a series of text-based convolutions derived from $x$, and a series of transposed convolutions to form a final prediction. The output of LINGUNET is a feature map with the same width and height as $F_0$. Each stage has the same number of operations, which we refer to as the depth $L$. First, a series of $L$ convolutional layers is applied to $F_0$. Each layer at depth $l$ is a sequence of two convolution operations separated by a leaky ReLU non-linearity:
$F_l = \mathrm{NORM}(\mathrm{RELU}(\mathrm{RELU}(F_{l-1} * K^C_l) * K^C_l))$.
We use a stride of two when convolving with $K^C_l$, and do not apply NORM when $l = L$.
In the second stage, the instruction representation $x$ is split into $L$ segments $x_l$ such that $x = [x_1; \ldots; x_L]$ and segments have equal length. Each segment is mapped to a $1 \times 1$ kernel $K^I_l$ using learned weights $W^I_l$ and biases $b^I_l$. $K^I_l$ is normalized and used to convolve over $F_l$: