Mapping Instructions and Visual Observations to Actions with Reinforcement Learning

We propose to directly map raw visual observations and text input to actions for instruction execution. While existing approaches assume access to structured environment representations or use a pipeline of separately trained models, we learn a single model to jointly reason about linguistic and visual input. We use reinforcement learning in a contextual bandit setting to train a neural network agent. To guide the agent’s exploration, we use reward shaping with different forms of supervision. Our approach does not require intermediate representations, planning procedures, or training different models. We evaluate in a simulated environment, and show significant improvements over supervised learning and common reinforcement learning variants.


Introduction
An agent executing natural language instructions requires robust understanding of language and its environment. Existing approaches addressing this problem assume structured environment representations (e.g., Chen and Mooney, 2011; Mei et al., 2016), or combine separately trained models (e.g., Matuszek et al., 2010; Tellex et al., 2011), including for language understanding and visual reasoning. We propose to directly map text and raw image input to actions with a single learned model. This approach offers multiple benefits, such as not requiring intermediate representations, planning procedures, or training multiple models. Figure 1 illustrates the problem in the Blocks environment (Bisk et al., 2016). The agent observes the environment as an RGB image using a camera sensor.

Figure 1: Instructions in the Blocks environment. All five instructions describe the same task: Put the Toyota block in the same row as the SRI block, in the first open space to the right of the SRI block; Move Toyota to the immediate right of SRI, evenly aligned and slightly separated; Move the Toyota block around the pile and place it just to the right of the SRI block; Place Toyota block just to the right of The SRI Block; Toyota, right side of SRI. Given the observed RGB image of the start state (large image), our goal is to execute such instructions. In this task, the direct-line path to the target position is blocked, and the agent must plan and move the Toyota block around. The small image marks the target and an example path, which includes 34 steps.

Given the RGB input, the agent must recognize the blocks and their layout. To understand the instruction, the agent must identify the block to move (Toyota block) and the destination (just right of the SRI block). This requires solving semantic and grounding problems. For example, consider the topmost instruction in the figure. The agent needs to identify the phrase referring to the block to move, Toyota block, and ground it. It must resolve and ground the phrase SRI block as a reference position, which is then modified by the spatial meaning recovered from the same row as or first open space to the right of, to identify the goal position. Finally, the agent needs to generate actions, for example moving the Toyota block around obstructing blocks.
To address these challenges with a single model, we design a neural network agent. The agent executes instructions by generating a sequence of actions. At each step, the agent takes as input the instruction text, observes the world as an RGB image, and selects the next action. Action execution changes the state of the world. Given an observation of the new world state, the agent selects the next action. This process continues until the agent indicates execution completion. When selecting actions, the agent jointly reasons about its observations and the instruction text. This enables decisions based on close interaction between observations and linguistic input.
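The execution loop described above can be sketched as follows. ScriptedPolicy and GridEnv are toy stand-ins (a single point moving on a grid under a fixed plan), not the paper's learned policy or simulator; they only illustrate the observe-select-execute cycle that ends with STOP.

```python
class ScriptedPolicy:
    """Hypothetical stand-in for the learned policy: replays a fixed plan."""
    def __init__(self, plan):
        self.plan = iter(plan)

    def select(self, instruction, image):
        # A learned policy would jointly reason over text and pixels here.
        return next(self.plan, "STOP")

class GridEnv:
    """Toy stand-in for the Blocks simulator (not the paper's API)."""
    def reset(self):
        return (0, 0)

    def render_rgb(self, state):
        return state  # a real environment would return an RGB image

    def step(self, state, action):
        dx, dy = {"east": (1, 0), "west": (-1, 0),
                  "north": (0, 1), "south": (0, -1)}[action]
        return (state[0] + dx, state[1] + dy)

def execute_instruction(policy, instruction, env, max_steps=40):
    state = env.reset()
    trace = []
    for _ in range(max_steps):
        image = env.render_rgb(state)            # agent sees observations only
        action = policy.select(instruction, image)
        trace.append(action)
        if action == "STOP":                     # agent signals completion
            break
        state = env.step(state, action)          # world state transitions
    return state, trace

final, trace = execute_instruction(
    ScriptedPolicy(["east", "east", "north"]),
    "move right twice, up once", GridEnv())
```

Note that only the policy's inputs are observations; the world state itself stays inside the environment, matching the separation described above.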
We train the agent with different levels of supervision, including complete demonstrations of the desired behavior and annotations of the goal state only. While the learning problem can easily be cast as a supervised learning problem, learning only from the states observed in the training data results in poor generalization and failure to recover from test errors. We use reinforcement learning (Sutton and Barto, 1998) to observe a broader set of states through exploration. Following recent work in robotics (e.g., Levine et al., 2016; Rusu et al., 2016), we assume the training environment, in contrast to the test environment, is instrumented and provides access to the state. This enables a simple problem reward function that uses the state and provides positive reward on task completion only. This type of reward offers two important advantages: (a) it is a simple way to express the ideal agent behavior we wish to achieve, and (b) it creates a platform for adding information from the training data.
We use reward shaping (Ng et al., 1999) to exploit the training data and add additional information to the reward. The modularity of shaping allows varying the amount of supervision, for example by using complete demonstrations for only a fraction of the training examples. Shaping also naturally associates actions with immediate reward. This enables learning in a contextual bandit setting (Auer et al., 2002; Langford and Zhang, 2007), where optimizing the immediate reward is sufficient and has better sample complexity than unconstrained reinforcement learning (Agarwal et al., 2014).
We evaluate with the block world environment and data of Bisk et al. (2016), where each instruction moves one block (Figure 1). While the original task focused on source and target prediction only, we build an interactive simulator and formulate the task of predicting the complete sequence of actions. At each step, the agent must select between 81 actions, with 15.4 steps required to complete a task on average, significantly more than existing environments (e.g., Chen and Mooney, 2011). Our experiments demonstrate that our reinforcement learning approach effectively reduces execution error by 24% over standard supervised learning and 34-39% over common reinforcement learning techniques. Our simulator, code, models, and execution videos are available at: https://github.com/clic-lab/blocks.

Technical Overview
Task Let X be the set of all instructions, S the set of all world states, and A the set of all actions. An instruction x̄ ∈ X is a sequence ⟨x_1, …, x_n⟩, where each x_i is a token. The agent executes instructions by generating a sequence of actions, and indicates execution completion with the special action STOP. Action execution modifies the world state following a transition function T : S × A → S. The execution ē of an instruction x̄ starting from s_1 is an m-length sequence ⟨(s_1, a_1), …, (s_m, a_m)⟩, where s_j ∈ S, a_j ∈ A, T(s_j, a_j) = s_{j+1}, and a_m = STOP. In Blocks (Figure 1), a state specifies the positions of all blocks. For each action, the agent moves a single block on the plane in one of four directions (north, south, east, or west). There are 20 blocks, and 81 possible actions at each step, including STOP. For example, to correctly execute the instructions in the figure, the agent's likely first action is TOYOTA-WEST, which moves the Toyota block one step west. Blocks cannot move over or through other blocks.

Model The agent observes the world state via a visual sensor (i.e., a camera). Given a world state s, the agent observes an RGB image I generated by the function IMG(s). We distinguish between the world state s and the agent context s̃, which includes the instruction, the observed image IMG(s), images of previous states, and the previous action. To map instructions to actions, the agent reasons about agent contexts to generate a sequence of actions. At each step, the agent generates a single action. We model the agent with a neural network policy. At each step j, the network takes as input the current agent context s̃_j, and predicts the next action to execute, a_j. We formally define the agent context and model in Section 4.

Learning We assume access to training data {(x̄^(i), s_1^(i), ē^(i))}_{i=1}^N, where x̄^(i) is an instruction, s_1^(i) is a start state, and ē^(i) is an execution demonstration of x̄^(i) starting at s_1^(i).
We use policy gradient (Section 5) with reward shaping derived from the training data to increase learning speed and exploration effectiveness (Section 6). Following work in robotics (e.g., Levine et al., 2016), we assume an instrumented environment with access to the world state to compute the reward during training only. We define our approach in general terms with demonstrations, but also experiment with training using goal states.

Evaluation We evaluate task completion error on a test set {(x̄^(i), s_1^(i), s_g^(i))}, where x̄^(i) is an instruction, s_1^(i) is a start state, and s_g^(i) is the goal state. We measure execution error as the distance between the final execution state and s_g^(i).
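As a quick sanity check on the action-space arithmetic above, a minimal sketch; the block labels are illustrative stand-ins (the corpus marks blocks with logos and digits):

```python
from itertools import product

# Hypothetical block labels; the real environment uses logos and digits.
blocks = [f"block-{i}" for i in range(20)]
directions = ["north", "south", "east", "west"]

# Every (block, direction) pair, plus the special STOP action.
actions = [f"{b}-{d}" for b, d in product(blocks, directions)] + ["STOP"]

print(len(actions))  # 20 blocks x 4 directions + STOP = 81
```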

Related Work
Learning to follow instructions was studied extensively with structured environment representations, including with semantic parsing (Chen and Mooney, 2011; Kim and Mooney, 2012, 2013; Artzi and Zettlemoyer, 2013; Artzi et al., 2014a,b; Misra et al., 2015, 2016), alignment models (Andreas and Klein, 2015), reinforcement learning (Branavan et al., 2009, 2010; Vogel and Jurafsky, 2010), and neural network models (Mei et al., 2016). In contrast, we study the problem of an agent that takes as input instructions and raw visual input. Instruction following with visual input was studied with pipeline approaches that use separately learned models for visual reasoning (Matuszek et al., 2010, 2012; Tellex et al., 2011; Paul et al., 2016). Rather than decomposing the problem, we adopt a single-model approach and learn from instructions paired with demonstrations or goal states. Our work is related to Sung et al. (2015). While they use sensory input to select and adjust a trajectory observed during training, we are not restricted to training sequences. Executing instructions in non-learning settings has also received significant attention (e.g., Winograd, 1972; Webber et al., 1995; MacMahon et al., 2006).
Our work is related to a growing interest in problems that combine language and vision, including visual question answering (e.g., Antol et al., 2015; Andreas et al., 2016b,a), caption generation (e.g., Chen et al., 2015; Xu et al., 2015), and visual reasoning (Johnson et al., 2016; Suhr et al., 2017). We address the prediction of the next action given a world image and an instruction.
Reinforcement learning with neural networks has been used for various NLP tasks, including text-based games (Narasimhan et al., 2015; He et al., 2016), information extraction (Narasimhan et al., 2016), co-reference resolution (Clark and Manning, 2016), and dialog.
Neural network reinforcement learning techniques have recently been studied for behavior learning tasks, including playing games (Mnih et al., 2013, 2015, 2016) and solving memory puzzles (Oh et al., 2016). In contrast to this line of work, our data is limited. Observing new states in a computer game simply requires playing it. However, our agent also considers natural language instructions. As the set of instructions is limited to the training data, the set of agent contexts seen during learning is constrained. We address the data efficiency problem by learning in a contextual bandit setting, which is known to be more tractable (Agarwal et al., 2014), and by using reward shaping to increase exploration effectiveness. Zhu et al. (2017) address generalization of reinforcement learning to new target goals in visual search by providing the agent an image of the goal state. We address a related problem. However, we provide natural language, and the agent must learn to recognize the goal state.
Reinforcement learning is extensively used in robotics (Kober et al., 2013). Similar to recent work on learning neural network policies for robot control (Levine et al., 2016;Schulman et al., 2015;Rusu et al., 2016), we assume an instrumented training environment and use the state to compute rewards during learning. Our approach adds the ability to specify tasks using natural language.

Model
We model the agent policy π with a neural network. The agent observes the instruction and an RGB image of the world. Given a world state s, the image I is generated using the function IMG(s). The instruction execution is generated one step at a time. At each step j, the agent observes an image I_j of the current world state s_j and the instruction x̄, predicts the action a_j, and executes it to transition to the next state s_{j+1}.

Figure 2: Illustration of the policy architecture, showing the 10th step in the execution of the instruction Place the Toyota east of SRI in the state from Figure 1. The network takes as input the instruction x̄, an image of the current state I_10, images of the previous states I_8 and I_9 (with K = 2), and the previous action a_9. The text and images are embedded with an LSTM and a CNN. Actions are selected with a task-specific multi-layer perceptron.
This process continues until STOP is predicted and the agent stops, indicating instruction completion.
The agent also has access to K images of previous states and the previous action to distinguish between different stages of the execution (Mnih et al., 2015). Figure 2 illustrates our architecture. Formally, at step j, the agent considers an agent context s̃_j, which is a tuple (x̄, I_j, I_{j−1}, …, I_{j−K}, a_{j−1}), where x̄ is the natural language instruction, I_j is an image of the current world state, the images I_{j−1}, …, I_{j−K} represent the K previous states, and a_{j−1} is the previous action. The agent context includes information about the current state and the execution. Considering the previous action a_{j−1} allows the agent to avoid repeating failed actions, for example when trying to move in the direction of an obstacle. In Figure 2, the agent is given the instruction Place the Toyota east of SRI, is at the 10th execution step, and considers K = 2 previous images.
We generate continuous vector representations for all inputs, and jointly reason about both text and image modalities to select the next action. We use a recurrent neural network (RNN; Elman, 1990) with a long short-term memory (LSTM; Hochreiter and Schmidhuber, 1997) recurrence to map the instruction x̄ = ⟨x_1, …, x_n⟩ to a vector representation x. Each token x_i is mapped to a fixed-dimensional vector with the learned embedding function ψ(x_i). The instruction representation x is computed by applying the LSTM recurrence to generate a sequence of hidden states l_i = LSTM(ψ(x_i), l_{i−1}), and computing the mean x = (1/n) Σ_{i=1}^n l_i (Narasimhan et al., 2015). The current image I_j and previous images I_{j−1}, …, I_{j−K} are concatenated along the channel dimension and embedded with a convolutional neural network (CNN) to generate the visual state v_j (Mnih et al., 2013). The last action a_{j−1} is embedded with the function ψ_a(a_{j−1}). The vectors v_j, x, and ψ_a(a_{j−1}) are concatenated to create the agent context vector representation s̃_j.

To compute the action to execute, we use a feedforward perceptron that decomposes according to the domain actions. This computation selects the next action conditioned on the instruction text and observations from both the current world state and recent history. In the block world domain, where actions decompose to selecting the block to move and the direction, the network computes block and direction probabilities. Formally, we decompose an action a into a direction a^D and a block a^B. The feedforward network computes separate direction and block scores from s̃_j, normalized with a softmax into the distributions P(a_j^D | s̃_j) and P(a_j^B | s̃_j), and the action probability is the product of the component probabilities:

P(a_j = ⟨a^D, a^B⟩ | s̃_j) = P(a_j^D = a^D | s̃_j) · P(a_j^B = a^B | s̃_j) .

At the beginning of execution, the first action a_0 is set to the special value NONE, and previous images are zero matrices. The embedding function ψ is a learned matrix. The function ψ_a concatenates the embeddings of a^D_{j−1} and a^B_{j−1}, which are obtained from learned matrices, to compute the embedding of a_{j−1}.
The model parameters θ include the parameters of the perceptron, the parameters of the LSTM recurrence, the parameters of the convolutional network CNN, and the embedding matrices. In our experiments (Section 7), all parameters are learned without external resources.
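The factored action distribution can be sketched as follows. This is a deliberately simplified stand-in, not the paper's architecture: the LSTM encoder is replaced by a bag-of-words mean over word embeddings, the CNN by a precomputed visual feature vector, and the perceptron by single linear layers per head; all dimensions are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy dimensions (illustrative, not the paper's hyperparameters).
VOCAB, D_EMB, D_IMG, N_BLOCKS, N_DIRS = 50, 8, 16, 20, 4

params = {
    "emb": rng.normal(size=(VOCAB, D_EMB)),           # word embeddings psi
    "W_b": rng.normal(size=(D_EMB + D_IMG, N_BLOCKS)),  # block head
    "W_d": rng.normal(size=(D_EMB + D_IMG, N_DIRS)),    # direction head
}

def policy(token_ids, visual_state):
    """Factored policy P(a) = P(block) * P(direction), computed from a
    joint text-vision context vector (LSTM/CNN replaced by simple stand-ins)."""
    x_bar = params["emb"][token_ids].mean(axis=0)   # instruction vector
    ctx = np.concatenate([x_bar, visual_state])     # agent-context vector
    p_block = softmax(ctx @ params["W_b"])
    p_dir = softmax(ctx @ params["W_d"])
    return p_block, p_dir

p_b, p_d = policy([3, 7, 7, 1], rng.normal(size=D_IMG))
```

Because the two heads are independent softmaxes, the outer product of the two distributions is itself a valid distribution over the 80 block-direction actions.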

Learning
We use policy gradient for reinforcement learning (Williams, 1992) to estimate the parameters θ of the agent policy. We assume access to a training set of N examples {(x̄^(i), s_1^(i), ē^(i))}_{i=1}^N, where s_1^(i) is a start state and ē^(i) is an execution demonstration of instruction x̄^(i) starting from s_1^(i). The main learning challenge is learning how to execute instructions given raw visual input from relatively limited data. We learn in a contextual bandit setting, which provides theoretical advantages over general reinforcement learning. In Section 8, we verify this empirically.

Reward Function The instruction execution problem defines a simple problem reward to measure task completion. The agent receives a positive reward when the task is completed, a negative reward for incorrect completion (i.e., STOP in the wrong state) and for actions that fail to execute (e.g., when the direction is blocked), and a small penalty otherwise, which induces a preference for shorter trajectories. To compute the reward, we assume access to the world state. This learning setup is inspired by work in robotics, where it is achieved by instrumenting the training environment (Section 3). The agent, on the other hand, only uses the agent context (Section 4). When deployed, the system relies on visual observations and natural language instructions only. The reward function R^(i) : S × A → R is defined for each training example (x̄^(i), s_1^(i), ē^(i)):

R^(i)(s, a) =   1.0   if a = STOP and s = s_{m^(i)}^(i)
               −1.0   if a = STOP and s ≠ s_{m^(i)}^(i), or a fails to execute
               −δ     otherwise,

where m^(i) is the length of ē^(i), s_{m^(i)}^(i) is the final (goal) state of the demonstration, and δ is a small per-step penalty. The reward function does not provide intermediate positive feedback to the agent for actions that bring it closer to its goal. When the agent explores randomly early during learning, it is unlikely to encounter the goal state due to the large number of steps required to execute tasks. As a result, the agent does not observe positive reward and fails to learn. In Section 6, we describe how reward shaping, a method to augment the reward with additional information, is used to take advantage of the training data and address this challenge.
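A minimal sketch of the example-specific problem reward described above; the ±1 magnitudes and δ = 0.02 are assumed values, and the `blocked` flag stands in for the simulator's failed-action signal:

```python
def problem_reward(state, action, goal, delta=0.02, blocked=False):
    """Sparse problem reward: +1 for stopping at the goal, -1 for a wrong
    STOP or a failed action, and a small penalty -delta otherwise (the
    magnitudes and delta are assumed, not taken from the paper)."""
    if action == "STOP":
        return 1.0 if state == goal else -1.0
    if blocked:  # e.g. the chosen direction is obstructed by another block
        return -1.0
    return -delta  # induces a preference for shorter trajectories
```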
Policy Gradient Objective We adapt the policy gradient objective defined by Sutton et al. (1999) to multiple starting states and reward functions:

J(θ) = Σ_{i=1}^N V^π_{R^(i)}(s_1^(i)) ,

where V^π_{R^(i)}(s_1^(i)) is the value given by R^(i) starting from s_1^(i) under the policy π. The summation expresses the goal of learning a behavior parameterized by natural language instructions.

Contextual Bandit Setting In contrast to most policy gradient approaches, we apply the objective in a contextual bandit setting, where immediate reward is optimized rather than total expected reward. The primary theoretical advantage of contextual bandits is much tighter sample complexity bounds, comparing upper bounds for contextual bandits (Langford and Zhang, 2007), even with an adversarial sequence of contexts (Auer et al., 2002), to lower bounds (Krishnamurthy et al., 2016) or upper bounds (Kearns et al., 1999) for total reward maximization. This property is particularly suitable for the few-sample regime common in natural language problems. While reinforcement learning with neural network policies is known to require large amounts of training data (Mnih et al., 2015), the limited number of training sentences constrains the diversity and volume of agent contexts we can observe during training. Empirically, this translates to poor results when optimizing the total reward (the REINFORCE baseline in Section 8). To derive the approximate gradient, we use the likelihood ratio method:

∇_θ J = E[∇_θ log π(s̃, a) R^(i)(s, a)] ,

where the reward is computed from the world state but the policy is learned on the agent context. We approximate the gradient using sampling. This training regime, where immediate reward optimization is sufficient to optimize the policy parameters θ, is enabled by the shaped reward we introduce in Section 6. While the objective is designed to work best with the shaped reward, the algorithm remains the same for any choice of reward definition, including the original problem reward or the several possibilities formed by reward shaping.
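The immediate-reward likelihood-ratio update can be sketched on a toy linear softmax policy with a fixed context; the dimensions, learning rate, and reward are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def bandit_pg_step(theta, context, reward_fn, lr=0.1):
    """One contextual-bandit policy-gradient step: sample a ~ pi(.|context),
    then ascend grad_theta log pi(a|context) * R(a). Only the immediate
    reward is used -- no return is accumulated over a trajectory."""
    probs = softmax(theta @ context)
    a = rng.choice(len(probs), p=probs)
    r = reward_fn(a)
    # Gradient of log softmax w.r.t. the logits: one_hot(a) - probs.
    grad_logits = -probs
    grad_logits[a] += 1.0
    theta += lr * r * np.outer(grad_logits, context)
    return a, r

theta = np.zeros((3, 4))   # 3 actions, 4 context features
context = np.ones(4)
for _ in range(300):
    # Immediate reward: +1 for the "correct" action 2, small penalty otherwise.
    bandit_pg_step(theta, context, lambda a: 1.0 if a == 2 else -0.1)
probs = softmax(theta @ context)
```

After a few hundred steps the policy concentrates on the rewarded action, illustrating that optimizing immediate reward alone suffices when the reward is informative at every step.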
Entropy Penalty We observe that early in training, the agent is overwhelmed with negative reward and rarely completes the task. This results in the policy π rapidly converging towards a suboptimal deterministic policy with an entropy of 0. To delay premature convergence, we add an entropy term to the objective (Williams and Peng, 1991; Mnih et al., 2016). The entropy term encourages a uniform distribution policy, and in practice stimulates exploration early during training. The regularized gradient is:

∇_θ J = E[∇_θ log π(s̃, a) R^(i)(s, a) + λ∇_θ H(π(s̃, ·))] ,

where H(π(s̃, ·)) is the entropy of π given the agent context s̃, and λ is a hyperparameter that controls the strength of the regularization. While the entropy term delays premature convergence, it does not eliminate it. Similar issues are observed for vanilla policy gradient (Mnih et al., 2016).

Algorithm 1: Policy gradient learning.
Input: training data {(x̄^(i), s_1^(i), ē^(i))}_{i=1}^N, learning rate µ, epochs T, horizon J, and entropy regularization term λ.
Definitions: IMG(s) is a camera sensor that reports an RGB image of state s. π is a probabilistic neural network policy parameterized by θ, as described in Section 4. EXECUTE(s, a) executes the action a at the state s, and returns the new state. R^(i) is the reward function for example i. ADAM(∆) applies a per-feature learning rate to the gradient ∆ (Kingma and Ba, 2014).
Output: Policy parameters θ.
 1: » Iterate over the training data.
 2: for t = 1 to T, i = 1 to N do
 3:   I_{1−K}, …, I_0 = 0
 4:   a_0 = NONE, s_1 = s_1^(i)
 5:   j = 1
 6:   » Rollout up to the episode limit.
 7:   while j ≤ J and a_{j−1} ≠ STOP do
 8:     » Observe the world and construct the agent context.
 9:     I_j = IMG(s_j)
10:     s̃_j = (x̄^(i), I_j, I_{j−1}, …, I_{j−K}, a_{j−1})
11:     » Sample an action from the policy.
12:     a_j ∼ π(s̃_j, ·)
13:     s_{j+1} = EXECUTE(s_j, a_j)
14:     » Compute the approximate gradient.
15:     ∆_j ← ∇_θ log π(s̃_j, a_j) R^(i)(s_j, a_j) + λ∇_θ H(π(s̃_j, ·))
16:     j = j + 1
17:   » Update the parameters with the mean of the gradients.
18:   θ ← ADAM((1/(j−1)) Σ_{j'<j} ∆_{j'})

Algorithm Algorithm 1 shows our learning algorithm. We iterate over the data T times. In each epoch, for each training example (x̄^(i), s_1^(i), ē^(i)), i = 1 … N, we perform a rollout using our policy to generate an execution (lines 7-16). The length of the rollout is bounded by J, but may be shorter if the agent selects the STOP action. At each step j, the agent updates the agent context s̃_j (lines 9-10), samples an action from the policy π (line 12), and executes it to generate the new world state s_{j+1} (line 13). The gradient is approximated using the sampled action with the computed reward R^(i)(s_j, a_j) (line 15). Following each rollout, we update the parameters θ with the mean of the gradients using ADAM (Kingma and Ba, 2014).
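The entropy bonus H(π(s̃, ·)) in the regularized gradient is Shannon entropy over the action distribution; a small sketch showing its two extremes, a uniform policy (maximal entropy) and a collapsed deterministic policy (entropy 0):

```python
import numpy as np

def entropy(probs):
    """Shannon entropy H(pi(s~, .)) of an action distribution; the
    lambda-weighted gradient of this quantity is the exploration bonus."""
    p = np.asarray(probs, dtype=float)
    nz = p[p > 0]  # 0 * log 0 is taken as 0
    return float(-(nz * np.log(nz)).sum())

uniform = entropy([0.25] * 4)              # maximal for 4 actions: ln 4
collapsed = entropy([1.0, 0.0, 0.0, 0.0])  # deterministic policy: 0
```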

Reward Shaping
Reward shaping is a method for transforming a reward function by adding a shaping term to the problem reward. The goal is to generate more informative updates by adding information to the reward. We use this method to leverage the training demonstrations, a common form of supervision for training systems that map language to actions. Reward shaping allows us to fully use this type of supervision in a reinforcement learning framework, and to effectively combine learning from demonstrations and exploration.

Figure 3: Visualization of the shaping potentials for two tasks. We show demonstrations (blue arrows), but omit instructions. To visualize the potential intensity, we assume only the target block can be moved, while rewards and potentials are computed for any block movement. We illustrate the sparse problem reward (left column) as a potential function and consider only its positive component, which is focused on the goal. The middle column adds the distance-based potential. The right column adds both potentials.
Adding an arbitrary shaping term can change the optimality of policies and modify the original problem, for example by making policies that are bad according to the problem reward optimal according to the shaped function. Ng et al. (1999) and Wiewiora et al. (2003) outline potential-based terms that realize sufficient conditions for safe shaping. Adding a shaping term is safe if the order of policies according to the shaped reward is identical to the order according to the original problem reward. While safe shaping only applies to optimizing the total reward, we show empirically the effectiveness of the safe shaping terms we design in a contextual bandit setting.
We introduce two shaping terms. The final shaped reward is the sum of the two terms and the problem reward. Similar to the problem reward, we define example-specific shaping terms, and modify the reward function signature as required.

Distance-based Shaping (F_1) The first shaping term measures whether the agent moved closer to the goal state. We design it to be a safe potential-based term:

F_1^(i)(s_j, a_j, s_{j+1}) = φ_1^(i)(s_{j+1}) − φ_1^(i)(s_j) .

The potential φ_1^(i)(s) is proportional to the negative distance from the goal state s_g^(i):

φ_1^(i)(s) = −η ‖s − s_g^(i)‖ ,

where η is a constant scaling factor and ‖·‖ is a distance metric. In the block world, the distance between two states is the sum of the Euclidean distances between the positions of each block in the two states, and η is the inverse of the block width. The middle column in Figure 3 visualizes the potential φ_1^(i).

Trajectory-based Shaping (F_2) Distance-based shaping may lead the agent to sub-optimal states, for example when an obstacle blocks the direct path to the goal state, and the agent must temporarily increase its distance from the goal to bypass it. We incorporate complete trajectories by using a simplification of the shaping term introduced by Brys et al. (2015). Unlike F_1, it requires access to the previous state and action. It is based on the look-back advice shaping term of Wiewiora et al. (2003), who introduced safe potential-based shaping that considers the previous state and action. The second term is:

F_2^(i)(s_{j−1}, a_{j−1}, s_j, a_j) = φ_2^(i)(s_j, a_j) − φ_2^(i)(s_{j−1}, a_{j−1}) .

Given the demonstration ē^(i) = ⟨(s_1, a_1), …, (s_m, a_m)⟩, to compute the potential φ_2^(i)(s, a), we identify the state s_j in ē^(i) closest to s. If η‖s_j − s‖ < 1 and a_j = a, then φ_2^(i)(s, a) = 1.0; otherwise φ_2^(i)(s, a) = −δ_f, where δ_f is a penalty parameter. We use the same distance computation and parameter η as in F_1. When the agent is in a state close to a demonstration state, this term encourages taking the action taken in the related demonstration state. The right column in Figure 3 visualizes the effect of the potential φ_2^(i).
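Both shaping terms can be sketched directly from the definitions above; η = 1 and δ_f = 0.5 are assumed values, and states are tuples of (x, y) block positions. The final computation demonstrates the telescoping property that makes potential-based shaping safe: the summed shaping along a path depends only on its endpoints.

```python
import math

def block_distance(s1, s2):
    """State distance: sum of Euclidean distances between the positions
    of matching blocks (states are tuples of (x, y) block positions)."""
    return sum(math.dist(p, q) for p, q in zip(s1, s2))

def phi1(state, goal, eta=1.0):
    """Distance potential; eta = 1 stands in for the inverse block width."""
    return -eta * block_distance(state, goal)

def f1(prev_state, next_state, goal, eta=1.0):
    """Potential-based distance shaping: F1 = phi1(s') - phi1(s)."""
    return phi1(next_state, goal, eta) - phi1(prev_state, goal, eta)

def phi2(state, action, demo, eta=1.0, delta_f=0.5):
    """Trajectory potential: +1 when near the closest demonstration state
    and copying its action, else -delta_f (delta_f is an assumed value)."""
    s_j, a_j = min(demo, key=lambda sa: block_distance(sa[0], state))
    near = eta * block_distance(s_j, state) < 1
    return 1.0 if near and a_j == action else -delta_f

# One movable block; the demonstration moves it east twice, then stops.
demo = [(((0.0, 0.0),), "E"), (((1.0, 0.0),), "E"), (((2.0, 0.0),), "STOP")]
goal = ((2.0, 0.0),)
path = [((0.0, 0.0),), ((1.0, 0.0),), ((2.0, 0.0),)]

# Telescoping sum: total shaping equals phi1(end) - phi1(start).
total = sum(f1(path[k], path[k + 1], goal) for k in range(len(path) - 1))
```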

Experimental Setup
Environment We use the environment of Bisk et al. (2016). The original task required predicting the source and target positions for a single block given an instruction. In contrast, we address the task of moving blocks on the plane to execute instructions given visual input. This requires generating the complete sequence of actions needed to complete the instruction. The environment contains up to 20 blocks marked with logos or digits. Each block can be moved in four directions. Including the STOP action, in each step, the agent selects between 81 actions. The set of actions is constant and is not limited to the blocks present.
The transition function is deterministic. The size of each block step is 0.04 of the board size. The agent observes the board from above. We adopt a relatively challenging setup with a large action space. While a simpler setup, for example decomposing the problem to source and target prediction and using a planner, is likely to perform better, we aim to minimize task-specific assumptions and engineering of separate modules. However, to better understand the problem, we also report results for the decomposed task with a planner.
Data Bisk et al. (2016) collected a corpus of instructions paired with start and goal states. Figure 1 shows example instructions. The original data includes instructions for moving one block or multiple blocks. Single-block instructions are relatively similar to navigation instructions and referring expressions. While they present much of the complexity of natural language understanding and grounding, they rarely display the planning complexity of multi-block instructions, which are beyond the scope of this paper. Furthermore, the original data does not include demonstrations. While generating demonstrations for moving a single block is straightforward, disambiguating action ordering when multiple blocks are moved is challenging. Therefore, we focus on instructions where a single block changes its position between the start and goal states, and restrict demonstration generation to move the changed block. The remaining data, and the complexity it introduces, provide an important direction for future work.
To create demonstrations, we compute the shortest paths. While this process may introduce noise for instructions that specify particular trajectories (e.g., move SRI two steps north and . . . ) rather than only describing the goal state, analysis of the data shows this issue is limited. Out of 100 sampled instructions, 92 describe the goal state rather than the trajectory. A secondary source of noise is due to the discretization of the state space. As a result, the agent often cannot reach the exact target position. The demonstration error illustrates this problem (Table 3). To provide task completion reward during learning, we relax the state comparison, and consider states to be equal if the sum of block distances is under the size of one block.
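The relaxed goal test can be sketched as follows; reusing the 0.04 step size as the one-block threshold is an assumption, since the text gives the step size rather than the block size:

```python
import math

def task_completed(state, goal, block_size=0.04):
    """Relaxed state comparison used for the completion reward: states
    are considered equal when the summed per-block distances fall under
    the size of one block (0.04 here is an assumed stand-in value)."""
    total = sum(math.dist(p, q) for p, q in zip(state, goal))
    return total < block_size

goal = ((0.5, 0.5), (0.2, 0.8))
near = ((0.5, 0.51), (0.2, 0.8))  # 0.01 off in total: counts as done
far = ((0.5, 0.6), (0.2, 0.8))    # 0.10 off in total: not done
```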
The corpus includes 11,871/1,719/3,177 instructions for training/development/testing. Table 1 shows corpus statistics compared to the commonly used SAIL navigation corpus (MacMahon et al., 2006; Chen and Mooney, 2011). While the SAIL agent only observes its immediate surroundings, overall the blocks domain provides more complex instructions. Furthermore, the SAIL environment includes only 400 states, which is insufficient for generalization with vision input. We compare to other data sets in Appendix D.
Evaluation We evaluate task completion error as the sum of Euclidean distances for each block between its position at the end of the execution and in the gold goal state. We divide distances by block size to normalize for the image size. In contrast, Bisk et al. (2016) evaluate the selection of the source and target positions independently.
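The evaluation metric above can be sketched as follows; the 0.04 block size used for normalization is an assumed value:

```python
import math

def completion_error(final_state, goal_state, block_size=0.04):
    """Task-completion error: sum of per-block Euclidean distances between
    the final execution state and the gold goal state, divided by block
    size to normalize for image size (block_size is an assumed value)."""
    total = sum(math.dist(p, q) for p, q in zip(final_state, goal_state))
    return total / block_size

# One block is off by 0.08 along the y-axis: an error of two block widths.
err = completion_error(((0.50, 0.50), (0.20, 0.80)),
                       ((0.50, 0.42), (0.20, 0.80)))
```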
Systems We report performance of ablations, the upper bound of following the demonstrations (Demonstrations), and five baselines: (a) STOP: the agent immediately stops, (b) RANDOM: the agent takes random actions, (c) SUPERVISED: supervised learning with maximum-likelihood estimate using demonstration state-action pairs, (d) DQN: deep Q-learning with both shaping terms (Mnih et al., 2015), and (e) REINFORCE: policy gradient with cumulative episodic reward with both shaping terms (Sutton et al., 1999). Full system details are given in Appendix B.
Parameters and Initialization Full details are in Appendix C. We consider K = 4 previous images, and horizon length J = 40. We initialize our model with the SUPERVISED model.

Table 2 shows development results. We run each experiment three times and report the best result. The RANDOM and STOP baselines illustrate the complexity of the task. Our approach, including both shaping terms in a contextual bandit setting, significantly outperforms the other methods. SUPERVISED learning demonstrates lower performance. A likely explanation is test-time execution errors leading to unfamiliar states with poor later performance (Kakade and Langford, 2002), a form of the covariate shift problem. The low performance of REINFORCE and DQN illustrates the challenge of general reinforcement learning with limited data due to relatively high sample complexity (Kearns et al., 1999; Krishnamurthy et al., 2016). We also report results using ensembles of the three models. We ablate different parts of our approach. Ablations of supervised initialization (our approach w/o sup. init) or the previous action (our approach w/o prev. action) result in increased error. While the contribution of initialization is modest, it provides faster learning. On average, after two epochs, we observe an error of 3.94 with initialization and 6.01 without. We hypothesize that the F_2 shaping term, which uses full demonstrations, helps to narrow the gap at the end of learning. Without supervised initialization and F_2, the error increases to 5.45 (the 0% point in Figure 4). We observe the contribution of each shaping term and their combination. To study the benefit of potential-based shaping, we experiment with a negative distance-to-goal reward. This reward replaces the problem reward and encourages getting closer to the goal (our approach w/ distance reward). With this reward, learning fails to converge, leading to a relatively high error. Figure 4 shows performance when varying the amount of supervision.
We remove demonstrations from both supervised initialization and the F 2 shaping term. For example, when only 25% are available, only 25% of the data is available for initialization and the F 2 term is only present for this part of the data. While some demonstrations are necessary for effective learning, we get most of the benefit with only 12.5%. Table 3 provides test results, using the ensembles to decrease the risk of overfitting the development. We observe similar trends to development result with our approach outperforming all baselines. The remaining gap to the demonstrations upper bound illustrates the need for future work.

Results
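The distinction between potential-based shaping and the plain distance reward discussed above can be sketched as follows. The distance-based potential, the discount value, and all names are assumptions for illustration; the paper defines its own shaping terms.

```python
import math

GAMMA = 1.0  # discount factor; the value here is only illustrative

def potential(pos, goal):
    # Negative Euclidean distance to the goal as the potential function
    # (an assumed choice for illustration).
    return -math.dist(pos, goal)

def shaped_reward(problem_reward, pos, next_pos, goal):
    """Potential-based shaping: add F = gamma * phi(s') - phi(s) to the
    problem reward. This contrasts with the distance-reward variant,
    which replaces the problem reward with -distance outright and, in
    the experiments above, failed to converge."""
    return problem_reward + GAMMA * potential(next_pos, goal) - potential(pos, goal)

# Moving closer to the goal yields a positive shaping bonus,
# moving away yields a negative one.
closer = shaped_reward(0.0, (3, 4), (3, 3), (0, 0))
away = shaped_reward(0.0, (3, 4), (3, 5), (0, 0))
```

Because the shaping term telescopes over a trajectory, it guides exploration without altering which policies are optimal, which the plain distance reward does not guarantee.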
To better understand performance, we measure minimal distance (min. distance in Tables 2 and 3), the closest the agent got to the goal. We observe a strong trend: the agent often gets close to the goal and fails to stop. This behavior is also reflected in the number of steps the agent takes. While the mean number of steps in development demonstrations is 15.2, the agent generates on average 28.7 steps, and 55.2% of the time it takes the maximum number of allowed steps (40). Testing on the training data shows an average of 21.75 steps, and the agent exhausts the step budget 29.3% of the time. The mean number of steps in training demonstrations is 15.5. This illustrates the challenge of learning how to behave at an absorbing state, which is observed relatively rarely during training. This behavior is also visible in our video.

We also evaluate a supervised learning variant that assumes a perfect planner. This setup is similar to Bisk et al. (2016), except using raw image input. It allows us to roughly estimate how well the agent generates actions. We observe a mean error of 2.78 on the development set, an improvement of almost two points over supervised learning with our approach. This illustrates the complexity of the complete problem.
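The step statistics above can be computed with a short helper; the horizon constant and function names are assumptions for illustration.

```python
HORIZON = 40  # maximum steps allowed per episode (J in the paper)

def step_stats(episode_lengths, horizon=HORIZON):
    """Summarize agent behavior: mean episode length, and the fraction
    of episodes that exhaust the step budget, i.e., where the agent
    never chooses to stop on its own."""
    n = len(episode_lengths)
    mean_steps = sum(episode_lengths) / n
    exhausted = sum(1 for s in episode_lengths if s >= horizon) / n
    return mean_steps, exhausted

# Example: two of four episodes hit the 40-step budget.
stats = step_stats([40, 20, 40, 12])  # -> (28.0, 0.5)
```

Comparing these two numbers against the mean demonstration length is what exposes the failure-to-stop pattern.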
We conduct a shallow linguistic analysis to understand the agent's behavior with regard to differences in the language input. As expected, the agent is sensitive to unknown words. For instructions without unknown words, the mean development error is 3.49. It increases to 3.97 for instructions with a single unknown word, and to 4.19 for two. We also study the agent's behavior when observing new phrases composed of known words by looking at instructions with new n-grams and no unknown words. We observe no significant correlation between performance and new bi-grams or tri-grams. We also see no meaningful correlation between instruction length and performance. Although counterintuitive given the linguistic complexities of longer instructions, this aligns with results in machine translation (Luong et al., 2015).
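The two quantities used to bucket instructions in this analysis can be sketched as below; the whitespace tokenization and all names are simplifying assumptions, not the paper's exact preprocessing.

```python
def unknown_word_count(instruction, train_vocab):
    """Number of tokens in an instruction that were never seen during
    training; used to bucket instructions by out-of-vocabulary load."""
    return sum(1 for tok in instruction.lower().split()
               if tok not in train_vocab)

def new_ngrams(instruction, train_ngrams, n=2):
    """n-grams in the instruction that never appeared contiguously in
    training -- i.e., new compositions of (possibly familiar) words."""
    toks = instruction.lower().split()
    grams = {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    return grams - train_ngrams

vocab = {"move", "toyota", "right", "of", "sri"}
unk = unknown_word_count("move Toyota left", vocab)  # "left" is unknown
novel = new_ngrams("move toyota right", {("toyota", "right")}, n=2)
```

Buckets of instructions with 0, 1, or 2 unknown words (and with or without new n-grams of known words) are then evaluated separately to obtain the per-bucket errors reported above.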

Conclusions
We study the problem of learning to execute instructions in a situated environment given only raw visual observations. Supervised approaches do not explore adequately to handle test-time errors, and reinforcement learning approaches require a large number of samples to converge well. Our solution effectively combines both: reward shaping provides relatively stable optimization in a contextual bandit setting, taking advantage of a signal similar to supervised learning, while the reinforcement learning basis admits substantial exploration and easy avenues for smart initialization. This combination is designed for the few-sample regime we address. When the number of samples is unbounded, the drawbacks we observe in this scenario for optimizing longer-term reward do not hold.