Situated Mapping of Sequential Instructions to Actions with Single-step Reward Observation

We propose a learning approach for mapping context-dependent sequential instructions to actions. We address the problem of discourse and state dependencies with an attention-based model that considers both the history of the interaction and the state of the world. To train from start and goal states without access to demonstrations, we propose SESTRA, a learning algorithm that takes advantage of single-step reward observations and immediate expected reward maximization. We evaluate on the SCONE domains, and show absolute accuracy improvements of 9.8%-25.3% across the domains over approaches that use high-level logical representations.


Introduction
An agent executing a sequence of instructions must address multiple challenges, including grounding the language to its observed environment, reasoning about discourse dependencies, and generating actions to complete high-level goals. For example, consider the environment and instructions in Figure 1, in which a user describes moving chemicals between beakers and mixing chemicals together. To execute the second instruction, the agent needs to resolve "sixth beaker" and "last one" to objects in the environment. The third instruction requires resolving "it" to the rightmost beaker mentioned in the second instruction, and reasoning about the set of actions required to mix the colors in the beaker to brown. In this paper, we describe a model and learning approach to map sequences of instructions to actions. Our model considers previous utterances and the world state to select actions, learns to combine simple actions to achieve complex goals, and can be trained using goal states without access to demonstrations.

Figure 1: Example from the SCONE (Long et al., 2016) ALCHEMY domain, including a start state (top), sequence of instructions, and a goal state (bottom). Each instruction is annotated with a sequence of actions from the set of actions we define for ALCHEMY.
throw out first beaker: POP 1, STOP
pour sixth beaker into last one: POP 6, POP 6, PUSH 7 O, PUSH 7 O, STOP
it turns brown: POP 7, POP 7, POP 7, PUSH 7 B, PUSH 7 B, PUSH 7 B, STOP
pour purple beaker into yellow one: POP 3, PUSH 5 P, STOP
throw out two units of brown one: POP 7, POP 7, STOP
The majority of work on executing sequences of instructions focuses on mapping instructions to high-level formal representations, which are then evaluated to generate actions (e.g., Chen and Mooney, 2011; Long et al., 2016). For example, the third instruction in Figure 1 will be mapped to mix(prev_arg1), indicating that the mix action should be applied to the first argument of the previous action (Long et al., 2016; Guu et al., 2017). In contrast, we focus on directly generating the sequence of actions. This requires resolving references without explicitly modeling them, and learning the sequences of actions required to complete high-level actions; for example, that mixing requires removing everything in the beaker and replacing it with the same number of brown items.
A key challenge in executing sequences of instructions is considering contextual cues from both the history of the interaction and the state of the world. Instructions often refer to previously mentioned objects (e.g., "it" in Figure 1) or actions (e.g., "do it again"). The world state provides the set of objects the instruction may refer to, and implicitly determines the available actions. For example, liquid cannot be removed from an empty beaker. Both types of context continuously change during an interaction. As new instructions are given, the instruction history expands, and as the agent acts, the world state changes. We propose an attention-based model that takes as input the current instruction, previous instructions, the initial world state, and the current state. At each step, the model computes attention encodings of the different inputs, and predicts the next action to execute.
We train the model given instructions paired with start and goal states without access to the correct sequence of actions. During training, the agent learns from rewards received through exploring the environment with the learned policy by mapping instructions to sequences of actions. In practice, the agent learns to execute instructions gradually, correctly predicting prefixes of increasing length of the correct sequences as learning progresses. A key challenge is learning to correctly select actions that are only required later in execution sequences. Early during learning, these actions receive negative updates, and the agent learns to assign them low probabilities. This results in an exploration problem in later stages, where actions that are only required later are not sampled during exploration. For example, in the ALCHEMY domain shown in Figure 1, the agent behavior early during execution of instructions can be accomplished by only using POP actions. As a result, the agent quickly learns a strong bias against PUSH actions, which in practice prevents the policy from exploring them again. We address this with a learning algorithm that observes the reward for all possible actions for each visited state, and maximizes the immediate expected reward.
We evaluate our approach on SCONE (Long et al., 2016), which includes three domains, and is used to study recovering predicate logic meaning representations for sequential instructions. We study the problem of generating a sequence of low-level actions, and re-define the set of actions for each domain. For example, we treat the beakers in the ALCHEMY domain as stacks and use only POP and PUSH actions. Our approach robustly learns to execute sequential instructions with up to 89.1% task-completion accuracy for single instructions, and 62.7% for complete sequences. Our code is available at https://github.com/clic-lab/scone.

Technical Overview
Task and Notation Let $\mathcal{S}$ be the set of all possible world states, $\mathcal{X}$ be the set of all natural language instructions, and $\mathcal{A}$ be the set of all actions. An instruction $\bar{x} \in \mathcal{X}$ of length $|\bar{x}|$ is a sequence of tokens $\langle x_1, \dots, x_{|\bar{x}|} \rangle$. Executing an action modifies the world state following a transition function $T : \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$. For example, the ALCHEMY domain includes seven beakers that contain colored liquids. The world state defines the content of each beaker. We treat each beaker as a stack. The actions are POP N and PUSH N C, where $1 \le N \le 7$ is the beaker number and C is one of six colors. There are a total of 50 actions: 7 POP actions, 42 PUSH actions, and the STOP action. Section 6 describes the domains in detail.
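As a quick check of the action count, the ALCHEMY action set can be enumerated directly. This is a minimal sketch: the tuple representation and the single-letter color codes are assumptions made for illustration (the text only states that there are six colors).

```python
from itertools import product

NUM_BEAKERS = 7
# Hypothetical single-letter color codes; the paper only says "six colors".
COLORS = ["B", "G", "O", "P", "R", "Y"]


def alchemy_actions():
    """Enumerate POP N, PUSH N C, and STOP for 1 <= N <= 7."""
    beakers = range(1, NUM_BEAKERS + 1)
    pops = [("POP", n) for n in beakers]
    pushes = [("PUSH", n, c) for n, c in product(beakers, COLORS)]
    return pops + pushes + [("STOP",)]


ACTIONS = alchemy_actions()
# 7 POP + 7 * 6 PUSH + 1 STOP
print(len(ACTIONS))
```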
Given a start state $s_1$ and a sequence of instructions $\langle \bar{x}_1, \dots, \bar{x}_n \rangle$, our goal is to generate the sequence of actions specified by the instructions starting from $s_1$. We treat the execution of a sequence of instructions as executing each instruction in turn. The execution $\bar{e}$ of an instruction $\bar{x}_i$ starting at a state $s_1$ and given the history of the instruction sequence $\langle \bar{x}_1, \dots, \bar{x}_{i-1} \rangle$ is a sequence of state-action pairs $\bar{e} = \langle (s_1, a_1), \dots, (s_m, a_m) \rangle$, where $a_k \in \mathcal{A}$ and $s_{k+1} = T(s_k, a_k)$. The final action $a_m$ is the special action STOP, which indicates the execution has terminated. The final state is then $s_m$, as $T(s_k, \text{STOP}) = s_k$. Executing a sequence of instructions in order generates a sequence $\langle \bar{e}_1, \dots, \bar{e}_n \rangle$, where $\bar{e}_i$ is the execution of instruction $\bar{x}_i$. When referring to states and actions in an indexed execution $\bar{e}_i$, the $k$-th state and action are $s_{i,k}$ and $a_{i,k}$. We execute instructions one after the other: $\bar{e}_1$ starts at the interaction initial state $s_1$ and $s_{i+1,1} = s_{i,|\bar{e}_i|}$, where $s_{i+1,1}$ is the start state of $\bar{e}_{i+1}$ and $s_{i,|\bar{e}_i|}$ is the final state of $\bar{e}_i$.
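The notion of an execution can be illustrated with a small sketch of a stack-based transition function for ALCHEMY. The state representation (a tuple of seven tuples), the start state, and the choice to treat a PUSH into a full beaker as invalid are assumptions for illustration; the action sequence is the one annotated for the second instruction in Figure 1.

```python
def transition(state, action):
    """T(s, a): STOP and invalid actions return the state unchanged."""
    kind = action[0]
    if kind == "STOP":
        return state
    idx = action[1] - 1  # beakers are numbered from 1
    beakers = [list(b) for b in state]
    if kind == "POP":
        if not beakers[idx]:
            return state  # nothing to remove from an empty beaker
        beakers[idx].pop()
    elif kind == "PUSH":
        if len(beakers[idx]) >= 4:
            return state  # assumed capacity of four units per beaker
        beakers[idx].append(action[2])
    return tuple(tuple(b) for b in beakers)


def execute(state, actions):
    """Build the execution <(s_1, a_1), ..., (s_m, a_m)> and the final state."""
    execution = []
    for action in actions:
        execution.append((state, action))
        state = transition(state, action)
        if action[0] == "STOP":
            break
    return execution, state


# Hypothetical start state: beaker 6 holds two orange units, beaker 7 one green.
start = ((), (), (), (), (), ("O", "O"), ("G",))
pour = [("POP", 6), ("POP", 6), ("PUSH", 7, "O"), ("PUSH", 7, "O"), ("STOP",)]
execution, final = execute(start, pour)
print(final)
```

Because each state in the execution is recorded before its action is applied, the last pair holds the final state together with STOP, matching the convention that STOP leaves the state unchanged.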
Model We model the agent with a neural network policy (Section 4). At step $k$ of executing the $i$-th instruction, the model input is the current instruction $\bar{x}_i$, the previous instructions $\langle \bar{x}_1, \dots, \bar{x}_{i-1} \rangle$, the world state $s_1$ at the beginning of executing $\bar{x}_i$, and the current state $s_k$. The model predicts the next action $a_k$ to execute. If $a_k = \text{STOP}$, we switch to the next instruction, or if at the end of the instruction sequence, terminate. Otherwise, we update the state to $s_{k+1} = T(s_k, a_k)$. The model uses attention to process the different inputs and a recurrent neural network (RNN) decoder to generate actions (Bahdanau et al., 2015).
Learning We assume access to a set of $N$ instruction sequences, where each instruction in each sequence is paired with its start and goal states. During training, we create an example for each instruction. Formally, the training set is $\{(\bar{x}^{(j)}_i, s^{(j)}_{i,1}, \langle \bar{x}^{(j)}_1, \dots, \bar{x}^{(j)}_{i-1} \rangle, g^{(j)}_i)\}_{j=1,i=1}^{N,n^{(j)}}$, where $\bar{x}^{(j)}_i$ is an instruction, $s^{(j)}_{i,1}$ is a start state, $\langle \bar{x}^{(j)}_1, \dots, \bar{x}^{(j)}_{i-1} \rangle$ is the instruction history, $g^{(j)}_i$ is the goal state, and $n^{(j)}$ is the length of the $j$-th instruction sequence. This training data contains no evidence about the actions and intermediate states required to execute each instruction. We use a learning method that maximizes the expected immediate reward for a given state (Section 5). The reward accounts for task-completion and distance to the goal via potential-based reward shaping.
Evaluation We evaluate exact task completion for sequences of instructions on a test set $\{(s^{(j)}_1, \langle \bar{x}^{(j)}_1, \dots, \bar{x}^{(j)}_{n_j} \rangle, g^{(j)})\}_{j=1}^{N}$, where $g^{(j)}$ is the oracle goal state of executing instructions $\bar{x}^{(j)}_1, \dots, \bar{x}^{(j)}_{n_j}$ in order starting from $s^{(j)}_1$. We also evaluate single-instruction task completion using per-instruction annotated start and goal states.

Related Work
Executing instructions has been studied using the SAIL corpus (MacMahon et al., 2006) with focus on navigation using high-level logical representations (Chen and Mooney, 2011; Chen, 2012; Artzi and Zettlemoyer, 2013; Artzi et al., 2014) and low-level actions (Mei et al., 2016). While SAIL includes sequences of instructions, the data demonstrates limited discourse phenomena, and instructions are often processed in isolation. Approaches that consider as input the entire sequence focused on segmentation (Andreas and Klein, 2015). Recently, other navigation tasks were proposed with focus on single instructions (Anderson et al., 2018; Janner et al., 2018). We focus on sequences of environment manipulation instructions and modeling contextual cues from both the changing environment and instruction history. Manipulation using single-sentence instructions has been studied using the Blocks domain (Bisk et al., 2016, 2018; Misra et al., 2017; Tan and Bansal, 2018). Our work is related to the work of Branavan et al. (2009) and Vogel and Jurafsky (2010). While both study executing sequences of instructions, similar to SAIL, the data includes limited discourse dependencies. In addition, both learn with rewards computed from surface-form similarity between text in the environment and the instruction. We do not rely on such similarities, but instead use a state distance metric.
Language understanding in interactive scenarios that include multiple turns has been studied with focus on dialogue for querying database systems using the ATIS corpus (Hemphill et al., 1990; Dahl et al., 1994). Tür et al. (2010) survey work on ATIS. Miller et al. (1996), Zettlemoyer and Collins (2009), and Suhr et al. (2018) modeled context dependence in ATIS for generating formal representations. In contrast, we focus on environments that change during execution and directly generating environment actions, a scenario that is more related to robotic agents than database query.
The SCONE corpus (Long et al., 2016) was designed to reflect a broad set of discourse context-dependence phenomena. It was studied extensively using logical meaning representations (Long et al., 2016; Guu et al., 2017; Fried et al., 2018). In contrast, we are interested in directly generating actions that modify the environment. This requires generating lower-level actions and learning procedures that are otherwise hardcoded in the logic (e.g., the mixing action in Figure 1). Except for Fried et al. (2018), previous work on SCONE assumes access only to the initial and final states during training. This form of supervision does not require operating the agent manually to acquire the correct sequence of actions, a difficult task in robotic agents with complex control. Goal state supervision has been studied for instructional language (e.g., Branavan et al., 2009; Artzi and Zettlemoyer, 2013; Bisk et al., 2016), and more extensively in question answering when learning with answer annotations only (e.g., Clarke et al., 2010; Liang et al., 2011; Kwiatkowski et al., 2013; Berant et al., 2013; Berant and Liang, 2014, 2015; Liang et al., 2017).

Model
We map sequences of instructions $\langle \bar{x}_1, \dots, \bar{x}_n \rangle$ to actions by executing the instructions in order. The model generates an execution $\bar{e} = \langle (s_1, a_1), \dots, (s_{m_i}, a_{m_i}) \rangle$ for each instruction $\bar{x}_i$. The agent context, the information available to the agent at step $k$, is $\tilde{s}_k = (\bar{x}_i, \langle \bar{x}_1, \dots, \bar{x}_{i-1} \rangle, s_k, \bar{e}[:k])$, where $\bar{e}[:k]$ is the execution up until but not including step $k$. In contrast to the world state, the agent context also includes instructions and the execution so far. The agent policy $\pi_\theta(\tilde{s}_k, a)$ is modeled as a probabilistic neural network parametrized by $\theta$, where $\tilde{s}_k$ is the agent context at step $k$ and $a$ is an action. To generate executions, we generate one action at a time, execute the action, and observe the new world state. In step $k$ of executing the $i$-th instruction, the network inputs are the current utterance $\bar{x}_i$, the previous instructions $\langle \bar{x}_1, \dots, \bar{x}_{i-1} \rangle$, the initial state $s_1$ at the beginning of executing $\bar{x}_i$, and the current state $s_k$. When executing a sequence of instructions, the initial state $s_1$ is either the state at the beginning of executing the sequence or the final state of the execution of the previous instruction. Figure 2 illustrates our architecture.

Figure 2: Illustration of the model architecture while generating the third action $a_3$ in the third utterance $\bar{x}_3$ from Figure 1. Context vectors computed using attention are highlighted in blue. The model takes as input vector encodings from the current and previous instructions $\bar{x}_1$, $\bar{x}_2$, and $\bar{x}_3$, the initial state $s_1$, the current state $s_3$, and the previous action $a_2$. Instruction encodings are computed with a bidirectional RNN. We attend over the previous and current instructions and the initial and current states. We use an MLP to select the next action.
We generate continuous vector representations for all inputs. Each input is represented as a set of vectors that are then processed with an attention function to generate a single vector representation (Luong et al., 2015). We assume access to a domain-specific encoding function ENC($s$) that, given a state $s$, generates a set of vectors $\mathbf{S}$ representing the objects in the state. For example, in the ALCHEMY domain, a vector is generated for each beaker using an RNN. Section 6 describes the different domains and their encoding functions.
We use a single bidirectional RNN with a long short-term memory (LSTM; Hochreiter and Schmidhuber, 1997) recurrence to encode the instructions. All instructions $\bar{x}_1, \dots, \bar{x}_i$ are encoded with a single RNN by concatenating them to $\bar{x}'$. We use two delimiter tokens: one separates previous instructions, and the other separates the previous instructions from the current one. The forward LSTM RNN hidden states are computed as:

$$\overrightarrow{\mathbf{h}}_{j+1} = \overrightarrow{\text{LSTM}^E}\left(\phi^I(x'_{j+1}); \overrightarrow{\mathbf{h}}_j\right),$$

where $\phi^I$ is a learned word embedding function and $\overrightarrow{\text{LSTM}^E}$ is the forward LSTM recurrence function. We use a similar computation to compute the backward hidden states $\overleftarrow{\mathbf{h}}_j$. For each token $x'_j$ in $\bar{x}'$, a vector representation $\mathbf{h}_j = [\overrightarrow{\mathbf{h}}_j; \overleftarrow{\mathbf{h}}_j]$ is computed. We then create two sets of vectors, one for all the vectors of the current instruction and one for the previous instructions:

$$\mathbf{X}^c = \{\mathbf{h}_j\}_{j=J}^{|\bar{x}'|}, \qquad \mathbf{X}^p = \{\mathbf{h}_j\}_{j=1}^{J-1},$$

where $J$ is the index in $\bar{x}'$ where the current instruction $\bar{x}_i$ begins. Separating the vectors into two sets allows computing separate attention on the current instruction and previous ones.
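The concatenation step and the split index can be sketched as follows. The delimiter token strings, the 0-based index, and the choice to omit the second delimiter when there is no history are assumptions for illustration; the text only specifies that two distinct delimiter tokens are used.

```python
PREV_SEP = "<prev_sep>"  # hypothetical token between previous instructions
CUR_SEP = "<cur_sep>"    # hypothetical token before the current instruction


def concat_instructions(previous, current):
    """Concatenate tokenized instructions; return the tokens and the
    0-based index at which the current instruction begins."""
    tokens = []
    for k, instruction in enumerate(previous):
        if k > 0:
            tokens.append(PREV_SEP)
        tokens.extend(instruction)
    if previous:
        tokens.append(CUR_SEP)
    start = len(tokens)
    tokens.extend(current)
    return tokens, start


previous = [
    ["throw", "out", "first", "beaker"],
    ["pour", "sixth", "beaker", "into", "last", "one"],
]
current = ["it", "turns", "brown"]
tokens, start = concat_instructions(previous, current)

# After encoding, hidden states at positions >= start would form the
# current-instruction set; those before it form the history set.
current_positions = list(range(start, len(tokens)))
history_positions = list(range(0, start))
print(start, tokens[start:])
```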
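To compute each input representation during decoding, we use a bi-linear attention function (Luong et al., 2015). Given a set of vectors $\mathbf{H}$, a query vector $\mathbf{h}^q$, and a weight matrix $\mathbf{W}$, the attention function ATTEND($\mathbf{H}$, $\mathbf{h}^q$, $\mathbf{W}$) computes a context vector $\mathbf{z}$:

$$s_i = \mathbf{h}_i^\top \mathbf{W} \mathbf{h}^q, \qquad \alpha = \text{softmax}(s), \qquad \mathbf{z} = \sum_{i=1}^{|\mathbf{H}|} \alpha_i \mathbf{h}_i.$$

We use a decoder to generate actions. At each time step $k$, we compute an input representation using the attention function, update the decoder state, and compute the next action to execute. Attention is first computed over the vectors of the current instruction, which is then used to attend over the other inputs. We compute the context vectors $\mathbf{z}^c_k$ and $\mathbf{z}^p_k$ for the current instruction and previous instructions:

$$\mathbf{z}^c_k = \text{ATTEND}(\mathbf{X}^c, \mathbf{h}^d_{k-1}, \mathbf{W}^c), \qquad \mathbf{z}^p_k = \text{ATTEND}(\mathbf{X}^p, [\mathbf{h}^d_{k-1}; \mathbf{z}^c_k], \mathbf{W}^p),$$

where $\mathbf{h}^d_{k-1}$ is the decoder hidden state for step $k-1$, and $\mathbf{X}^c$ and $\mathbf{X}^p$ are the sets of vector representations for the current instruction and previous instructions.

Two attention heads are used over both the initial and current states. This allows the model to attend to more than one location in a state at once, for example when transferring items from one beaker to another in ALCHEMY. The current state is computed by the transition function $s_k = T(s_{k-1}, a_{k-1})$, where $s_{k-1}$ and $a_{k-1}$ are the state and action at step $k-1$. The context vectors for the initial state $s_1$ and the current state $s_k$ are:

$$\mathbf{z}^{s_1,h}_k = \text{ATTEND}(\text{ENC}(s_1), [\mathbf{h}^d_{k-1}; \mathbf{z}^c_k], \mathbf{W}^{s_1,h}), \qquad \mathbf{z}^{s_k,h}_k = \text{ATTEND}(\text{ENC}(s_k), [\mathbf{h}^d_{k-1}; \mathbf{z}^c_k], \mathbf{W}^{s_k,h}), \qquad h \in \{1, 2\},$$

where all $\mathbf{W}^{*,*}$ are learned weight matrices.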
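A bi-linear attention function of this kind can be sketched in plain Python. This is a minimal illustration under the standard Luong-style assumption that each score is $\mathbf{h}_i^\top \mathbf{W} \mathbf{h}^q$; the function name `attend`, the list-based vectors, and the toy inputs are hypothetical and not taken from the released code.

```python
import math


def attend(H, h_q, W):
    """Bi-linear attention: softmax over scores h_i^T W h_q, then a
    weighted sum of the vectors in H. W has len(H[0]) rows and
    len(h_q) columns. Returns the context vector and the weights."""
    # W h_q, shared by all scores
    Wq = [sum(w * q for w, q in zip(row, h_q)) for row in W]
    scores = [sum(x * v for x, v in zip(h, Wq)) for h in H]
    # numerically stable softmax
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(H[0])
    z = [sum(a * h[d] for a, h in zip(alphas, H)) for d in range(dim)]
    return z, alphas


# Toy example: two vectors, identity weight matrix, query aligned with the first.
H = [[1.0, 0.0], [0.0, 1.0]]
h_q = [1.0, 0.0]
W = [[1.0, 0.0], [0.0, 1.0]]
z, alphas = attend(H, h_q, W)
print(alphas, z)
```

With a query built by concatenating the decoder state and the current-instruction context, the same function would serve for the history and state attention, given a weight matrix with a matching number of columns.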
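We concatenate all computed context vectors with an embedding of the previous action $a_{k-1}$ to create the input for the decoder:

$$\mathbf{h}^d_k = \text{LSTM}^D\left([\mathbf{z}^c_k; \mathbf{z}^p_k; \mathbf{z}^{s_1,1}_k; \mathbf{z}^{s_1,2}_k; \mathbf{z}^{s_k,1}_k; \mathbf{z}^{s_k,2}_k; \phi^O(a_{k-1})]; \mathbf{h}^d_{k-1}\right),$$

where $\phi^O$ is a learned action embedding function and $\text{LSTM}^D$ is the LSTM decoder recurrence.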
Given the decoder state $\mathbf{h}^d_k$, the next action $a_k$ is predicted with a multi-layer perceptron (MLP). The actions in our domains decompose to an action type and at most two arguments (we use a NULL argument for unused arguments). For example, the action PUSH 1 B in ALCHEMY has the type PUSH and two arguments: a beaker number and a color. Section 6 describes the actions of each domain. The probability of an action is:

$$\mathbf{h}^a_k = \tanh(\mathbf{h}^d_k \mathbf{W}^a), \qquad s_{k,a_T} = \mathbf{h}^a_k \mathbf{b}_{a_T}, \qquad s_{k,a_1} = \mathbf{h}^a_k \mathbf{b}_{a_1}, \qquad s_{k,a_2} = \mathbf{h}^a_k \mathbf{b}_{a_2},$$

$$P(a_k = a_T(a_1, a_2) \mid \tilde{s}_k; \theta) \propto \exp(s_{k,a_T} + s_{k,a_1} + s_{k,a_2}),$$

where $a_T$, $a_1$, and $a_2$ are an action type, first argument, and second argument. If the predicted action is STOP, the execution is complete. Otherwise, we execute the action $a_k$ to generate the next state $s_{k+1}$, and update the agent context $\tilde{s}_k$ to $\tilde{s}_{k+1}$ by appending the pair $(s_k, a_k)$ to the execution $\bar{e}$ and replacing the current state with $s_{k+1}$.
The model parameters θ include: the embedding functions φ I and φ O ; the recurrence parameters for

Learning
We estimate the policy parameters $\theta$ using an exploration-based learning algorithm that maximizes the immediate expected reward. Broadly speaking, during learning, we observe the agent behavior given the current policy, and for each visited state compute the expected immediate reward by observing rewards for all actions. We assume access to a set of training examples $\{(\bar{x}^{(j)}_i, s^{(j)}_{i,1}, \langle \bar{x}^{(j)}_1, \dots, \bar{x}^{(j)}_{i-1} \rangle, g^{(j)}_i)\}_{j=1,i=1}^{N,n^{(j)}}$, where each instruction $\bar{x}^{(j)}_i$ is paired with a start state $s^{(j)}_{i,1}$, the previous instructions in the sequence $\langle \bar{x}^{(j)}_1, \dots, \bar{x}^{(j)}_{i-1} \rangle$, and a goal state $g^{(j)}_i$.
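Algorithm 1 SESTRA: Single-step Reward Observation.
Input: Training data $\{(\bar{x}^{(j)}_i, s^{(j)}_{i,1}, \langle \bar{x}^{(j)}_1, \dots, \bar{x}^{(j)}_{i-1} \rangle, g^{(j)}_i)\}_{j=1,i=1}^{N,n^{(j)}}$, learning rate $\mu$, entropy regularization coefficient $\lambda$, episode limit horizon $M$.
Definitions: $\pi_\theta$ is a policy parameterized by $\theta$, BEG is a special action to use for the first decoder step, and STOP indicates end of an execution. $T(s, a)$ is the state transition function, $H$ is an entropy function, $R^{(j)}_i(s, a, s')$ is the reward function for example $j$ and instruction $i$, and RMSPROP divides each weight by a running average of its squared gradient (Tieleman and Hinton, 2012).
Output: Parameters $\theta$ defining a learned policy $\pi_\theta$.
 1: for $t = 1, \dots, T$, $j = 1, \dots, N$ do
 2:   for $i = 1, \dots, n^{(j)}$ do
 3:     $\bar{e} \leftarrow \langle \rangle$, $k \leftarrow 0$, $a_0 \leftarrow$ BEG
 4:     » Rollout up to STOP or episode limit.
 5:     while $a_k \neq$ STOP $\wedge$ $k < M$ do
 6:       $k \leftarrow k + 1$
 7:       $\tilde{s}_k \leftarrow (\bar{x}_i, \langle \bar{x}_1, \dots, \bar{x}_{i-1} \rangle, s_k, \bar{e}[:k])$
 8:       » Sample an action from policy.
 9:       $a_k \sim \pi_\theta(\tilde{s}_k, \cdot)$
10:       $s_{k+1} \leftarrow T(s_k, a_k)$
11:       $\bar{e} \leftarrow [\bar{e}; \langle (s_k, a_k) \rangle]$
12:     $\Delta \leftarrow 0$
13:     for $k' = 1, \dots, k$ do
14:       » Compute the entropy of $\pi_\theta(\tilde{s}_{k'}, \cdot)$.
15:       $\Delta \leftarrow \Delta + \lambda \nabla_\theta H(\pi_\theta(\tilde{s}_{k'}, \cdot))$
16:       for $a \in \mathcal{A}$ do
17:         $s' \leftarrow T(s_{k'}, a)$
18:         » Compute gradient for action $a$.
19:         $\Delta \leftarrow \Delta + R^{(j)}_i(s_{k'}, a, s') \nabla_\theta \pi_\theta(\tilde{s}_{k'}, a)$
20:     $\theta \leftarrow \theta + \mu\,\text{RMSPROP}(\Delta / k)$
21: return $\theta$

Reward The reward $R^{(j)}_i : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$ is defined for each example $j$ and instruction $i$:

$$R^{(j)}_i(s, a, s') = P^{(j)}_i(s, a, s') + \phi^{(j)}_i(s') - \phi^{(j)}_i(s),$$

where $s$ is a source state, $a$ is an action, and $s'$ is a target state. $P^{(j)}_i(s, a, s')$ is a problem reward and $\phi^{(j)}_i(s)$ is a shaping term. The problem reward $P^{(j)}_i(s, a, s')$ is positive for stopping at the goal $g^{(j)}_i$ and negative for stopping in an incorrect state or taking an invalid action:

$$P^{(j)}_i(s, a, s') = \begin{cases} 1.0 & a = \text{STOP} \wedge s' = g^{(j)}_i \\ -1.0 & a = \text{STOP} \wedge s' \neq g^{(j)}_i \\ -1.0 - \delta & a \neq \text{STOP} \wedge s = s' \\ -\delta & \text{otherwise,} \end{cases}$$

where $\delta$ is a verbosity penalty. The case $s = s'$ indicates that $a$ was invalid in state $s$, as in this domain, all valid actions except STOP modify the state. We use a potential-based shaping term $\phi^{(j)}_i(s)$ (Ng et al., 1999), where $\phi^{(j)}_i(s) = -\|s - g^{(j)}_i\|$ and $\|s - g^{(j)}_i\|$ computes the edit distance between the state $s$ and the goal, measured over the objects in each state. The shaping term densifies the reward, providing a meaningful signal for learning in nonterminal states.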
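The structure of the shaped reward can be sketched as follows. Several details are assumptions for illustration: the concrete problem-reward values (+1, -1, -1 - δ, -δ) are chosen to be consistent with the description (positive for stopping at the goal, negative for a wrong stop or an invalid action, δ as a verbosity penalty), the state distance is taken to be a sum of per-beaker Levenshtein distances, and actions are plain strings.

```python
def levenshtein(a, b):
    """Edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def state_distance(state, goal):
    """Assumed state distance: sum of per-object edit distances."""
    return sum(levenshtein(x, y) for x, y in zip(state, goal))


def problem_reward(s, a, s_next, goal, delta):
    if a == "STOP":
        return 1.0 if s_next == goal else -1.0
    if s_next == s:
        return -1.0 - delta  # the action did not change the state: invalid
    return -delta  # verbosity penalty for every other action


def reward(s, a, s_next, goal, delta=0.1):
    """Problem reward plus potential-based shaping with phi(s) = -distance."""
    phi_s = -state_distance(s, goal)
    phi_next = -state_distance(s_next, goal)
    return problem_reward(s, a, s_next, goal, delta) + phi_next - phi_s


# Two-beaker toy example: moving one orange unit from beaker 1 to beaker 2.
s = (("O", "O"), ("G",))
goal = (("O",), ("G", "O"))
s_next = (("O",), ("G",))  # result of POP 1
print(reward(s, "POP 1", s_next, goal))
```

In this toy case the POP moves the state one edit closer to the goal, so the shaping term contributes +1 and outweighs the verbosity penalty, which is the kind of dense signal the shaping term is meant to provide in nonterminal states.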
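Objective We maximize the immediate expected reward over all actions and use entropy regularization. The gradient is approximated by sampling an execution $\bar{e} = \langle (s_1, a_1), \dots, (s_k, a_k) \rangle$ using our current policy:

$$\nabla_\theta J \approx \frac{1}{k} \sum_{k'=1}^{k} \left( \sum_{a \in \mathcal{A}} R^{(j)}_i(s_{k'}, a, T(s_{k'}, a)) \nabla_\theta \pi_\theta(\tilde{s}_{k'}, a) + \lambda \nabla_\theta H(\pi_\theta(\tilde{s}_{k'}, \cdot)) \right),$$

where $H(\pi_\theta(\tilde{s}_{k'}, \cdot))$ is the entropy term.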
Algorithm Algorithm 1 shows the Single-step Reward Observation (SESTRA) learning algorithm. We iterate over the training data $T$ times (line 1). For each example $j$ and turn $i$, we first perform a rollout by sampling an execution $\bar{e}$ from $\pi_\theta$ with at most $M$ actions (lines 5-11). If the rollout reaches the horizon without predicting STOP, we set the problem reward $P^{(j)}_i$ to $-1.0$ for the last step. Given the sampled states visited, we compute the entropy (line 15) and observe the immediate reward for all actions (line 19) for each step. Entropy and rewards are used to accumulate the gradient, which is applied to the parameters using RMSPROP (Dauphin et al., 2015) (line 20).
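The per-state update can be illustrated with a toy softmax policy over three actions. This is only a sketch under simplifying assumptions: the policy is a vector of logits rather than the neural model, plain gradient ascent replaces RMSPROP, and the logits and rewards are hypothetical values chosen to mimic a state where PUSH is correct but has been driven to low probability.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def single_step_gradient(logits, rewards, lam=0.0):
    """Gradient w.r.t. the logits of sum_a R(a) * pi(a) + lam * H(pi),
    where rewards holds the observed reward of every action."""
    pi = softmax(logits)
    expected = sum(p * r for p, r in zip(pi, rewards))
    entropy = -sum(p * math.log(p) for p in pi)
    grad = []
    for p, r in zip(pi, rewards):
        reward_term = p * (r - expected)
        entropy_term = -p * (math.log(p) + entropy)
        grad.append(reward_term + lam * entropy_term)
    return grad


ACTION_NAMES = ["POP", "PUSH", "STOP"]
logits = [2.0, -3.0, 0.0]     # hypothetical learned bias against PUSH
rewards = [-1.0, 1.0, -1.0]   # hypothetical: only PUSH is rewarded here

probs_before = softmax(logits)
first_grad = single_step_gradient(logits, rewards)
for _ in range(200):
    grad = single_step_gradient(logits, rewards)
    logits = [l + 0.5 * g for l, g in zip(logits, grad)]
probs_after = softmax(logits)
print(dict(zip(ACTION_NAMES, probs_after)))
```

Because the reward of every action enters the gradient, the PUSH logit receives a positive update even though PUSH would almost never be sampled from the initial policy.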
Discussion Observing the rewards for all actions for each visited state addresses an on-policy learning exploration problem. Actions that consistently receive negative reward early during learning will be visited with very low probability later on, and in practice, often not explored at all. Because the network is randomly initialized, these early negative rewards are translated into strong general biases that are not grounded well in the observed context. Our algorithm exposes the agent to such actions later on when they receive positive rewards even though the agent does not explore them during rollout. For example, in ALCHEMY, POP actions are sufficient to complete the first steps of good executions. As a result, early during learning, the agent learns a strong bias against PUSH actions. In practice, the agent then will not explore PUSH actions again. In our algorithm, as the agent learns to roll out the correct POP prefix, it is then exposed to the reward for the first PUSH even though it likely sampled another POP. It then unlearns its bias towards predicting POP.
Our learning algorithm can be viewed as a cost-sensitive variant of the oracle in DAGGER (Ross et al., 2011), where it provides the rewards for all actions instead of an oracle action. It is also related to Locally Optimal Learning to Search (LOLS; Chang et al., 2015) with two key distinctions: (a) instead of using different roll-in and roll-out policies, we use the model policy; and (b) we branch at each step, instead of once, but do not roll out from branched actions since we only optimize the immediate reward. Figure 3 illustrates the comparison. Our summation over immediate rewards for all actions is related to the summation of estimated Q-values for all actions in the Mean Actor-Critic algorithm (Asadi et al., 2017). Finally, our approach is related to Misra et al. (2017), who also maximize the immediate reward, but do not observe rewards for all actions for each state.

Figure 3: Illustration of LOLS (left; Chang et al., 2015) and our learning algorithm (SESTRA, right). LOLS branches a single time, and samples a complete rollout for each branch to obtain the trajectory loss. SESTRA uses a complete on-policy rollout and single-step branching for all actions in each sample state.

Table 2: We count occurrences of coreference between instructions (e.g., "he leaves" in SCENE) and ellipsis (e.g., "then, drain 2 units" in ALCHEMY), when the last explicit mention of the referent was 1, 2, 3, or 4 turns in the past. We also report the average number of multi-turn references per interaction (Refs/Ex).

SCONE Domains and Data
SCONE has three domains: ALCHEMY, SCENE, and TANGRAMS. Each interaction contains five instructions. Table 1 shows data statistics. Table 2 shows discourse reference analysis. State encodings are detailed in the Supplementary Material.
ALCHEMY Each environment in ALCHEMY contains seven numbered beakers, each containing up to four colored chemicals in order. Figure 1 shows an example. Instructions describe pouring chemicals between and out of beakers, and mixing beakers. We treat all beakers as stacks. There are two action types: PUSH and POP. POP takes a beaker index, and removes the top color. PUSH takes a beaker index and a color, and adds the color at the top of the beaker. To encode a state, we encode each beaker with an RNN, and concatenate the last output with the beaker index embedding. The set of vectors is the state embedding.
SCENE Each environment in SCENE contains ten positions, each containing at most one person defined by a shirt color and an optional hat color. Instructions describe adding or removing people, moving a person to another position, and moving a person's hat to another person. There are four action types: ADD_PERSON, ADD_HAT, REMOVE_PERSON, and REMOVE_HAT. ADD_PERSON and ADD_HAT take a position to place the person or hat and the color of the person's shirt or hat. REMOVE_PERSON and REMOVE_HAT take the position to remove a person or hat from. To encode a state, we use a bidirectional RNN over the ordered positions. The input for each position is a concatenation of the color embeddings for the person and hat. The set of RNN hidden states is the state embedding.
TANGRAMS Each environment in TANGRAMS is a list containing at most five unique objects. Instructions describe removing or inserting an object into a position in the list, or swapping the positions of two items. There are two action types: INSERT and REMOVE. INSERT takes the position to insert an object, and the object identifier. REMOVE takes an object position. We embed each object by concatenating embeddings for its type and position. The resulting set is the state embedding.

Experimental Setup
Evaluation Following Long et al. (2016), we evaluate task completion accuracy using exact match between the final state and the annotated goal state. We report accuracy for complete interactions (5utts), the first three utterances of each interaction (3utts), and single instructions (Inst). For single instructions, execution starts from the annotated start state of the instruction.
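The interaction-level metrics can be sketched as an exact-match comparison after a fixed number of utterances. The data layout is an assumption for illustration: each interaction is a pair of lists holding the predicted and annotated states reached after each instruction when the instructions are executed in sequence, and the state strings are placeholders.

```python
def interaction_accuracy(interactions, num_utterances):
    """Fraction of interactions whose predicted state after the first
    num_utterances instructions exactly matches the annotated goal."""
    if not interactions:
        raise ValueError("no interactions to evaluate")
    correct = sum(
        predicted[num_utterances - 1] == goal[num_utterances - 1]
        for predicted, goal in interactions
    )
    return correct / len(interactions)


# Two hypothetical interactions with placeholder states; the second one
# goes wrong at the fourth instruction and never recovers.
interactions = [
    (["s1", "s2", "s3", "s4", "s5"], ["s1", "s2", "s3", "s4", "s5"]),
    (["t1", "t2", "t3", "bad", "bad"], ["t1", "t2", "t3", "t4", "t5"]),
]
print(interaction_accuracy(interactions, 3))  # 3utts
print(interaction_accuracy(interactions, 5))  # 5utts
```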
Systems We report performance of ablations and two baseline systems: POLICYGRADIENT: policy gradient with cumulative episodic reward without a baseline, and CONTEXTUALBANDIT: the contextual bandit approach of Misra et al. (2017). Both systems use the reward with the shaping term and our model. We also report supervised learning results (SUPERVISED) by heuristically generating correct executions and computing the maximum-likelihood estimate using context-action demonstration pairs. Only the supervised approach uses the heuristically generated labels. Although the results are not comparable, we also report the performance of previous approaches to SCONE. All three approaches generate logical representations based on lambda calculus. In contrast to our approach, this requires an ontology of hand-built symbols and rules to evaluate the logical forms. Fried et al. (2018) use supervised learning with annotated logical forms.
Training Details For test results, we run each experiment five times and report results for the model with best validation interaction accuracy. For ablations, we do the same with three experiments. We use a batch size of 20. We stop training using a validation set sampled from the training data. We hold the validation set constant for each domain for all experiments. We use patience over the average reward, and select the best model using interaction-level (5utts) validation accuracy. We tune λ, δ, and M on the development set. The selected values and other implementation details are described in the Supplementary Material.

Results
Table 3 shows test results. Our approach significantly outperforms POLICYGRADIENT and CONTEXTUALBANDIT, both of which suffer due to biases learned early during learning, hindering later exploration. This problem does not appear in TANGRAMS, where no action type is dominant at the beginning of executions, and all methods perform well. POLICYGRADIENT completely fails to learn ALCHEMY and SCENE due to observing only negative total rewards early during learning.
Using a baseline, for example with an actor-critic method, will potentially close the gap to CONTEXTUALBANDIT. However, it is unlikely to address the on-policy exploration problem.
Table 4 shows development results, including model ablation studies. Removing previous instructions (-previous instructions) or both states (-current and initial state) reduces performance across all domains. Removing only the initial state (-initial state) or the current state (-current state) shows mixed results across the domains. Providing access to both initial and current states increases performance for ALCHEMY, but reduces performance on the other domains. We hypothesize that this is due to the increase in the number of parameters outweighing what is relatively marginal information for these domains. In our development and test results we use a single architecture across the three domains, the full approach, which has the highest interaction-level accuracy when averaged across the three domains (62.7 5utts). We also report mean and standard deviation for our approach over five trials. We observe exceptionally high variance in performance on SCENE, where some experiments fail to learn and training performance remains exceptionally low (Figure 4). This highlights the sensitivity of the model to the random effects of initialization, dropout, and ordering of training examples.
We analyze the instruction-level errors made by our best models when the agent is provided the correct initial state for the instruction. We study fifty examples in each domain to identify the type of failures. Table 5 shows the counts of major error categories. We consider multiple reference resolution errors. State reference errors indicate a failure to resolve a reference to the world state. For example, in ALCHEMY, the phrase leftmost red beaker specifies a beaker in the environment. If the model picked the correct action, but the wrong beaker, we count it as a state reference error. We distinguish between multi-turn reference errors that should be feasible, and those that are impossible to solve without access to states before executing previous utterances, which are not provided to our model. For example, in TANGRAMS, the instruction put it back in the same place refers to a previously removed item. Because the agent only has access to the world state after following this instruction, it does not observe what kind of item was previously removed, and cannot identify the item to add. We also find a significant number of errors due to ambiguous or incorrect instructions. For example, the SCENE instruction person in green appears on the right end is ambiguous. In the annotated goal, it is interpreted as referring to a person already in the environment, who moves to the 10th position. However, it can also be interpreted as a new person in green appearing in the 10th position.
We also study performance with respect to multi-turn coreference by observing whether the model was able to identify the correct referent for each occurrence included in the analysis in Table 2. The models were able to correctly resolve 92.3%, 88.7%, and 76.0% of references in ALCHEMY, SCENE, and TANGRAMS respectively.
Finally, we include attention visualization for examples from the three domains in the Supplementary Material.

Discussion
We propose a model to reason about context-dependent instructional language that displays strong dependencies both on the history of the interaction and the state of the world. Future modeling work may include using intermediate world states from previous turns in the interaction, which is required for some of the most complex references in the data. We propose to train our model using SESTRA, a learning algorithm that takes advantage of single-step reward observations to overcome learned biases in on-policy learning. Our learning approach requires additional reward observations in comparison to conventional reinforcement learning. However, it is particularly suitable to recovering from biases acquired early during learning, for example due to biased action spaces, which is likely to lead to incorrect blame assignment in neural network policies. When the domain and model are less susceptible to such biases, the benefit of the additional reward observations is less pronounced. One possible direction for future work is to use an estimator to predict rewards for all actions, rather than observing them.

A Domain-Specific Implementation Details
For each domain ALCHEMY, SCENE, and TAN- We compute a sequence of forward hidden states: The backward RNN is equivalent. ENC returns the set $\{h_i\}_{i=1}^{N}$, where

B Data Analysis
We analyze SCONE to identify the frequency of various discourse phenomena in the three domains, including explicit coreference and ellipsis, which is implicit reference to previous entities. We observe references to previous objects (e.g., beakers in ALCHEMY), actions, locations (e.g., positions in SCENE), and world states. We analyze thirty development set interactions for each domain for presence of these references. We define the age of each referent as the number of turns since it was last explicitly mentioned. This illustrates the extent to which this dataset challenges models for context-dependent reasoning.
ALCHEMY Table 6 shows phenomena counts in ALCHEMY. Each interaction contains on average 1.4 references dependent on the interaction history. Each non-first utterance contains on average 0.3 references. The most common form of reference is explicit coreference (Coref.) to previously mentioned beakers, for example mix it. Other references are to previous actions, referring to the action only (e.g., same with the last beaker) or the action as well as the arguments (e.g., same for one more unit, referring to draining one unit from a previously used beaker). Ellipsis occurred four times in the thirty evaluated interactions, for example then, drain 1 unit, implicitly referring to a specific beaker to drain from.
SCENE Table 7 shows phenomena counts in SCENE. Each interaction contains on average 2.4 references dependent on the interaction history. Each non-first utterance contains on average 0.6 references. The most common form of reference is explicit coreference (Coref.) to previously mentioned people, for example he moves to the left end. Coreference also occurs on hat colors (e.g., he gives it back), actions along with their arguments (e.g., they did it again referring to trading specific hats), and positions (e.g., he moves back).
TANGRAMS Table 8 shows phenomena counts in TANGRAMS. Each interaction contains on average 1.7 references dependent on the interaction history. Each non-first utterance contains on average 0.4 references. The most common form of reference is on objects via reference to a previous step, for example put the item you just removed in the second spot. This requires recalling actions taken in previous turns, including the actions' arguments and the previous world state. Coreference (Coref.) also occurs for positions (e.g., ...where the last deleted figure was), actions (e.g., do the same with the second to last figure and one before it), and actions along with the previously used arguments (e.g., repeat the first step). Ellipsis occurs for positions (e.g., add it again, implicitly referring to the item's previous location) and actions along with their arguments (e.g., undo the last step).

C Attention Analysis
Figure 5 shows attention distributions for a handpicked example in ALCHEMY. We show the attention probabilities (α in Section 4) for the current and previous utterances, initial state, and current state throughout execution. In this example, the previous-instruction attention puts most of the weight on brown one during generation, which is the referent of it in the current instruction. The initial and current state attentions are placed heavily on the beaker being manipulated. However, for randomly selected examples, we observe that the attention distribution does not always correspond to intuitions about what should be attended to.
(Glorot and Bengio, 2010), where M and N are the matrix dimensionality. All RNNs are single-layer LSTMs. For the main model, both the instruction encoder and action sequence decoder use a hidden size of 100 in each direction. The action sequence decoder is initialized by first setting the hidden state and cell memory to zero-vectors, and passing in a zero-vector to update the states, after which attention is computed for the first time. For ALCHEMY, the world state encoder has a hidden size of 20. For SCENE, the world state encoder has a hidden size of 5.
Training We apply dropout in three places: (a) in each attention computation after multiplying by W; (b) after computing $h_k$, the input to each decoder step; and (c) for all attention keys except for the current utterance. For POLICYGRADIENT, CONTEXTUALBANDIT, and our approach, we optimize parameters using RMSPROP (Tieleman and Hinton, 2012). For supervised learning, we use ADAM (Kingma and Ba, 2014) for optimization. We use a learning rate of 0.001 for all experiments. Our validation set is a held-out subset containing 7.0% of the training data. We stop training by observing the instruction-level reward on the validation set. We use patience for early stopping. We reset patience to $50 \cdot 1.005^{x}$ the $x$th time the reward has improved on the validation set, decrease it by one each epoch the reward does not improve, and stop when patience runs out. Regardless of patience, we terminate training after 200 epochs. We tune λ, δ, and M on the development set.
, and $b_d$; and the domain-dependent parameters, including the parameters of the encoding function ENC and the action type, first argument, and second argument weights $b_{a_T}$, $b_{a_1}$, and $b_{a_2}$.
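The patience schedule used for early stopping in the training details can be sketched as follows. The function name `epochs_until_stop` and its parameter names are hypothetical, and two details are assumptions not stated in the text: patience starts at the base value of 50 before the first improvement, and patience has run out once it reaches zero or below.

```python
from typing import Sequence


def epochs_until_stop(
    validation_rewards: Sequence[float],
    base: float = 50.0,
    growth: float = 1.005,
    max_epochs: int = 200,
) -> int:
    """Return the number of epochs run before training stops."""
    patience = base  # assumed starting value
    best = float("-inf")
    improvements = 0
    for epoch, reward in enumerate(validation_rewards[:max_epochs], start=1):
        if reward > best:
            best = reward
            improvements += 1
            # Reset patience to 50 * 1.005^x on the x-th improvement.
            patience = base * growth ** improvements
        else:
            patience -= 1  # one unit per epoch without improvement
            if patience <= 0:  # assumed meaning of "runs out"
                return epoch
    # Either the reward sequence ended or the 200-epoch cap was reached.
    return min(len(validation_rewards), max_epochs)
```

Because the reset value grows with each improvement, runs that keep improving are given slightly more slack later in training, while the 200-epoch cap bounds the total cost.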

Figure 4: Instruction-level training accuracy per epoch when training five models on SCENE, demonstrating the effect of randomization in the learning method. Three of five experiments fail to learn effective models. The red and blue learning trajectories are overlapping.
represents a position.
TANGRAMS The world state in TANGRAMS is a list of positions $T = \langle p_1, p_2, \ldots, p_n \rangle$ of a variable length $n$. Each position contains one of five unique shapes. The distance function between states is the edit distance between the lists, with a cost of two for substitutions. The action space of TANGRAMS includes two action types, INSERT and REMOVE. The INSERT action takes two arguments: a position $N \in \{1, \ldots, M\}$, where $M$ is the maximum length of a state in the TANGRAMS dataset, and a shape type T, which is one of the five possible shapes. The REMOVE action takes a single argument: a position N. The transition function $T$ is defined by two cases: (a) $T(s, a = \text{INSERT}\ N\ T)$ returns a state where the shape T is in position N and all objects to its right are shifted by one position if T is not already in the state, otherwise the action is invalid and $s$ is returned; and (b) $T(s, a = \text{REMOVE}\ N)$ returns a state where the object in position N was removed if $N \le n$, otherwise the action is invalid and $s$ is returned. The state encoding function ENC is parameterized by (a) $h_{\mathrm{NULL}}$, a vector used when $n = 0$; (b) $\phi_s$, an embedding function

Figure 5: Example of the attention distributions for executing the instruction It turns completely brown in ALCHEMY. This is the fifth instruction in the interaction. The correct action sequence mixes the chemicals in the sixth beaker by removing the three units and re-adding three brown units. Our model correctly predicts this sequence. We show the different attention distributions when generating this sequence of actions. Clockwise starting from the top left: (a) attention over the current instruction; (b) two attention heads over the initial state; (c) two attention heads over the current world state, which changes following each action; and (d) the attention over the previous instructions in the interaction.

Figures 6, 7, and 8 show examples of attention distributions for three random instructions in the development sets of the three domains where the action sequence was predicted correctly.

Figure 6: Example of attention for a randomly selected instruction from the development set for SCENE. The instruction A person with a blue shirt appears to the left of him is the second in the interaction, following the instruction The person with a red shirt and a blue hat moves to the right end. The correct action sequence consists of a single action, ADD_PERSON 9 B, where a person wearing a blue shirt appears in position 9, to the left of the person in the red shirt. Our model predicts this action correctly. We show the different attention distributions when generating this sequence of a single action. From top to bottom: (a) attention over the current instruction; (b) attention over the previous instruction; and (c) attention over the world state. As the sequence contains a single action only, the current and initial world states are the same, and their distributions are shown together. There are two attention heads over both the initial (top two rows) and current (bottom two rows) world states.

Figure 7: Example of attention for a randomly selected instruction from the development set for ALCHEMY. The instruction executed is Pour green beaker into orange one, the fifth instruction in the sequence. We show the different attention distributions when generating the correct action sequence, which removes green items from the sixth beaker and adds the same number of green items to the beaker containing orange. Clockwise starting from the top left: (a) attention on the current instruction; (b) the two attention heads over the initial state; (c) the two attention heads over the current state as it changes during execution; and (d) attention over previous instructions.

Table 2: Counts of discourse phenomena in SCONE from 30 randomly selected development interactions for each domain.

Table 4: Development results, including model ablations. We also report mean µ and standard deviation σ for all metrics for our approach across five experiments. We bold the best performing variations of our model.

Table 5: Common error counts in the three domains.
GRAMS, we describe the world state representation, state distance function, transition function, and the state encoder. For all states $s$, $s = T(s, \text{STOP})$.
ALCHEMY The world state in ALCHEMY is a sequence of beakers $\langle b_1, b_2, \ldots, b_N \rangle$ of fixed length $N = 7$. Each beaker $b_i = \langle c_{i,1}, c_{i,2}, \ldots, c_{i,|b_i|} \rangle$ is a variable length sequence containing chemical units $c$, each one of six possible colors. The distance between two world states is the sum over distances for each corresponding beaker pair. The distance between two beakers is the edit distance of the list of chemical units in each. The action space of ALCHEMY includes two action types, POP and PUSH. The POP action takes one argument: $N \in \{1, \ldots, N\}$ denoting the beaker to pop a chemical unit from. The PUSH action takes two arguments: N and C, one of six colors. The transition function $T$ is defined by two cases: (a) $T(s, a = \text{PUSH}\ N\ C)$ will return a state where C is added to the beaker with index N; and (b) $T(s, a = \text{POP}\ N)$ will remove the top element from the beaker with index N, or if the beaker with index N is empty, the input state $s$ is returned.
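The ALCHEMY state distance described above can be sketched in Python as a sum of per-beaker edit distances. The function names are hypothetical, and the unit costs for insertion, deletion, and substitution are an assumption, since the text does not specify edit costs for ALCHEMY. The substitution cost is exposed as a parameter because the TANGRAMS distance uses a cost of two for substitutions.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str], substitution_cost: int = 1) -> int:
    """Edit distance between two sequences, computed row by row."""
    previous = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, x in enumerate(a, start=1):
        current = [i]  # deleting the first i items of a
        for j, y in enumerate(b, start=1):
            substitute = previous[j - 1] + (0 if x == y else substitution_cost)
            delete = previous[j] + 1
            insert = current[j - 1] + 1
            current.append(min(substitute, delete, insert))
        previous = current
    return previous[-1]


def alchemy_distance(state_a: List[List[str]], state_b: List[List[str]]) -> int:
    """Sum of edit distances over corresponding beaker pairs."""
    if len(state_a) != len(state_b):
        raise ValueError("states must contain the same number of beakers")
    return sum(edit_distance(x, y) for x, y in zip(state_a, state_b))
```

A distance of zero under this definition coincides with the two states being identical.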
represents a beaker.
SCENE The world state in SCENE is a sequence of positions $S = \langle p_1, p_2, \ldots, p_N \rangle$ of fixed length $N = 10$. Each position is a tuple $p_i = \langle s_i, h_i \rangle$, where $s_i$ is a shirt color and $h_i$ is a hat color. There are six colors, and a special NULL marker indicating no shirt or hat is present. The distance between two world states is the sum over positions of the number of steps required to modify two corresponding positions to be the same given the domain's action space. The action space of SCENE includes four action types: APPEAR_PERSON, APPEAR_HAT, REMOVE_PERSON, and REMOVE_HAT. APPEAR_PERSON and APPEAR_HAT take two arguments: a position index N and a color C. REMOVE_PERSON and REMOVE_HAT take one argument: a position index N. The transition function $T$ is defined by four cases: (a) $T(s, a = \text{APPEAR\_PERSON}\ N\ C)$ returns a state where position N contains shirt color C if the shirt color in position N is NULL, otherwise the action is invalid and the input state $s$ is returned; (b) $T(s, a = \text{APPEAR\_HAT}\ N\ C)$ is defined analogously to APPEAR_PERSON; (c) $T(s, a = \text{REMOVE\_PERSON}\ N)$ returns a state where the shirt color at position N is set to NULL if there is a color at position N, otherwise the action is invalid and the input state $s$ is returned; and (d) $T(s, a = \text{REMOVE\_HAT}\ N)$ is defined analogously to REMOVE_PERSON. The state encoding function ENC is parameterized by (a) $\phi_c$, an embedding function for shirt and hat colors; (b) $\phi_p$, a positional embedding for each position in the scene; and (c) LSTM$_S$, a bidirectional RNN over all positions in order. Each position is embedded using a function φ (

Table 6: Count of phenomena in ALCHEMY.

Table 7: Count of phenomena in SCENE.
for the shapes; and (c) $\phi_p$, a positional embedding of the position $i$. ENC returns a set $\{h_i\}_{i=1}^{n}$, where $h_i = [\phi_p(i); \phi_s(p_i)]$ is the position encoding, or it returns $\{h_{\mathrm{NULL}}\}$ if the state contains no objects.

Table 8: Count of phenomena in TANGRAMS.