Environment-Driven Lexicon Induction for High-Level Instructions

We focus on the task of interpreting complex natural language instructions given to a robot, in which we must ground high-level commands such as "microwave the cup" to low-level actions such as grasping. Previous approaches that learn a lexicon during training have inadequate coverage at test time, and pure search strategies cannot handle the exponential search space. We propose a new hybrid approach that leverages the environment to induce new lexical entries at test time, even for new verbs. Our semantic parsing model jointly reasons about the text, logical forms, and environment over multi-stage instruction sequences. We introduce a new dataset and show that our approach is able to successfully ground new verbs such as distribute, mix, and arrange to complex logical forms, each containing up to four predicates.

As described in the main paper, we collected a dataset D = {(x^(n), e^(n), a^(n), π^(n))}_{n=1}^{500}.

Environment Complexity. Our environments are 3D scenarios consisting of complex objects, such as a fridge, a microwave, and a television, each with many states. These objects can stand in different spatial relations to other objects; for example, a bag of chips can be found behind the television. Figure 1 shows some sample environments from our dataset. As an example of object state, an object of category television has 6 channels, a volume level, and a power status. An object can have different state values in different environments, and different environments consist of different sets of objects and placements. For example, the television might be powered on in one environment and off in another, the microwave might or might not have an object inside it, and so on.
Moreover, there is often more than one object of the same category. For example, our environments typically have two books, five couches, four pillows, etc. Objects of the same category can have different appearances. For example, a book can have the cover of a math book or of the Guinness Book of World Records, resulting in complex object descriptions such as in "throw the sticky stuff in the bowl". Objects can also share the same appearance, making people use spatial relations or other means to describe them, as in "get me the cup next to the microwave". This dataset is significantly more challenging than the 2D navigation dataset or the Windows GUI-action dataset considered in earlier work.
Task Complexity. In this paper, we consider tasks with high-level objectives such as clean the room or prepare the room for movie night, as opposed to navigation or simple manipulation tasks that involve picking and placing objects. This results in extremely free-form text, as shown below:

• "Turn on xbox. Take Far Cry Game CD and put in xbox by pressig eject to open drive. Throw out beer, coke, and sketchy stuff in bowl. Take pillows from shelf and distribute among couches."
• "Boil some water and make some coffee. Find a white bowl. Take ice cream out of the freezer. Put coffee into the white bowl, then put two scoops of ice cream over that. Finally, take the syrup on the counter and drizzle it over the ice cream."
• "If anything is disposable and used, put it in the trash bag."
• "Make some coffee. Make some eggs on the stove and then put them on a plate and serve the eggs and coffee to me."
• "Take Book of Records and place on table with brown book. The TV is already turned off. Throw out open beer and coke. Chips are good."
• "Dump the coffee in the mug in the sink, put all perishable items in the refrigerator, put all the dishes, utensils, and pots in the sink."
• "Turn TV on with remote and find movie (Lincoln). Take bag of chips and place on table. Take pillow from shelf and place on a sofa. Throw away beer and soda, and place Book of Records on shelf with brown book."
• "Mix syrup and water to make a drink.You can get water by rotating the tab near sink.
Use kettle to boil water and mix heated water with instantRamen." we refer the readers to the full dataset for more examples.Noise in the dataset.Our dataset was collected from non-expert users including the action sequences.Therefore, our dataset had considerable noise as is also visible from the examples above.The noise included spelling and grammar errors in the text, text that is asking the robot to do things which it cannot do such as moving the chairs, noise in the action sequences and noise in aligning parts of action sequences and the text segments.
We use a set of rules to remove noise from the dataset, such as removing cyclic patterns in the action sequence. Such patterns often arose when users tried to give a demonstration to the robot, such as placing a mug inside the microwave, but made an error and therefore repeated the actions. We emphasize that the average action-sequence length of 21.5 was computed after removing this noise.
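As an illustration of this cleaning step, the following is a minimal Python sketch, not the exact rule set we used, that collapses immediately repeated sub-sequences of actions, which is one simple way to remove such cyclic patterns:

    def collapse_repeats(actions, max_cycle_len=5):
        # Remove immediately repeated sub-sequences, e.g.
        # [grasp, place, grasp, place, release] -> [grasp, place, release].
        # Illustrative heuristic only; the actual rules may differ.
        out = list(actions)
        changed = True
        while changed:
            changed = False
            for k in range(1, max_cycle_len + 1):
                i = 0
                while i + 2 * k <= len(out):
                    if out[i:i + k] == out[i + k:i + 2 * k]:
                        del out[i + k:i + 2 * k]  # drop the duplicated cycle
                        changed = True
                    else:
                        i += 1
        return out

    print(collapse_repeats(["grasp cup", "open microwave", "grasp cup",
                            "open microwave", "keep cup microwave"]))
    # -> ['grasp cup', 'open microwave', 'keep cup microwave']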
Out of the 500 data points that we collected, we further removed 31 points whose action sequences had length less than 2.

Examples of Planning and Simulation
We use a planner and a simulator, which allows us to use post-conditions in defining our logical forms. In order to perform planning and simulation, we encode the domain knowledge in the STRIPS planning format, which defines the preconditions and effects of each action on the environment. An example is given below:

    (:action release
      :parameters (o)
      :precondition (grasping robot o)
      :effect (not (grasping robot o)))

This STRIPS program defines the action release, which takes an object o as its argument. The precondition of this action is that the robot must be grasping the object o, and the effect is that the robot releases the object o.
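To make the STRIPS semantics concrete, here is a minimal Python sketch (our own illustration, not the planner used in the paper) of how such an action definition can be checked and applied to a set of ground facts:

    # Ground facts are atoms represented as tuples, e.g. ("grasping", "robot", "cup1").

    def applicable(preconditions, facts):
        # An action is applicable when all of its preconditions hold.
        return all(p in facts for p in preconditions)

    def apply_action(add_list, delete_list, facts):
        # Applying an action removes its delete-list atoms and adds its add-list atoms.
        return (facts - set(delete_list)) | set(add_list)

    # The 'release' action from the text, grounded on object cup1:
    pre = [("grasping", "robot", "cup1")]
    add = []
    delete = [("grasping", "robot", "cup1")]  # effect: (not (grasping robot cup1))

    facts = {("grasping", "robot", "cup1")}
    if applicable(pre, facts):
        facts = apply_action(add, delete, facts)
    print(facts)  # set() -- the robot is no longer grasping cup1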

Mapping Object Descriptions
Given an object description ω and a set of physical objects {o_j}_{j=1}^m, we want to find a correlation ρ(ω, o_j) ∈ [0, 1] measuring how well the description ω describes the object o_j. When the description is not a pronoun, we take the following approach. We initialize ρ(ω, o_j) = 0 for all j and then try the following rules in the given order, stopping after the first match (a minimal sketch of this cascade is given at the end of this section):

• category matching: if there exists a set of objects {o_j} containing part of the description in its name, then we define ρ(ω, o_j) = 1 for all such j.
• containment (metonymy): for every object o_j, if the main noun in ω matches the state-name of a state of o_j whose value is True, then we define ρ(ω, o_j) = 1.
• wordnet similarity: for every object o_j, we compute ρ(ω, o_j) using a modified Lesk algorithm based on WordNet. If a similarity score greater than 0.85 is found, then we return.
• domain-specific references: we use the giza-pp algorithm to learn translation probabilities between the text and the corresponding action sequences, using the training data. This gives us a probability table T[word, object-name] over words in the text and object names in the sequences. We then initialize ρ(ω, o_j) by averaging the values T[w, o_j.name] over every word w in ω.

As explained in the paper, we parse conditional expressions into their meaning representations using a set of rules. This is possible, and motivated, both because the conditional expressions in our dataset are simple and because the meaning representations of conditional expressions are not observed in the dataset (which only contains actions corresponding to frame nodes). We parse conditional expressions using deterministic rules; each word in the text can also be ignored, i.e., mapped to the null symbol. These rules are simple enough to be applied in a bottom-up fashion, starting with words. For example, "for 3 minutes" is parsed as:

    minutes → time-unit:min
    3 time-unit:min → time(digit:3, time-unit:min)
    for time(3, time-unit:min) → for(time(digit:3, time-unit:min))

For an "if" condition, which has a true and a false branch, we evaluate the condition using the starting environment. In case of a parsing failure, we always return true.
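The following Python sketch illustrates the rule cascade for ρ described above. The stubbed helpers wordnet_similarity and translation_prob are assumptions standing in for the modified Lesk score and the giza-pp table T, and the object representation is invented for illustration:

    def wordnet_similarity(noun, name):
        # Stub for the modified-Lesk WordNet similarity used in the paper.
        return 1.0 if noun == name else 0.0

    def translation_prob(word, name):
        # Stub for the giza-pp translation table T[word, object-name].
        return 0.3 if word == name else 0.0

    def correlation(words, main_noun, objects):
        # Rule cascade for rho(omega, o_j); stops after the first rule that matches.
        # Each object is a dict: {'name': str, 'true_states': set of state names}.
        rho = {o['name']: 0.0 for o in objects}
        # 1. Category matching: part of the description appears in the object name.
        hits = [o for o in objects if any(w in o['name'] for w in words)]
        if hits:
            for o in hits:
                rho[o['name']] = 1.0
            return rho
        # 2. Containment (metonymy): the main noun names a currently-True state.
        hits = [o for o in objects if main_noun in o['true_states']]
        if hits:
            for o in hits:
                rho[o['name']] = 1.0
            return rho
        # 3. WordNet similarity, accepted only above the 0.85 threshold.
        scores = {o['name']: wordnet_similarity(main_noun, o['name']) for o in objects}
        if max(scores.values()) > 0.85:
            return scores
        # 4. Domain-specific references via averaged translation probabilities.
        for o in objects:
            rho[o['name']] = sum(translation_prob(w, o['name']) for w in words) / len(words)
        return rho

    objs = [{'name': 'cup_1', 'true_states': set()},
            {'name': 'microwave_1', 'true_states': {'water'}}]
    print(correlation(['water'], 'water', objs))  # rule 2 fires for microwave_1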

Feature Equation
We use the following features φ(c_i, z_{i-1}, z_i, e_i), briefly described in the paper. The logical form is given by z_i = ([ν ⇒ (λv.S, ξ)], ξ_i). Here ξ_i and ξ are mappings of the variables v of the parametrized post-condition S. Let v have m variables, and let ξ(v_j) denote the object in e_i to which the variable v_j is mapped under ξ. Further, the post-condition f_i is given by f_i = (λv.S)(ξ_i).

• Language and Logical Form: there are two features of this type, where ρ is the object-description correlation score (see paper). For the f_LE feature, we also consider the previous clause c_{i-1} in the computation of max_ω ρ(ω, ξ_i(v_j)).
• Logical Form: we prefer post-conditions that have high environment priors and are therefore likely to occur again. Let the post-condition f_i = ∧_l f_il be a conjunction of predicates (or their negations) f_il. Also let pm(∧_l f_il) be the parametrized version of the post-condition ∧_l f_il, created by replacing each unique object with a unique variable. For example, the post-condition on(cup_2, bowl_3) ∧ state(cup_2, water) is parametrized to on(v_1, v_2) ∧ state(v_1, water). We capture this property with 4t features, where t denotes the maximum number of predicates that we consider simultaneously when creating the probability tables; in the experiments reported in the paper we took t = 2. The notation ⟨V_i⟩_{i∈C} stands for the average of a quantity V_i, given by (1/|C|) Σ_{i∈C} V_i. The features are f_r^(t) for r ∈ {e-prior, a-prior, ev-prior, av-prior} and t ∈ {1, 2}, each averaging the corresponding prior table value over the predicates (for t = 1) or pairs of predicates (for t = 2) of the parametrized post-condition. The prior tables P_r^(t)(·|·) are created using the training data.
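As an illustration of how such a prior feature can be computed, here is a Python sketch. The table layout, the object-naming convention, and the averaging over predicate subsets are assumptions for illustration; the exact tables and smoothing may differ:

    from itertools import combinations

    def parametrize(postcondition):
        # Replace each unique object with a variable, e.g.
        # [('on','cup_2','bowl_3'), ('state','cup_2','water')]
        #   -> (('on','v1','v2'), ('state','v1','water')).
        varmap, out = {}, []
        for name, *args in postcondition:
            new_args = []
            for a in args:
                if '_' in a:  # crude convention: objects look like cup_2
                    varmap.setdefault(a, 'v%d' % (len(varmap) + 1))
                    new_args.append(varmap[a])
                else:
                    new_args.append(a)  # constants such as 'water' stay put
            out.append((name, *new_args))
        return tuple(out)

    def prior_feature(postcondition, table, t):
        # Average the prior P^(t) over all size-t groups of parametrized predicates.
        pm = parametrize(postcondition)
        groups = list(combinations(pm, t))
        return sum(table.get(g, 0.0) for g in groups) / len(groups)

    post = [('on', 'cup_2', 'bowl_3'), ('state', 'cup_2', 'water')]
    table = {(('on', 'v1', 'v2'),): 0.4}        # toy P^(1) table
    print(prior_feature(post, table, 1))        # (0.4 + 0.0) / 2 = 0.2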
• Logical Form and Environment: as explained in the paper, we introduce the anchored mapping ξ to help in dealing with ellipsis. We therefore add a feature that rewards similarity between the anchored mapping ξ(v_j) of a variable v_j and the new mapping ξ_i(v_j). This is given by f_ee = Σ_j ∆(ξ(v_j), ξ_i(v_j)), where ∆ is a similarity score between the objects ξ(v_j) and ξ_i(v_j), computed as ∆(o_1, o_2) = 0.5 · 1{o_1.category = o_2.category} + 0.5 · (fraction of common state-value pairs).

• Relationship Features: consider all triples (ω_1, ω_2, r) where ω_1, ω_2 ∈ c_i and r is a spatial relationship between them. The relationship feature is given by f_rel = Σ_{(ω_1,ω_2,r)} y_{i,ω_1,ω_2}, where y_{i,ω_1,ω_2} = 1 if the post-condition f_i contains a predicate rel(o_1, o_2) such that o_1, o_2 are the objects referred to by the descriptions ω_1, ω_2 respectively.

• Similarity Feature: this is given by the Jaccard index between the set of words in c_i and the set of words in the anchored lexical entry.

• Transition Probabilities: given a logical form z_{i-1}, we can place priors on the logical form z_i. For example, it is unlikely that a logical form with post-condition f_{i-1} = on(cup_1, counter_2) will be immediately followed by a logical form with post-condition f_i = on(cup_1, counter_1). Further, the logical forms that can occur in the end state (when c_i is the last frame node) are also restricted. We therefore define 3 transition-probability features to capture this.

During inference, we want to generate logical forms z = (ℓ, ξ) for a given lexical entry ℓ = [ν ⇒ (λv.S, ξ)]. However, the number of such logical forms is exponential in the number of variables in v. For practical reasons, we therefore only consider the optimum assignment, given by argmax_ξ φ(z = (ℓ, ξ), ···) · θ. Note that we use slightly different notation from the paper, for reasons of brevity. We convert this assignment problem into an optimization problem and then solve it approximately. To do so, we define 0-1 variables y_ij for i ∈ {1, ..., m} and j ∈ {1, ..., n}, where m is the number of variables in v and n is the number of objects in the given environment e; y_ij = 1 iff variable v_i maps to the object o_j. Using this notation, the features described in Section 5 can be expressed as follows (a small sketch of this encoding is given right below).
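To make the encoding concrete, here is a small NumPy sketch, our own illustration, of representing an assignment as a 0-1 matrix y and reading a linear feature off it:

    import numpy as np

    m, n = 2, 3                      # m variables v_1..v_m, n objects o_1..o_n
    # y[i, j] = 1 iff variable v_{i+1} maps to object o_{j+1}
    y = np.zeros((m, n), dtype=int)
    y[0, 2] = 1                      # v_1 -> o_3
    y[1, 0] = 1                      # v_2 -> o_1

    # Each variable maps to exactly one object: sum_j y[i, j] = 1 for all i.
    assert (y.sum(axis=1) == 1).all()

    # A linear feature such as f_ee = sum_j Delta(xi(v_j), xi_i(v_j)) becomes a
    # dot product with a coefficient matrix D, where D[i, j] holds the Delta
    # score between the anchored object of v_{i+1} and object o_{j+1} (toy values).
    D = np.array([[0.2, 0.5, 1.0],
                  [0.9, 0.1, 0.3]])
    f_ee = (D * y).sum()             # = 1.0 + 0.9 = 1.9
    print(f_ee)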

Logical Form
The environment-prior terms can easily be expressed in a form that is polynomial in y_ij. For example, the feature f_e-prior^(2) for the parametrized post-condition (state v_1 water) ∧ (on v_1 v_2) can be expressed as:

    Σ_{r,s=1}^n P_e-prior^(2)((state o_r water) ∧ (on o_r o_s)) y_{1r} y_{2s}
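Concretely, such a quadratic term is a bilinear form in the assignment variables and can be evaluated directly; a minimal NumPy sketch, with a toy matrix Q standing in for the P_e-prior^(2) table values:

    import numpy as np

    n = 3
    # Q[r, s] = P2_e_prior((state o_r water) and (on o_r o_s)) -- toy values.
    Q = np.array([[0.1, 0.2, 0.0],
                  [0.0, 0.6, 0.1],
                  [0.3, 0.0, 0.2]])

    y1 = np.array([0, 1, 0])   # v_1 -> o_2
    y2 = np.array([0, 0, 1])   # v_2 -> o_3

    term = y1 @ Q @ y2         # = sum_{r,s} Q[r, s] * y1[r] * y2[s] = Q[1, 2] = 0.1
    print(term)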

Logical Form and Environment
Similarly, the f_ee term can be expressed as Σ_{j=1}^m Σ_{k=1}^n ∆(ξ(v_j), o_k) y_{jk}, which is linear in the variables y_jk.

Relationship Feature
For every pair (ω_1, ω_2, r) ∈ c, we find the objects o_{j_1}, o_{j_2} referred to by these descriptions. Let the post-condition f contain atoms f_1, f_2, ···, f_l of the type r(v_1, v_2); then for each such predicate we consider the term y_{1 j_1} y_{2 j_2}.

Transition Probabilities
Transition probabilities are expressed similarly to the environment priors.
Dropping the higher-order terms (which are generally small) and the recall term (to simplify the optimization), we get a quadratic program of the form max_x x^T B x + a^T x subject to linear constraints Px ≤ q, where x stacks the variables y_ij. The linear constraints Px ≤ q consist of y_ij ∈ {0, 1}, Σ_j y_ij = 1, and semantic constraints based on the preconditions given in the planner. For example, for the post-condition on(v_1, v_2), the planner preconditions tell us that v_1 must satisfy IsGraspable(v_1); we therefore add such semantic constraints as inferred from the planner.
In this form, the assignment problem is non-convex and does not necessarily admit a unique solution. While it can be solved by standard solvers such as the AlgLib library, this optimization is quite slow; hence, for practical reasons, we drop the B term and solve the remaining linear program using a fast interior-point solver after relaxing the integrality constraints. The experiments in the paper are reported based on these approximations.
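As an illustration of this final step, here is a minimal sketch of the relaxed linear program using scipy.optimize.linprog; the solver choice, coefficients, and the example graspability constraint are toy assumptions, not the setup from the paper:

    import numpy as np
    from scipy.optimize import linprog

    m, n = 2, 3                     # 2 variables, 3 objects; x stacks y row-major
    a = np.array([0.2, 0.5, 1.0,    # linear feature coefficients for y_1j ...
                  0.9, 0.1, 0.3])   # ... and for y_2j (the B term is dropped)

    # Equality constraints: each variable maps to exactly one object.
    A_eq = np.zeros((m, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0
    b_eq = np.ones(m)

    # Example semantic constraint from the planner: suppose o_2 is not graspable,
    # so v_1 cannot map to it => y_12 = 0, i.e. x[1] <= 0.
    A_ub = np.zeros((1, m * n))
    A_ub[0, 1] = 1.0
    b_ub = np.zeros(1)

    # Relaxation: y_ij in [0, 1]; maximize a.x by minimizing -a.x.
    res = linprog(-a, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, 1)] * (m * n), method="highs")
    print(res.x.reshape(m, n))      # optimum assigns v_1 -> o_3, v_2 -> o_1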

Figure 1: Sample 3D environments that we consider. Environments consist of several objects, and each object can have several states. Different environments have different sets of objects in different configurations. There can be more than one object of the same category.