Language-Guided Adaptive Perception for Efficient Grounded Communication with Robotic Manipulators in Cluttered Environments

The utility of collaborative manipulators for shared tasks is highly dependent on the speed and accuracy of communication between the human and the robot. The run-time of recently developed probabilistic inference models for situated symbol grounding of natural language instructions depends on the complexity of the representation of the environment in which they reason. As we move towards more complex bi-directional interactions, tasks, and environments, we need intelligent perception models that can selectively infer precise pose, semantics, and affordances of the objects when inferring exhaustively detailed world models is inefficient and prohibits real-time interaction with these robots. In this paper we propose a model of language and perception for the problem of adapting the configuration of the robot perception pipeline for tasks where constructing exhaustively detailed models of the environment is inefficient and inconsequential for symbol grounding. We present experimental results from a synthetic corpus of natural language instructions for robot manipulation in example environments. The results demonstrate that by adapting perception we get significant gains in terms of run-time for perception and situated symbol grounding of the language instructions without a loss in the accuracy of the latter.


INTRODUCTION
Perception is a critical component of an intelligence architecture: it converts raw sensor observations into a representation suitable for the task that the robot is to perform. Models of environments vary significantly depending on the application. For example, a robotic manipulator may need to model the objects in its environment with their six degree-of-freedom pose for grasping and dexterous manipulation tasks, whereas a self-driving car may need to model the dynamics of the environment in addition to domain-specific semantics such as stop signs, sidewalks, and pedestrians to navigate safely through the environment.
The ability of robots to perform complex tasks is linked to the richness of the robot's world model. As inferring exhaustively detailed world representations is impractical, it is common to infer representations that are highly specific to the task that the robot is to perform. However, in collaborative domains, as we move towards more complex bi-directional interactions, manipulation tasks, and environments, it becomes unclear how best to represent the environment in order to facilitate planning and reasoning for a wide distribution of tasks. As shown in Figure 1, modeling the affordance between the chips can and its lid would be unnecessary for the task of picking up the mustard sauce bottle, and vice versa. Inferring exhaustively detailed models of all of the objects in the environment is computationally expensive and inconsequential for the individual tasks, and inhibits real-time interaction with these collaborative robots.
The utility of collaborative manipulators is also highly dependent on the speed and accuracy of communication between the human operator and the robot. Natural language interfaces provide an intuitive and multi-resolution means to interact with robots in shared realms. In this work, we propose learning a model of language and perception that can adapt the configuration of the perception pipeline according to the task in order to infer representations that are necessary and sufficient to facilitate planning and grounding for the intended task. For example, the top-right image in Figure 1 shows the adaptively inferred world model for the instruction "pick up the leftmost blue gear", which differs from the one inferred for the instruction "pick up the largest red object".
Figure 1: On the left is an image showing the Baxter Research Robot in a cluttered tabletop environment in the context of collaborative human-robot tasks. A perception system that does not use the context of the instruction when interpreting the observations would inefficiently construct a detailed world model that is only partially utilized by the symbol grounding algorithm. On the right are the adaptively inferred representations using our proposed language perception model for the instructions "pick up the leftmost blue gear" and "pick up the largest red object", respectively.

BACKGROUND
The algorithms and models presented in this paper span topics that include robot perception and natural language understanding for human-robot interaction. Perception is a central problem in the field of situated robotics. Consequently, a great deal of research has focused on developing representations that can facilitate planning and reasoning for highly specific situated tasks. These representations vary significantly depending on the application, from two-dimensional costmaps (Elfes, 1987), volumetric 3D voxel representations (Hornung et al., 2013, 2010), and primitive-shape-based object approximations (Miller et al., 2003; Huebner and Kragic, 2008) to richer representations that model high-level semantic properties (Galindo et al., 2005; Pronobis and Jensfelt, 2012), the 6 DOF pose of the objects of interest (Hudson et al., 2012), or affordances between objects (Daniele et al., 2017). Since inferring exhaustively detailed world models is impractical, one solution is to design perception pipelines that infer task-relevant world models (Eppner et al., 2016; Fallon et al., 2014). Inferring efficient models that can support reasoning and planning for a wide distribution of tasks remains an open research question.
Natural language interfaces provide an intuitive and multi-resolution means to interact with collaborative robots. Contemporary models (Tellex et al., 2011; Boularias et al., 2015; Matuszek et al., 2013) frame the problem of language understanding as a symbol grounding problem (Harnad, 1990), specifically one of inferring correspondences between the linguistic constituents of the instruction and the symbols that represent perceived entities in the robot's environment, such as objects and regions, or desired actions that the robot can take. This problem has been framed as one of inference in a probabilistic graphical model called a Distributed Correspondence Graph (DCG). This model leverages the hierarchical structure of the syntactically parsed instruction and conditional independence assumptions across constituents of a discrete symbol space to improve the run-time of probabilistic inference. Other variations include the Hierarchical DCG (Propp et al., 2015) and the Adaptive DCG (Paul et al., 2016), which further improve run-time performance in cluttered environments with known environment models. Recently, these models have been used to augment perception and representations. Daniele et al. (2017) use the DCG to supplement perception with linguistic information for efficiently inferring kinematic models of articulated objects. Duvallet et al. (2014) and Hemachandra et al. (2015) use the DCG to augment representations by exploiting information in the language instruction to build priors over the unknown parts of the world. A limitation of current applications of probabilistic graphical models for natural language symbol grounding is that they do not consider how to efficiently convert observations or measurements into a sufficiently detailed representation suitable for inference. We propose to use the DCG for the problem of adapting perception pipelines to infer task-optimal representations.
Our work is most closely related to that of Matuszek et al. (2013). Their work presents an approach for jointly learning the language and perception models for grounded attribute learning. Their model infers the subset of objects, based on color and shape, that satisfy the attributes described in the natural language description. Similarly, Hu et al. (2016) propose a deep-learning-based approach to directly segment the objects in RGB images that are described by the instruction. We differentiate our approach by expanding the diversity and complexity of perceptual classifiers, enabling verbs to modify object representations, and presenting an end-to-end approach to representation adaptation and symbol grounding using computationally efficient probabilistic graphical models. In the following sections we introduce our approach to adapting perception pipelines, define our experiments, and present results against a suitable baseline.

TECHNICAL APPROACH
We describe the problem of understanding natural language instructions as one of probabilistic inference, where we infer a distribution of symbols that express the intent of the utterance. The meaning of the instruction is taken in the context of a symbolic representation ($\Gamma$), observations ($z_t$), and a representation of the language used to describe the instruction ($\Lambda$). Probabilistic inference using a symbolic representation that is described by the space of trajectories $X(t)$ that the robot may take has the form:
$$x(t)^{*} = \arg\max_{x(t) \in X(t)} p(x(t) \mid \Lambda, z_t) \tag{1}$$
Solving this inference problem is computationally intractable when the space of possible trajectories is large. Contemporary approaches (Tellex et al., 2011) frame this problem as a symbol grounding problem, i.e. inferring the most likely set of groundings ($\Gamma^{s*}$) given a syntactically parsed instruction $\Lambda = \{\lambda_1, \ldots, \lambda_m\}$ and the world model $\Upsilon$:
$$\Gamma^{s*} = \arg\max_{\gamma_1, \ldots, \gamma_n \in \Gamma^{s}} p(\gamma_1, \ldots, \gamma_n \mid \Lambda, \Upsilon) \tag{2}$$
Here, the world model $\Upsilon$ is a function of the constructs of the robot's perception pipeline ($P$) and the raw observations $z_t$:
$$\Upsilon = f(P, z_t) \tag{3}$$
The groundings $\Gamma^{s}$ are symbols that represent objects, their semantic properties, regions derived from the world model, and robot actions and goals such as grasping the object of interest or navigating to a specific region in the environment. The set of all groundings $\Gamma^{s} = \{\gamma_1, \gamma_2, \ldots, \gamma_n\}$ is called the symbol space. The symbol space thus forms a finite space of interpretations in which the instruction will be grounded. The DCG is a probabilistic graphical model of the form described in equation 2. The model relates the linguistic components $\lambda_i \in \Lambda$ to the groundings $\gamma_j \in \Gamma^{s}$ through the binary correspondence variables $\phi_{ij} \in \Phi$. The DCG facilitates inferring the groundings at a parent phrase in the context of the groundings at its child phrases $\Phi_{c_i}$. Formally, the DCG searches for the most likely correspondence variables $\Phi^{*}$ in the context of the groundings $\gamma_{ij}$, phrases $\lambda_i$, child correspondences $\Phi_{c_i}$, and the world model $\Upsilon$ by maximizing the product of individual factors:
$$\Phi^{*} = \arg\max_{\phi_{ij} \in \Phi} \prod_{i=1}^{|\Lambda|} \prod_{j=1}^{|\Gamma^{s}|} p(\phi_{ij} \mid \gamma_{ij}, \lambda_i, \Phi_{c_i}, \Upsilon) \tag{4}$$
The inferred correspondence variables $\Phi^{*}$ represent the expression of the most likely groundings $\Gamma^{s*}$. The factors in equation 4 are approximated by log-linear models $\Psi$:
$$\Phi^{*} = \arg\max_{\phi_{ij} \in \Phi} \prod_{i=1}^{|\Lambda|} \prod_{j=1}^{|\Gamma^{s}|} \Psi(\phi_{ij}, \gamma_{ij}, \lambda_i, \Phi_{c_i}, \Upsilon) \tag{5}$$
Model training involves learning the log-linear factors from labeled data relating phrases to their true groundings. Inference involves searching for the set of correspondence variables that satisfy the above equation. The run-time of probabilistic inference with the DCG is positively correlated with the complexity of the world model $\Upsilon$, because the size of the symbolic representation $\Gamma^{s}$ increases with the number of objects in the environment representation. Recognizing that some objects (and the symbols based on those objects) are inconsequential to the meaning of the instruction, we consider the optimal representation of the environment $\Upsilon^{*}$ to be one that is necessary and sufficient to solve equation 5. Replacing $\Upsilon$ with $\Upsilon^{*}$ in equation 5 gives:
$$\Phi^{*} = \arg\max_{\phi_{ij} \in \Phi} \prod_{i=1}^{|\Lambda|} \prod_{j=1}^{|\Gamma^{s}|} \Psi(\phi_{ij}, \gamma_{ij}, \lambda_i, \Phi_{c_i}, \Upsilon^{*}) \tag{6}$$
We hypothesize that the time to solve equation 6 will be less than the time to solve equation 5.
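The log-linear factors and the search over correspondence variables described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the feature function `toy_features`, the weight table, and the symbol names are hypothetical, and the toy features ignore the child groundings that a real DCG factor would use. Because each correspondence variable is binary and the factors are treated as conditionally independent, the sketch selects each variable by comparing its normalized log-linear score against 0.5, and it counts factor evaluations to show how the work scales with the size of the symbol space.

```python
import math


def toy_features(phi, gamma, phrase, child_groundings):
    """Hypothetical binary features: one per (word, symbol, phi) triple.

    A real factor would also use child_groundings; this toy version ignores them.
    """
    return [f"{word}|{gamma}|{phi}" for word in phrase.split()]


def log_linear_factor(phi, gamma, phrase, child_groundings, weights):
    """Normalized log-linear score of phi in {True, False} for one factor."""
    def score(value):
        feats = toy_features(value, gamma, phrase, child_groundings)
        return sum(weights.get(f, 0.0) for f in feats)

    normalizer = math.exp(score(True)) + math.exp(score(False))
    return math.exp(score(phi)) / normalizer


def infer_correspondences(phrases, symbols, weights):
    """Pick the most likely value of every correspondence variable.

    phrases are ordered from child to parent, so each phrase is grounded in
    the context of the groundings expressed by the previous (child) phrase.
    """
    expressed, evaluations, child = {}, 0, ()
    for phrase in phrases:
        active = []
        for gamma in symbols:
            p_true = log_linear_factor(True, gamma, phrase, child, weights)
            evaluations += 1
            if p_true > 0.5:  # phi_ij = True is the more likely value
                active.append(gamma)
        expressed[phrase] = active
        child = tuple(active)
    return expressed, evaluations


# Hypothetical weights, as if learned from annotated phrase-grounding pairs.
weights = {
    "blue|blue_gear|True": 2.0,
    "gear|blue_gear|True": 2.0,
    "blue|red_ball|True": -2.0,
    "gear|red_ball|True": -2.0,
}
phrases = ["the blue gear", "pick up the blue gear"]

full_world = ["blue_gear", "red_ball", "chips_can", "mustard_bottle"]
pruned_world = ["blue_gear"]

full_result, full_cost = infer_correspondences(phrases, full_world, weights)
pruned_result, pruned_cost = infer_correspondences(phrases, pruned_world, weights)
print(full_result["pick up the blue gear"], full_cost)
print(pruned_result["pick up the blue gear"], pruned_cost)
```

In this toy setting both symbol spaces yield the same grounding, but the pruned space needs one factor evaluation per phrase instead of four, which mirrors the hypothesis that a necessary and sufficient world model reduces inference time without changing the result.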
Typically the environment model $\Upsilon$ is computed by a perception module $P$ from a set of observations $z_{1:t} = \{z_1, \ldots, z_t\}$. In cluttered environments we assume that inferring an exhaustively detailed representation of the world that satisfies all possible instructions is impractical for real-time human-robot interaction. We propose using language as a means to guide the generation of these necessary and sufficient environment representations $\Upsilon^{*}$, in turn making perception a task-adaptive process. Thus we define $\Upsilon^{*}$ inferred from a single observation as:
$$\Upsilon^{*} = f(P, z_t, \Lambda) \tag{7}$$
where $P$ denotes the perception pipeline of the robotic intelligence architecture. We adapt the DCG to model the above function by creating a novel class of symbols called perceptual symbols $\Gamma^{P}$. Perceptual symbols are tied to their corresponding elements in the perception pipeline, i.e. to the vision algorithms. Since this grounding space is independent of the world model $\Upsilon$, the random variable used to represent the environment is removed from equation 5. We add a subscript $p$ to denote that we are reasoning in the perceptual grounding space:
$$\Phi_{p}^{*} = \arg\max_{\phi_{ij} \in \Phi_{p}} \prod_{i=1}^{|\Lambda|} \prod_{j=1}^{|\Gamma^{P}|} \Psi(\phi_{ij}, \gamma_{ij}, \lambda_i, \Phi_{c_i}) \tag{8}$$
Equation 8 represents the proposed model, which we refer to as the language-perception model (LPM). It infers the symbols that inform the perception pipeline configuration given a natural language instruction describing the task. The space of symbols $\Gamma^{P}$ describes all possible configurations of the perception pipeline. For example, as shown in Figure 1, for the instruction "pick up the leftmost blue gear", we may need elements in our pipeline that can detect blue objects and gears. Detecting green objects, spherical shapes, or the six-dimensional pose of the chips can would not be necessary to generate the symbols that the robot needs to perform the instruction.
We assume that the perception pipeline ($P$) is populated with a set of elements $E = \{E_1, \ldots, E_n\}$ such that each subset $E_i \in E$ represents a set of algorithms that are responsible for inferring a specific property of an object. For example, a red color-detection algorithm would be a member of the color detector family responsible for inferring the semantic property "color" of the object, while a six degree-of-freedom (DOF) pose detection algorithm would be a member of the pose detector family. More generally, $E$ can be defined as $E = \{e_1, e_2, \ldots, e_m\}$. With these assumptions, we define our independent perceptual symbols as:
$$\Gamma^{P}_{ID} = \{\gamma_{e_i} \mid e_i \in E\} \tag{9}$$
These symbols are useful for grounding simple phrases such as "the red object" or "the ball", where the phrase refers to a single property of the object. More complicated phrases such as "the red ball" or "the blue box" contain a joint expression of properties, i.e. we are looking for objects that maximize the joint likelihood $p(\text{red}, \text{sphere} \mid o)$. Since these properties are independent, we can infer them separately for every object $o_k \in O$. However, we can also factor the joint likelihood as $p(\text{red}, \text{sphere} \mid o) = p(\text{red} \mid o)\, p(\text{sphere} \mid \text{red}, o)$. This factorization allows us to condition the evaluation of sphere detection on only the subset of objects that were classified as red by the red detector. To add this degree of freedom to the construction of the perception pipeline, we define an additional set of symbols, which we refer to as conditionally dependent perceptual symbols:
$$\Gamma^{P}_{CD} = \{\gamma_{e_i, e_j} \mid e_i, e_j \in E,\; i \neq j\} \tag{10}$$
The symbol $\gamma_{e_i, e_j}$ expresses running the element $e_i$ from the perception pipeline on the subset of objects that were classified positive by the element $e_j$. Finally, the complete perceptual symbol space is:
$$\Gamma^{P} = \Gamma^{P}_{ID} \cup \Gamma^{P}_{CD} \tag{11}$$
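The difference between independent and conditionally dependent perceptual symbols can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the pipeline used in the paper: the scene, the predicate-based detectors, and the `CountingDetector` wrapper are hypothetical, and the sketch assumes that a conditional symbol always pairs two different elements. A symbol is represented as a tuple: `(e_i,)` runs one element on every object, while `(e_i, e_j)` runs `e_i` only on the objects accepted by `e_j`.

```python
from itertools import permutations


def independent_symbols(elements):
    """One symbol per pipeline element: run e_i on every object."""
    return [(e,) for e in elements]


def conditional_symbols(elements):
    """One symbol per ordered pair (e_i, e_j), i != j (assumed here):
    run e_i only on objects that e_j classified as positive."""
    return list(permutations(elements, 2))


class CountingDetector:
    """Wraps a predicate and counts how many objects it was evaluated on."""

    def __init__(self, predicate):
        self.predicate = predicate
        self.calls = 0

    def __call__(self, obj):
        self.calls += 1
        return self.predicate(obj)


def run_symbol(symbol, objects, detectors):
    """Execute one perceptual symbol and return the objects it accepts."""
    if len(symbol) == 1:
        return [o for o in objects if detectors[symbol[0]](o)]
    e_i, e_j = symbol
    accepted_by_j = [o for o in objects if detectors[e_j](o)]
    return [o for o in accepted_by_j if detectors[e_i](o)]


# Hypothetical scene: three red objects (one a sphere) and two blue ones.
scene = [
    {"name": "ball", "color": "red", "shape": "sphere"},
    {"name": "block", "color": "red", "shape": "cuboid"},
    {"name": "can", "color": "red", "shape": "cylinder"},
    {"name": "marble", "color": "blue", "shape": "sphere"},
    {"name": "gear", "color": "blue", "shape": "cylinder"},
]
detectors = {
    "red": CountingDetector(lambda o: o["color"] == "red"),
    "sphere": CountingDetector(lambda o: o["shape"] == "sphere"),
}

elements = list(detectors)
symbol_space = independent_symbols(elements) + conditional_symbols(elements)

red_spheres = run_symbol(("sphere", "red"), scene, detectors)
print([o["name"] for o in red_spheres])
print(detectors["red"].calls, detectors["sphere"].calls)
```

With this scene, the red detector is evaluated on all five objects, while the sphere detector is evaluated only on the three red ones. When the conditioned element is the expensive one (for example, geometry fitting or pose estimation), this ordering is where the run-time saving would come from.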

EXPERIMENTAL DESIGN
Our experiments demonstrate the utility of our language perception model for the task of grounded language understanding of manipulation instructions. As shown in Figure 3, the process involves two distinct inferences: inferring the perceptual groundings given a language instruction (eq. 8), and inferring high-level motion planning constraints given the language and the generated world model (eq. 5 and eq. 6). In this section we describe our assumptions and define the distinct symbolic representations used in our experiments for each of the above tasks. We then discuss our instruction corpus and the details of the individual experiments.

For our experiments a Rethink Robotics Baxter Research Robot is placed behind a table. The robot is assumed to perceive the environment using a head-mounted RGB-D sensor. The robot's workspace is populated with objects from the standard YCB dataset (Berk Calli, 2017), custom 3D-printed ABS plastic objects, and multicolored rubber blocks. We define the world complexity as the number of objects present on the table in the robot's field of view. The world complexity ranges from 15 to 20 in our experiments.

Symbolic Representation
The symbolic representation defines the space of symbols or meanings in which the natural language instruction will be grounded or understood. As mentioned before, we define two distinct sets of symbols in our experiments. $\Gamma^{P}$ defines the set of perceptual symbols used by the language perception model, and $\Gamma^{S}$ defines the set of symbols used by the symbol grounding model. $\Gamma^{P}$ is a function of the elements $E$ of the perception pipeline. The elements $e_i \in E$ in our perception pipeline are selected such that they can model the robot's environment with a spectrum of semantic and metric properties that are necessary for symbol grounding and planning for all of the instructions in our corpus. In our experiments we define $E$ as:
$$E = C \cup G \cup L \cup B \cup R \cup P \tag{12}$$
Here, $C$ is a set of color detectors, $G$ is a set of geometry detectors, $L$ is a set of object label detectors, $B$ is a set of bounding box detectors, $R$ is a set of region detectors, and $P$ is a set of pose detectors.
The detectors in each set are indexed by the following values: color = {red, green, blue, white, yellow, orange}, geometry = {sphere, cylinder, cuboid}, label = {crackers box, chips can, pudding box, master chef can, bleach cleanser, soccer ball, mustard sauce bottle, sugar packet}, bbox = {non-oriented, oriented}, region = {left, right, center}, pose = {3 DOF, 6 DOF}. Given these perception elements, the independent perceptual groundings ($\Gamma^{P}_{ID}$, equation 9) and the conditionally dependent perceptual groundings ($\Gamma^{P}_{CD}$, equation 10) are instantiated over the elements of $E$. These symbols give us the ability to selectively infer desired properties in the world. Together, the independent and conditionally dependent symbols cover the complete space of perceptual symbols used by the LPM:
$$\Gamma^{P} = \Gamma^{P}_{ID} \cup \Gamma^{P}_{CD} \tag{16}$$
The algorithmic details of the perception elements are as follows. A single RGB point cloud is fed to the pipeline as the raw sensor observation. A RANSAC-based (Fischler and Bolles, 1981) 3D plane detection technique is used for segmenting the table-top and the objects. The HSV colorspace is used for detecting colors. RANSAC-based model fitting algorithms form the core of the geometry detectors. A 4-layer (256-128-64-32) feed-forward neural network is trained to infer the semantic labels of the objects. It takes in a 32 x 32 RGB image and infers a distribution over 8 unique YCB object classes. A PCA-based oriented bounding box estimation algorithm is used to approximate the 6 DOF pose of the individual objects. The algorithms are implemented using OpenCV and the PCL library (Rusu and Cousins, 2011).
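The text states only that colors are detected in the HSV colorspace; it does not give thresholds or implementation details. As a rough illustration of what one such color element could look like, the following Python sketch classifies the mean color of a segmented object into the six color names used in the experiments. The hue bands, the saturation and value cut-offs, and the function names are all assumptions made for this example, and the sketch uses the standard-library `colorsys` module rather than OpenCV.

```python
import colorsys

# Hypothetical hue bands in degrees; a real detector would tune these to the
# sensor and lighting. Red wraps around 0/360 and is handled separately.
HUE_BANDS = {
    "orange": (15.0, 45.0),
    "yellow": (45.0, 75.0),
    "green": (75.0, 165.0),
    "blue": (195.0, 265.0),
}


def classify_color(rgb, min_saturation=0.2, min_value=0.2):
    """Map a mean 8-bit RGB triple to one of the six color names, or None."""
    r, g, b = (channel / 255.0 for channel in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    hue = h * 360.0
    if s < min_saturation:
        # Low saturation: only bright colors are called white.
        return "white" if v > 0.8 else None
    if v < min_value:
        return None  # too dark to name reliably
    if hue < 15.0 or hue >= 345.0:
        return "red"
    for name, (low, high) in HUE_BANDS.items():
        if low <= hue < high:
            return name
    return None


def mean_rgb(pixels):
    """Average a list of (r, g, b) pixels belonging to one segmented object."""
    n = len(pixels)
    return tuple(sum(p[i] for p in pixels) / n for i in range(3))


def color_detector(target):
    """Build a binary detector for one color, matching one pipeline element."""
    return lambda pixels: classify_color(mean_rgb(pixels)) == target


is_blue = color_detector("blue")
print(is_blue([(10, 20, 200), (0, 40, 220)]))
print(classify_color((255, 165, 0)))
```

Each call to `color_detector` yields one binary element, so the six color names correspond to six elements of the color detector set. Adaptive perception would instantiate only those that the language-perception model selects for the instruction.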
The space of symbols for the symbol grounding model is similar to the representation defined in Paul et al. (2016). This space uses symbols to represent objects in the world model ($\Gamma^{O}$), semantic object labels ($\Gamma^{L}$), object color ($\Gamma^{C}$), object geometry ($\Gamma^{G}$), regions in the world ($\Gamma^{R}$), spatial relationships ($\Gamma^{SR}$), and finally high-level planning constraints that define the end goal ($\Gamma^{PC}$). The inferred constraints form an input to a planning algorithm that can then generate trajectories to accomplish the desired task. Thus the complete symbol space for the symbol grounding model is:
$$\Gamma^{S} = \Gamma^{O} \cup \Gamma^{L} \cup \Gamma^{C} \cup \Gamma^{G} \cup \Gamma^{R} \cup \Gamma^{SR} \cup \Gamma^{PC} \tag{17}$$

Corpus
To train and test the system, we generate an instruction corpus using linguistic patterns similar to those described in Paul et al. (2016). The corpus used in our experiments consists of 100 unique natural language instructions. Details of the grammar extracted from this corpus are described in the appendix. Each instruction describes a manipulation command to the robot while referring to the objects of interest using their semantic or metric properties, e.g. "pick up the green cup" or "pick up the biggest blue object". If multiple instances of the same object are present in the robot's workspace, reference resolution is achieved by using spatial relationships to describe the object of interest, e.g. "the leftmost blue cube" or "rightmost red object".
As shown in Figure 2, the instructions in the corpus are in the form of syntactically parsed trees. Each instruction is generated in the context of a specific table-top object arrangement. Thus each instruction is associated with an RGB-D image pair. A total of 10 unique table-top arrangements are used to generate the set of 100 instructions.
One copy of the corpus is annotated for training the LPM using $\Gamma^{P}$, and another for training the symbol grounding model using $\Gamma^{S}$. The annotations for the LPM corpus are selected such that the perception pipelines configured using the annotated groundings generate the optimal world representations that are necessary and sufficient to support grounding and planning for the given tasks.
The instructions in our corpus vary in complexity. From the perception point of view, instruction complexity is quantified as the total number of perceptual groundings expressed at the root level. For example, "pick up the ball" is a relatively simple instruction with only a single grounding expressed at the root level, while "pick up the blue cube and put the blue cube near the crackers box" is a more complicated instruction with seven groundings expressed at the root level. This number varies from one to seven in our corpus.

Experiments and Metrics
We structure our experiments to validate two claims. The first claim is that adaptively inferring the task-optimal representations reduces the perception run-time by avoiding exhaustively detailed, uniform modeling of the world. The second claim is that reasoning in the context of these optimal representations also reduces the inference run-time of the symbol grounding model. An outline of our experiments is illustrated in Figure 3. In the first experiment, we study the root-level inference accuracy of the LPM (groundings expressed at the root level of the phrase) as a function of a gradual increase in the training fraction. For each value of the training fraction in the range [0.2, 0.9], increasing with a step of 0.1, we perform 15 validation experiments. The training data is sampled randomly for every individual experiment. Additionally, we perform a leave-one-out cross validation experiment. We use the inferences generated by the leave-one-out cross validation experiments as inputs to drive the adaptive perception for each instruction.
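The sampling protocol of the first experiment can be written down as a short Python sketch. It only generates the train/test index splits under the stated numbers (100 instructions, training fractions from 0.2 to 0.9 in steps of 0.1, 15 random trials per fraction, plus leave-one-out). The function names and the fixed random seed are assumptions for this example, and model training and accuracy evaluation are deliberately left out.

```python
import random


def fraction_splits(n_items, fractions, trials, seed=0):
    """Yield (fraction, trial, train_ids, test_ids) with random training sets."""
    rng = random.Random(seed)
    ids = list(range(n_items))
    for fraction in fractions:
        n_train = round(fraction * n_items)
        for trial in range(trials):
            train = set(rng.sample(ids, n_train))
            test = [i for i in ids if i not in train]
            yield fraction, trial, sorted(train), test


def leave_one_out_splits(n_items):
    """Yield (train_ids, test_ids) holding out each instruction once."""
    for held_out in range(n_items):
        train = [i for i in range(n_items) if i != held_out]
        yield train, [held_out]


N_INSTRUCTIONS = 100  # corpus size reported in the text
FRACTIONS = [round(0.2 + 0.1 * step, 1) for step in range(8)]  # 0.2 ... 0.9
TRIALS = 15

sweep = list(fraction_splits(N_INSTRUCTIONS, FRACTIONS, TRIALS))
loo = list(leave_one_out_splits(N_INSTRUCTIONS))
print(len(sweep), len(loo))
print(len(loo[0][0]) / N_INSTRUCTIONS)
```

The sweep produces 8 x 15 = 120 random splits, and each leave-one-out split trains on 99 of the 100 instructions, which corresponds to a training fraction of 0.99.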
In the second experiment, we compare the cumulative run-time of LPM inference (eq. 8) and adaptive perception ($T_1 + T_2$) against the run-time for complete perception ($T_4$), our baseline, for increasingly complex worlds.
In the third experiment, we compare the inference time of the symbol grounding model when reasoning in the context of the adaptively generated optimal world models ($T_3$, eq. 6) against the inference time of the same model when reasoning in the context of the complete world models ($T_5$, eq. 5). We also check whether the planning constraints inferred in both cases match the ground truth. Experiments are performed on a system running a 2.2 GHz Intel Core i7 CPU with 16 GB RAM.

RESULTS
This section presents the results of the three experiments described above: the learning characteristics of the LPM, the impact of the LPM on the perception run-time, and the impact of the adaptive representations on the symbol grounding run-time.
The leftmost graph in Figure 4 shows the results of the first experiment. The inference accuracy grows as the training fraction increases. A growing trend is an indicator of the language diversity in the corpus. Mean inference accuracy starts at 39.25% ± 5 for a training fraction of k = 0.2 and reaches 84% for the leave-one-out cross validation experiment (k = 0.99).
The middle graph in Figure 4 shows the result of the second experiment. The run-time for complete perception grows with the world complexity, while the run-time of adaptive perception stays nearly flat and is significantly lower in all cases. Since the adaptive perception run-time varies according to the task, its error bars are larger. The drop in the complete perception run-time at a world complexity of 20 arises because the run-time of our geometry detection algorithm is proportional to the size of the individual objects, and all of the objects in that example world were smaller than those in the other examples.

The rightmost graph in Figure 4 shows the result of the third experiment. The symbol grounding run-time when reasoning in the context of detailed world models ($\Upsilon$) grows as a function of the world complexity. However, it is significantly lower when reasoning in the context of adaptively generated world models ($\Upsilon^{*}$) and is independent of the world complexity.
The achieved run-time gains are meaningful
Figure 4: The graph on the left shows the LPM inference accuracy as a function of a gradual increase in the training fraction. In the middle is the bar chart comparing the run-time for complete perception ($T_4$) against the cumulative run-time of LPM inference and adaptive perception ($T_1 + T_2$). Finally, on the right is a bar chart comparing the run-time of symbol grounding when reasoning in the context of the adaptively generated optimal representations ($T_3$) against the run-time when reasoning in the context of exhaustively detailed world models ($T_5$). The error bars indicate 95% confidence intervals.
Table 3: Impact of LPM on average perception run-time per instruction ($T_P$), average symbol grounding run-time per instruction ($T_{SG}$), and the symbol grounding accuracy.

CONCLUSIONS
Real-time human-robot interaction is critical for the utility of collaborative robotic manipulators in shared tasks. In scenarios where inferring exhaustively detailed models of all the objects is prohibitive, perception represents a bottleneck that inhibits real-time interactions with collaborative robots. Language provides an intuitive and multi-resolution interface to interact with these robots. While recent probabilistic frameworks have advanced our ability to interpret the meaning of complex instructions in cluttered environments, the problem of how language can channel the interpretation of raw observations to construct world models that are necessary and sufficient for the symbol grounding task has not been extensively studied. Our proposed DCG-based Language Perception Model demonstrates that we can guide perception using language to construct world models that are suitable for efficiently interpreting the instruction. This provides run-time gains in both perception and symbol grounding, thereby improving the speed with which collaborative robots can understand and act upon human instructions. In ongoing and future work we are exploring how language can aid the efficient construction of global maps for robot navigation and manipulation by intelligently sampling relevant observations from a set of observations.

A Grammar and Lexicon of the Corpus
We list the grammar rules and the lexicon for our corpus to demonstrate the diversity of the instructions. Following