Guiding Interaction Behaviors for Multi-modal Grounded Language Learning

Multi-modal grounded language learning connects language predicates to physical properties of objects in the world. Sensing with multiple modalities, such as audio, haptics, and visual colors and shapes, while performing interaction behaviors like lifting, dropping, and looking at objects enables a robot to ground non-visual predicates like “empty” as well as visual predicates like “red”. Previous work has established that grounding in multi-modal space improves performance on object retrieval from human descriptions. In this work, we gather behavior annotations from humans and demonstrate that these improve language grounding performance by allowing a system to focus on relevant behaviors for words like “white” or “half-full” that can be understood by looking or lifting, respectively. We also explore adding modality annotations (whether to focus on audio or haptics when performing a behavior), which improves performance, and sharing information between linguistically related predicates (if “green” is a color, “white” is a color), which improves grounding recall but at the cost of precision.


Introduction
Connecting human language predicates like "red" and "heavy" to machine perception is part of the symbol grounding problem (Harnad, 1990), approached in machine learning as grounded language learning. For many years, grounded language learning has been performed primarily in visual space (Roy and Pentland, 2002; Liu et al., 2014; Malinowski and Fritz, 2014; Mohan et al., 2013; Sun et al., 2013; Dindo and Zambuto, 2010; Vogel et al., 2010). Recently, researchers have explored grounding in audio (Kiela and Clark, 2015), haptic (Alomari et al., 2017), and multimodal (Thomason et al., 2016) spaces. Multimodal grounding allows a system to connect language predicates like "rattles", "empty", and "red" to their audio, haptic, and color signatures, respectively.
Past work has used human-robot interaction to gather language predicate labels for objects in the world (Parde et al., 2015;Thomason et al., 2016). Using only human-robot interaction to gather labels, a system needs to learn effectively from only a few examples. Gathering audio and haptic perceptual information requires doing more than looking at each object. In past work, multiple interaction behaviors are used to explore objects and add this audio and haptic information (Sinapov et al., 2014).
In this work, we gather annotations on what exploratory behaviors humans would perform to determine whether language predicates apply to a novel object. A robot could gather such information by asking human users which action would best allow it to test a particular property, e.g. "To tell whether something is 'heavy' should I look at it or pick it up?" Figure 1 shows some of the behaviors used by our robot in previous work to perceive objects and their properties. In this paper, we show that providing a language grounding system with behavior annotation information improves classification performance on whether predicates apply to objects, despite having sparse predicate-object labels.
We additionally explore adding modality annotations (e.g., whether a predicate is more auditory or more haptic), drawing on previous work in psychology that gathered modality norms for many words (Lynott and Connell, 2009). Finally, we explore using word embeddings to help with infrequently seen predicates by sharing information with more common ones (e.g., if "thin" is common and "narrow" is rare, we can exploit the fact that they are linguistically related to help understand the latter).

Figure 1: Behaviors the robot used to explore objects: grasp, lift, lower, drop, press, and push. In addition, the hold behavior (not shown) was performed after the lift behavior by holding the object in place for half a second. The look behavior (not shown) was also performed for all objects.

Dataset and Methodology
Previous work provides sparse annotations of 32 household objects (Figure 2) with language predicates derived during an interactive "I Spy" game with human users (Thomason et al., 2016). Each predicate p ∈ P from that work is associated with objects as applying or not applying, based on dialog with human users. For example, the predicate "red" applies to several objects and not to others, but for many objects its label is not explicitly known. Objects are represented by features gathered during several interaction behaviors (Figure 1), as detailed in past work. In this work, we focus on improving the language grounding performance of multimodal classifiers that predict whether each predicate p ∈ P applies to each object o ∈ O.
In previous work, decisions about a predicate and an object are made for each sensorimotor context (a combination of a behavior and sensory modality) with an SVM using the feature space for that context (Thomason et al., 2016). A summary of sensorimotor contexts is given in Table 1.

Behaviors | Modalities
look | color, fpfh
drop, grasp, hold, lift, lower, press, push | audio, haptics

Table 1: The contexts (combinations of robot behavior and perceptual modality) we use for multimodal language grounding. The color modality is color histograms, fpfh is fast-point feature histograms, audio is fast Fourier transform frequency bins, and haptics is averages over robot arm joint forces (detailed in past work).
For example, a classifier is trained from the positive and negative object examples for "red" in look/color space as well as in the less relevant drop/audio space. These decisions are then averaged together, each weighted by its Cohen's κ agreement with human labels under leave-one-out cross validation on the training data. In this way, the look/color space for "red" is expected to have high κ and a large influence on the decision, while drop/audio would have low κ and little influence. The decision d(p, o) ∈ [−1, 1] for predicate p and object o is defined as:

d(p, o) = ( Σ_{c∈C} κ_{p,c} · G_{p,c}(o) ) / ( Σ_{c∈C} κ_{p,c} ),

for G_{p,c} a supervised grounding classifier trained on labeled objects for predicate p in the feature space of sensorimotor context c that returns a value in {−1, 1}, with κ_{p,c} its agreement with human labels. If d(p, o) ≤ 0, we say p does not apply to o; otherwise, we say it does. We use SVMs with linear kernels as grounding classifiers.

We extend the weighting scheme between sensorimotor SVMs to include behavior information. For each predicate derived from the "I Spy" game in previous work, we gather relevant behaviors from human annotators. Annotators were asked to mark which among the 8 exploratory behaviors (Table 1) they would engage in to determine whether a given predicate applied to a novel object. Annotators could mark as many behaviors as they wanted for each predicate, but were required to choose at least one.

Figure 3: The distribution over annotator-chosen behaviors (left) gathered in this work, as well as the distribution over modality norms (right) derived from previous work (Lynott and Connell, 2009), for the predicates "white" and "round". The fpfh modality is fast-point feature histograms.
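The κ-weighted combination of per-context classifiers described above can be sketched as follows; the context votes and κ values here are invented for illustration, not taken from the paper's data:

```python
# A minimal sketch, assuming toy context votes and kappa values, of the
# kappa-weighted decision d(p, o): each sensorimotor context casts a
# vote in {-1, 1}, weighted by that context's leave-one-out agreement
# (Cohen's kappa) with human labels.

def decide(votes, kappas):
    """votes, kappas: dicts keyed by sensorimotor context name."""
    num = sum(kappas[c] * votes[c] for c in votes)
    den = sum(kappas[c] for c in votes)
    return num / den if den else 0.0

# "red" trusts look/color far more than drop/audio.
votes = {"look/color": 1, "drop/audio": -1}
kappas = {"look/color": 0.9, "drop/audio": 0.05}
d = decide(votes, kappas)   # positive, so "red" is predicted to apply
```

Contexts with high κ dominate the vote, so an irrelevant context like drop/audio barely shifts the decision for a visual predicate.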
We gathered annotations from 14 people, then discarded the annotations from those whose average κ agreement with all other annotators was less than 0.4 (the poor-to-fair agreement range). This left us with 8 annotators whose average κ = .475 (moderate agreement). We release the full set of gathered annotations on 81 perceptual predicates across 8 behaviors as a corpus for community use.¹ Then, for each p ∈ P, we induce a distribution over behaviors b ∈ B based on the ratio of annotators that marked each behavior relevant, such that Σ_{b∈B} A^B_{p,b} = 1, with A^B_{p,b} equal to the proportion of annotators who marked behavior b relevant for understanding predicate p. Some predicates, like "white", have single-behavior distributions. For other predicates, like "metal", annotators chose more complex combinations of behaviors. Figure 3 (Left) gives some examples of behavior distributions from our annotations.

The decision d_B(p, o) considering behavior annotations is calculated as:

d_B(p, o) = ( Σ_{c∈C} A^B_{p,c_b} · κ_{p,c} · G_{p,c}(o) ) / ( Σ_{c∈C} A^B_{p,c_b} · κ_{p,c} ),

where c_b is the behavior for sensorimotor context c.

¹ http://jessethomason.com/publication_supplements/robonlp_thomason_mooney_behavior_annotations.csv
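A sketch of the behavior-annotation weighting; the annotator marks, votes, and κ values below are invented stand-ins for the real annotation data:

```python
# A hypothetical sketch of d_B(p, o): a behavior distribution A_B is
# built from annotator marks, then each context's kappa weight is
# scaled by A_B for that context's behavior.

def behavior_distribution(marks, behaviors):
    """marks: one set of relevant behaviors per annotator."""
    counts = {b: sum(b in m for m in marks) for b in behaviors}
    total = sum(counts.values())
    return {b: counts[b] / total for b in behaviors}

def decide_b(votes, kappas, a_b, behavior_of):
    """Behavior-weighted analogue of d(p, o)."""
    num = sum(a_b[behavior_of[c]] * kappas[c] * votes[c] for c in votes)
    den = sum(a_b[behavior_of[c]] * kappas[c] for c in votes)
    return num / den if den else 0.0

# Three annotators mark behaviors relevant for a predicate like "white".
a_b = behavior_distribution([{"look"}, {"look"}, {"look", "lift"}],
                            ["look", "lift"])
behavior_of = {"look/color": "look", "lift/haptics": "lift"}
d = decide_b({"look/color": 1, "lift/haptics": -1},
             {"look/color": 0.8, "lift/haptics": 0.6}, a_b, behavior_of)
```

Because most annotators chose look, the look/color context outweighs lift/haptics even though their κ values are comparable.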
We also experiment with adding modality annotations (Table 1). In particular, we derive a modality distribution for each p ∈ P such that Σ_{m∈M} A^M_{p,m} = 1 from modality exclusivity norms gathered by past work for the auditory, gustatory, haptic, olfactory, and visual modalities (Lynott and Connell, 2009). We ignore the gustatory and olfactory modalities, which have no counterpart in our sensorimotor contexts, and create A^M_{p,m} scores from the auditory, haptic, and visual modality norm means. The visual modality norm is split evenly between relevance scores A^M_{p,color} and A^M_{p,fpfh}, our visual color and shape modalities. The decision d_M(p, o) considering modality annotations is calculated as:

d_M(p, o) = ( Σ_{c∈C} A^M_{p,c_m} · κ_{p,c} · G_{p,c}(o) ) / ( Σ_{c∈C} A^M_{p,c_m} · κ_{p,c} ),

where c_m is the modality for sensorimotor context c. When the predicate p does not appear in the norming dataset from past work,² a uniform A^M_{p,m} = 1/|M| is used. Figure 3 (Right) gives some examples of modality distributions from these norms.
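The mapping from norm means to A^M can be sketched as below; the norm values are invented, since the real ones come from the Lynott and Connell (2009) dataset:

```python
# A hypothetical sketch of deriving A_M from (auditory, haptic, visual)
# modality norm means. The visual mass is split evenly over color and
# fpfh, and predicates missing from the norms fall back to a uniform
# distribution over the |M| = 4 modalities.

MODALITIES = ("audio", "haptics", "color", "fpfh")

def modality_distribution(norms):
    """norms: (auditory, haptic, visual) means, or None if unavailable."""
    if norms is None:
        return {m: 1.0 / len(MODALITIES) for m in MODALITIES}
    aud, hap, vis = norms
    total = aud + hap + vis
    return {"audio": aud / total, "haptics": hap / total,
            "color": vis / (2 * total), "fpfh": vis / (2 * total)}

a_m = modality_distribution((0.5, 1.0, 3.5))   # strongly visual predicate
uniform = modality_distribution(None)          # predicate not in the norms
```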
The data sparsity inherent in language grounding from limited human interaction means some predicates have just a handful of positive and negative examples, while more common predicates may have many. If we have few examples for "narrow" but many for "thin", we can share some information between them. For example, if κ_{thin,grasp/haptic} is high, we should trust the grasp/haptic sensorimotor context for "narrow" more than the κ estimates for "narrow" alone suggest.
We explore sharing κ information between related predicates by calculating their cosine similarity in word embedding space, using Word2Vec (Mikolov et al., 2013) vectors derived from Google News.³ For every pair of predicates p, q ∈ P with word embedding vectors v_p, v_q, we calculate similarity as

S(p, q) = ( cos(v_p, v_q) + 1 ) / 2,

which falls in [0, 1], and subsequently take a weighted average of κ values using these similarities as weights to get decisions d_W(p, o) as

d_W(p, o) = ( Σ_{c∈C} κ^W_{p,c} · G_{p,c}(o) ) / ( Σ_{c∈C} κ^W_{p,c} ), with κ^W_{p,c} = ( Σ_{q∈P} S(p, q) · κ_{q,c} ) / ( Σ_{q∈P} S(p, q) ).

Table 2: Precision (p), recall (r), and f-measure (f1) of predicate classifiers across weighting schemes. mc gives the majority class baseline. Weighting schemes consider only validation confidence (κ, as in previous work), confidence and behavior annotations (B+κ), confidence and modality annotations (M+κ), and confidence and word similarity (W+κ). Note that we show the average per-predicate f-measure, not the f-measure of the average per-predicate precision and recall.
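The similarity-weighted κ sharing can be sketched as below; the toy vectors and κ values are invented, where real vectors would come from the Google News Word2Vec model:

```python
import math

# A hypothetical sketch of kappa sharing via word similarity: cosine
# similarity is rescaled from [-1, 1] to [0, 1], then each context's
# kappa for predicate p is a similarity-weighted average over all
# predicates.

def similarity(v_p, v_q):
    """Cosine similarity rescaled to [0, 1]."""
    dot = sum(a * b for a, b in zip(v_p, v_q))
    norm = math.sqrt(sum(a * a for a in v_p)) * math.sqrt(sum(b * b for b in v_q))
    return (dot / norm + 1.0) / 2.0

def shared_kappa(p, context, kappas, vectors):
    """Similarity-weighted average of kappa across predicates."""
    weights = {q: similarity(vectors[p], vectors[q]) for q in kappas}
    num = sum(weights[q] * kappas[q][context] for q in kappas)
    return num / sum(weights.values())

vectors = {"narrow": (1.0, 0.1), "thin": (0.9, 0.2), "green": (-0.2, 1.0)}
kappas = {"narrow": {"grasp/haptics": 0.1},
          "thin": {"grasp/haptics": 0.9},
          "green": {"grasp/haptics": 0.0}}
k = shared_kappa("narrow", "grasp/haptics", kappas, vectors)
```

Because "thin" is close to "narrow" in this toy embedding space, the shared κ for "narrow" in grasp/haptics ends up well above its own sparse estimate of 0.1.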
Results

We use leave-one-object-out cross validation to obtain performance statistics for each weighting scheme. We calculated precision, recall, and f-measure for each predicate⁴ and averaged scores across all predicates. Table 2 gives the results for predicates that have at least 3 positive and 3 negative training object examples.⁵ We observe that adding behavior annotations or modality annotations improves performance over using κ confidence alone, as was done in past work. Sharing κ confidences across similar predicates based on their embedding cosine similarity improves recall at the cost of precision.

⁴ Decisions were made for each testing object and marked correct or incorrect against human labels for that object, when available for the predicate.
⁵ The trends are similar when considering all predicates, but the scores and differences in performance are lower due to many predicates having only a single positive or negative example.
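The distinction noted in the Table 2 caption between the average per-predicate f-measure and the f-measure of averaged precision and recall can be seen with a small worked example (the precision/recall numbers here are invented):

```python
# Toy illustration: the mean of per-predicate f-measures is generally
# not the f-measure of the mean precision and mean recall.

def f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

per_pred = [(0.9, 0.1), (0.5, 0.5)]        # (precision, recall) per predicate
mean_f1 = sum(f1(p, r) for p, r in per_pred) / len(per_pred)
p_bar = sum(p for p, _ in per_pred) / len(per_pred)
r_bar = sum(r for _, r in per_pred) / len(per_pred)
f1_of_means = f1(p_bar, r_bar)             # differs from mean_f1
```

Here mean_f1 is 0.34 while f1_of_means is 0.42, so the two summaries can disagree substantially.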
Adding behavior annotations helps more than adding modality norms, but we gathered behavior annotations for all predicates, while modality annotations were only available for a subset (about half). Adding behavior annotations helped the f -measure of predicates like "pink", "green", and "half-full", while adding modality annotations helped with predicates like "round", "white", and "empty".
Sharing confidences through word similarity helped with some predicates, like "round", at the expense of domain-specific meanings of predicates like "water". In the "I Spy" paradigm from which these data were gathered, the authors noted that "water" correlated with object weight because all of their water bottle objects were partially or completely full (Thomason et al., 2016). Thus, in that domain, "water" is synonymous with "heavy". In a less restricted domain, word similarity may add less real world "noise" to the problem.

Conclusions and Future Work
In this work, we have demonstrated that behavior annotations can improve language grounding for a platform with multiple interaction behaviors and modalities. In the future, we would like to apply this intuition in an embodied dialog agent. If a person asks a service robot to "Get the white cup," the robot should be able to ask "What should I do to tell if something is 'white'?", a behavior annotation prompt. A human-robot POMDP dialog policy could be learned, as in previous work (Padmakumar et al., 2017), to know when this kind of follow-up question is warranted.
Additionally, we will explore other methods of sharing information between predicates from lexical information. For example, choosing a maximally similar neighboring word, rather than doing a weighted average across all known words, may yield better results (e.g. the best neighbor of "narrow" is "thin", so don't bother considering things like "green" at all).
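The maximally-similar-neighbor alternative described above can be sketched as follows; the similarity scores are invented stand-ins for embedding cosine similarity:

```python
# A hypothetical sketch of borrowing kappa from only the single most
# similar predicate, rather than averaging over all known predicates.

def best_neighbor(p, similarities):
    """similarities: {(a, b): S(a, b)}; returns p's closest other predicate."""
    others = {q: s for (a, q), s in similarities.items() if a == p and q != p}
    return max(others, key=others.get)

sims = {("narrow", "narrow"): 1.00,
        ("narrow", "thin"): 0.95,
        ("narrow", "green"): 0.30}
neighbor = best_neighbor("narrow", sims)   # "thin"; "green" is ignored
```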
Acknowledgments

This work was supported in part by a fellowship to the first author, an NSF EAGER grant (IIS-1548567), and an NSF NRI grant (IIS-1637736). A portion of this work has taken place in the Learning Agents Research Group (LARG) at UT Austin. LARG research is supported in part by NSF (CNS-1330072, CNS-1305287,