Jointly Learning Grounded Task Structures from Language Instruction and Visual Demonstration

To enable language-based communication and collaboration with cognitive robots, this paper presents an approach where an agent can learn task models jointly from language instruction and visual demonstration using an And-Or Graph (AoG) representation. The learned AoG captures a hierarchical task structure where linguistic labels (for language communication) are grounded to corresponding state changes from the physical environment (for perception and action). Our empirical results on a cloth-folding domain have shown that, although state detection through visual processing is full of uncertainties and error prone, by a tight integration with language the agent is able to learn an effective AoG for task representation. The learned AoG can be further applied to infer and interpret on-going actions from new visual demonstration using linguistic labels at different levels of granularity.


Introduction
Given tremendous advances in robotics, computer vision, and natural language processing, a new generation of cognitive robots has emerged that aims to collaborate with humans in joint tasks. To facilitate natural and efficient communication with these physical agents, natural language processing will need to go beyond traditional symbolic representations and ground language to the sensors (e.g., visual perception) and actuators (e.g., lower-level control systems) of physical agents. The internal task representation will need to capture both higher-level concepts (for language communication) and lower-level visual features (for perception and action).
* The first two authors contributed equally to this paper.
To address this need, we have developed an approach to learning procedural tasks jointly from language instruction and visual demonstration. In particular, we use an And-Or Graph (AoG), which has been used in many computer vision tasks and robotic applications (Zhao and Zhu, 2013; Li et al., 2016), to represent a hierarchical task model that captures not only symbolic concepts (extracted from language instructions) but also the corresponding visual state changes from the physical environment (detected by computer vision algorithms).
Different from previous works that ground language to perception (Liu et al., 2012; Matuszek et al., 2012; Kollar et al., 2013; Yu and Siskind, 2013), a key innovation in our framework is that language is no longer grounded just to perceived objects in the environment, but is further grounded to a hierarchical structure of state changes, where the states are perceived from the environment during visual demonstration. The state of the environment is an important notion in robotic systems, as the change of states drives planning for lower-level robotic actions. Thus, by connecting language concepts to state changes, our learned AoG provides a unified representation that integrates language and vision to not only support language-based communication but also facilitate robot action planning and execution in the future. More specifically, within this AoG framework, we have developed and evaluated our algorithms in the context of learning a cloth-folding task. Although cloth-folding appears simple and intuitive for humans, it poses significant challenges for both vision and robotics systems. Furthermore, although symbolic language processing in this domain is easy due to the limited vocabulary, grounded language understanding is particularly challenging. A simple phrase (e.g., "fold in half") could have different grounded meanings (e.g., lower-level representations) in different contexts. Thus, this cloth-folding domain is a good starting point for studying how to ground language to task structures.
Our empirical results have shown that, although state detection from the physical world can be extremely noisy, our learning algorithm that tightly incorporates language is capable of acquiring an effective and meaningful task model that compensates for the uncertainties in visual processing. Once the AoG for the task is learned, it can be applied by our inference algorithm, for example, to infer on-going actions from new visual demonstration and generate linguistic labels at different levels of granularity to facilitate human-agent communication.
Research on Learning from Demonstration (LfD) has employed various approaches to model tasks (Argall et al., 2009), such as state-to-action mapping, predicate calculus (Hofmann et al., 2016), and Hierarchical Task Networks (Nejati et al., 2006; Hogg et al., 2009). However, aspiring to enable human-robot communication, the framework developed in this paper focuses on task representation using language grounded to a structure of state changes detected from the physical world. As demonstrated in recent work (She et al., 2014a; Misra et al., 2015; She and Chai, 2016), explicitly modeling change of states is an important step towards interacting with robots in the physical world.
Figure 1: The setting of our situated task learning where a human teacher teaches the robot how to fold a T-shirt through both task demonstrations and language instructions.
Additionally, there has been an increasing amount of work that learns new tasks either with methods such as supervised learning on large corpora (Branavan et al., 2010; Branavan et al., 2012; Tellex et al., 2014; Misra et al., 2014), or by learning from humans through dialogue (Cantrell et al., 2012; Mohan et al., 2013; She et al., 2014b; Mohseni-Kabir et al., 2015). In this paper, we focus on jointly learning new tasks through visual demonstration and language instruction. The learned task model is explicitly represented by an AoG, a hierarchical structure consisting of both linguistic labels and corresponding changes of states from the physical world. This rich task model will facilitate not only language-based communication, but also lower-level action planning and execution.

Task and Data
In this paper, we use cloth-folding (e.g., teaching a robot how to fold a T-shirt) as the task to demonstrate and evaluate our joint task learning approach. As mentioned earlier, cloth-folding, although simple for humans, is a challenging task for the robotics community due to its complex state and action space. Figure 1 illustrates the setting for our situated task learning. A human teacher can teach a robot how to fold a T-shirt through simultaneous verbal instructions and visual demonstrations. A Microsoft Kinect2 camera is mounted on the robot to record the human's visual demonstration, and the human's corresponding verbal instructions are recorded by the Kinect2's embedded microphone array. A recorded video of a task demonstration and its corresponding verbal instructions become one training example for our task learning system. Figure 2 shows two examples of such "parallel data". The visual demonstration is processed into a sequence of visual states, where each state is a numeric vector ($v_i$) capturing the visual features of the T-shirt at a particular time (see Section 5.1 for details). The recorded verbal instructions are then aligned with the sequence of visual states based on the timing information.
While teaching the task, the demonstrator was specifically requested to describe and perform each step at roughly the same time. This greatly simplified the alignment problem. Since our ultimate goal is to enable humans to teach the robot through natural language dialogue and demonstration, our hypothesis is that the alignment issue can be alleviated by certain dialogue mechanisms (e.g., asking the teacher to repeat the action, asking for step-by-step aligned instructions, etc.). As it is in the human's best interest that the robot gets the clearest instructions, we also anticipate that during dialogue human teachers will be collaborative and provide mostly aligned instructions. Certainly, these hypotheses will need to be validated in the dialogue setting in our future work.
In our collected data, each change of state, i.e., a transition between two visual states, is caused by one or more physical actions. Some language descriptions align with only a single-step change of state. For instance, "fold right sleeve" is aligned with the change $(v_0 \to v_1)$ and "fold left sleeve" is aligned with $(v_1 \to v_2)$ in Example 1. This kind of single-step change of state is considered a primitive action. Other language descriptions are aligned with a sequence of multiple state changes. For instance, "fold the two sleeves" in Example 2 is aligned with two consecutive changes. This kind of sequence of state changes is considered a complex action, which can be decomposed into partially ordered primitive actions. A complex action can also be concisely represented by the change from the initial state to the end state in the sequence. These parallel data are used to train and test our learning and inference algorithms presented later.

And-Or Graph Representation
We use AoG as the formal model of a procedural task. Figure 3 shows an example AoG for the cloth-folding task. It is a hierarchical structure that explicitly captures the compositionality and reconfigurability of a procedural task. The terminal nodes capture state changes associated with primitive actions of this task, and non-terminal nodes capture state changes associated with complex actions, which are further composed of lower-level actions.
In addition to state changes, the learned AoG is also labeled with linguistic information (e.g., verb frames) capturing the causes of the corresponding state changes. The state changes are also considered as grounded meanings of these verb frames. For example, Figure 3 shows two "fold the t-shirt" labels at the top layer. Note that although symbolically, these two phrases have the same meaning (e.g., same verb frames), their grounded meanings are different as they correspond to different changes of state. Being able to represent differences or ambiguities in grounded meanings is crucial to connect language to perception and action.
Formally, an AoG is defined as a 5-tuple $G = (S, \Sigma, N, R, \Theta)$, where
• $S$ is a root node (or a start symbol) representing a complete task.
• $\Sigma$ is a set of terminal nodes, each of which represents a change of state associated with a primitive action.
• $N$ is a set of non-terminal nodes, which is divided into two disjoint subsets of And-nodes ($N_{AND}$) and Or-nodes ($N_{OR}$).
• $R$ is a "child-parent" relation (many-to-one mapping), i.e., $R(n_{ch}) = n_{pa}$ (meaning $n_{pa}$ is the parent node of $n_{ch}$), where $n_{ch} \in \Sigma \cup N$ and $n_{pa} \in N \cup \{S\}$.
• $\Theta$ is a set of conditional probabilities $p(n_{ch} \mid n_{pa})$, where $n_{pa} \in N_{OR}$ and $n_{ch} \in \{n \mid R(n) = n_{pa}\}$. Namely, for each Or-node, $\Theta$ defines a probability distribution over the set of all its children nodes.
In essence, our AoG model is equivalent to a Probabilistic Context-Free Grammar (PCFG). An AoG can be converted into a PCFG as follows:
• Each And-node and its children form a production rule $n_{AND} \to n_{ch_1} \wedge n_{ch_2} \wedge \ldots$ that represents the decomposition of a complex action into sequentially ordered sub-actions.
• Each Or-node and its children form a production rule $n_{OR} \to n_{ch_1} \mid n_{ch_2} \mid \ldots$ that represents all the alternative ways of accomplishing an action. Each alternative also comes with a probability as specified in $\Theta$.
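The conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `AoG`, the field names, and the node identifiers are all hypothetical, the relation R is stored in inverted form (parent to ordered children) so that the order of sub-actions is kept, and the root is assumed to behave like an Or-node.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class AoG:
    """Toy container mirroring G = (S, Sigma, N, R, Theta)."""
    root: str                      # S
    terminals: Set[str]            # Sigma
    and_nodes: Set[str]            # N_AND
    or_nodes: Set[str]             # N_OR
    # Inverse of R (assumption): parent -> ordered list of children.
    children: Dict[str, List[str]] = field(default_factory=dict)
    # Theta: p(child | parent) for Or-like parents.
    theta: Dict[Tuple[str, str], float] = field(default_factory=dict)


def to_pcfg(g: AoG) -> List[Tuple[str, Tuple[str, ...], float]]:
    """Return production rules as (head, body, probability) triples."""
    rules = []
    for node, kids in g.children.items():
        if node in g.and_nodes:
            # And-node: one rule whose body is the ordered sequence of children.
            rules.append((node, tuple(kids), 1.0))
        elif node in g.or_nodes or node == g.root:
            # Or-node (and, by assumption, the root): one rule per alternative.
            for kid in kids:
                rules.append((node, (kid,), g.theta[(node, kid)]))
    return rules


toy = AoG(
    root="S",
    terminals={"s0->s1", "s1->s2"},
    and_nodes={"A1"},
    or_nodes={"O1"},
    children={"S": ["O1"], "O1": ["A1"], "A1": ["s0->s1", "s1->s2"]},
    theta={("S", "O1"): 1.0, ("O1", "A1"): 1.0},
)
pcfg_rules = to_pcfg(toy)
for head, body, prob in pcfg_rules:
    print(head, "->", " ".join(body), f"[{prob}]")
```

Storing the children per parent rather than the parent per child is purely a convenience for this sketch; both encode the same relation.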

Vision and Language Processing
The input data to our AoG learning algorithm consist of co-occurring visual demonstrations and language instructions as described in Section 3. Based on the RGB-D information provided by the Kinect2 sensor, we developed a vision processing system to keep track of the human's actions and the statuses of the T-shirt.
To learn a meaningful task structure, the most important visual information is the set of key statuses that the object goes through. Therefore, our vision system processes each visual demonstration into a sequence of states. Each state $v$ is a multi-dimensional numeric vector that encodes the geometric information of the detected T-shirt, such as its smallest bounding rectangle and largest inscribed contour-fitting rectangle. These key states are detected by tracking the human's folding actions. Namely, whenever a folding action is detected¹, we append the new state caused by the action to the sequence of observed states, until the end of the demonstration.
The verbal instructions given by the demonstrators were mainly verb phrases such as "fold which-part", "fold to which-position", or "fold in-what-manner". A semantic parser² is applied to parse each instruction text into a canonical verb-frame representation.
Through the vision and language processing, each task demonstration becomes two parallel sequences, i.e., a sequence of extracted visual states and a sequence of parsed language instructions. The alignment between these two sequences is also extracted from their co-occurrence timing information. Thus, an instance of a task demonstration is formally represented as a 3-tuple $x = (D, L, \Delta)$, where $D = \{v_1, v_2, \ldots, v_M\}$ is the sequence of visual states, $L = \{l_1, l_2, \ldots, l_K\}$ is the sequence of linguistic verb-frames, and $\Delta(k) = (i, j)$ is an "alignment function" specifying the correspondence between a linguistic verb-frame $l_k$ and a single visual state or a sub-sequence of visual states $v_i, \ldots, v_j$.
Then, given a dataset $X$ of such task demonstrations, our AoG learning algorithm learns an AoG $G$ as defined in Section 4. The next section describes our learning algorithm in detail.
¹ The vision system keeps track of the human's hands, and detects a folding action as a gripping action followed by moving and releasing the hand(s).
² We use CMU's Phoenix parser: http://wiki.speech.cs.cmu.edu/olympus/index.php/Phoenix
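To make the 3-tuple representation concrete, the following is a small sketch of one way to hold a demonstration in code. The class and field names are hypothetical, the verb-frame strings are placeholders rather than actual parser output, and the sketch assumes that the alignment pair (i, j) gives 0-based indices of the initial and end states of the aligned span, so that a span covering exactly one change of state is a primitive action.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Demonstration:
    """One training instance x = (D, L, Delta)."""
    states: List[List[float]]               # D: visual state vectors
    frames: List[str]                       # L: parsed verb frames
    alignment: Dict[int, Tuple[int, int]]   # Delta: frame index k -> (i, j)

    def action_type(self, k: int) -> str:
        """Classify frame k by the number of state changes it spans (assumed convention)."""
        i, j = self.alignment[k]
        return "primitive" if j - i == 1 else "complex"


demo = Demonstration(
    states=[[0.0], [1.0], [2.0]],           # placeholder feature vectors
    frames=["fold(right sleeve)", "fold(left sleeve)", "fold(two sleeves)"],
    alignment={0: (0, 1), 1: (1, 2), 2: (0, 2)},
)
for k, frame in enumerate(demo.frames):
    print(frame, demo.alignment[k], demo.action_type(k))
```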

AoG Learning Algorithm
Learning an AoG G = (S, Σ, N, R, Θ) is carried out in two stages. First, we learn a set of terminal nodes Σ to represent the primitive actions (i.e., the actions that can be performed in a single step). This is done by clustering the observed visual states. Second, the hierarchical structure (i.e., N and R) and the parameters Θ of the AoG are learned using an iterative grammar induction algorithm.

Learning Terminal Nodes
A terminal node in the AoG represents a primitive action, which causes the object to directly change from one state to another. Thus we represent a terminal node as a 2-tuple of states (or a "change of state"). Since the visual states detected by computer vision are numeric vectors with continuous values, we first apply a clustering algorithm to form a finite set of discrete state representations. Each cluster then represents a unique situation that one can encounter in a task. Since when learning a new task we usually do not know how many unique situations exist, here we employ a greedy clustering algorithm (Ng and Cardie, 2002), which does not assume a fixed number of clusters.
As the greedy clustering algorithm relies on the pairwise similarities of all the visual states, we also train an SVM classifier on a separate dataset of 22 T-shirt folding videos and use its classification output to measure the similarity between two visual states. The SVM classifier takes two numeric vectors as input, and predicts whether these two vectors represent the same status of a T-shirt. We then apply this SVM classifier to each pair of detected visual states in our new dataset (i.e., the dataset for learning the AoG), and use the SVM's output class label (1 or −1) multiplied by its classification confidence score as the similarity measurement between two visual states.
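As an illustration of how a signed similarity can drive clustering without a preset number of clusters, here is a simple greedy scheme. It is only in the spirit of the approach described above: the exact procedure, the linking rule, and the threshold of 0 are assumptions, and `toy_similarity` is a hypothetical stand-in for the trained SVM.

```python
from typing import Callable, List, Sequence


def signed_similarity(label: int, confidence: float) -> float:
    """Class label (+1 same status, -1 different status) times classifier confidence."""
    return label * confidence


def greedy_cluster(
    states: Sequence[Sequence[float]],
    similarity: Callable[[Sequence[float], Sequence[float]], float],
    threshold: float = 0.0,
) -> List[int]:
    """Assign each state to the cluster of its most similar earlier state,
    or open a new cluster when no earlier state exceeds the threshold."""
    cluster_ids: List[int] = []
    next_id = 0
    for i, state in enumerate(states):
        best_j, best_sim = -1, threshold
        for j in range(i):
            sim = similarity(states[j], state)
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j >= 0:
            cluster_ids.append(cluster_ids[best_j])
        else:
            cluster_ids.append(next_id)
            next_id += 1
    return cluster_ids


def toy_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    # Hypothetical stand-in for the SVM: "same status" if the vectors are close.
    dist = abs(a[0] - b[0])
    label = 1 if dist < 0.5 else -1
    return signed_similarity(label, abs(0.5 - dist))


cluster_ids = greedy_cluster([[0.0], [0.1], [2.0], [2.1]], toy_similarity)
print(cluster_ids)  # [0, 0, 1, 1]
```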
After clustering all the observed visual states in the data, we replace each numeric vector state representation with the "ID" of the cluster it belongs to. Thus each visual demonstration now becomes a sequence of symbolic values, denoted as $D' = \{s_1, s_2, \ldots, s_M\}$. We further transform it into an equivalent change of state sequence $C = \{(s_1, s_2), (s_2, s_3), \ldots, (s_{M-1}, s_M)\}$, in which each change of state essentially represents a primitive action in this task. These change of state pairs then form the set of terminal nodes Σ.
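The transformation from a symbolic state sequence to a change of state sequence is a pairing of consecutive elements. A minimal sketch follows; the function name and the toy cluster IDs are illustrative.

```python
from typing import List, Sequence, Set, Tuple


def changes_of_state(symbolic_states: Sequence[int]) -> List[Tuple[int, int]]:
    """Pair each symbolic state with its successor: (s1, s2), (s2, s3), ..."""
    return list(zip(symbolic_states, symbolic_states[1:]))


# Two toy demonstrations expressed as sequences of cluster IDs.
demonstrations = [[0, 1, 2, 3], [0, 2, 3]]
change_sequences = [changes_of_state(d) for d in demonstrations]

# The terminal node set collects every distinct change of state observed.
terminal_nodes: Set[Tuple[int, int]] = {c for seq in change_sequences for c in seq}
print(change_sequences)
print(sorted(terminal_nodes))
```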

Learning the Structure and Parameters
With the sequences of numeric vector states replaced by the "symbolic" change of state sequences in the first stage, we can further learn the structure and parameters of an AoG, namely the $N$, $R$, and $\Theta$ that maximize the posterior probability given the observed change of state sequences. The objective has two terms: one for the structure ($N$ and $R$) and one for the parameters $\Theta$. To solve the first term, we use greedy or beam search with a heuristic function similar to that of Solan et al. (2005). To solve the second term, we estimate the probability of each branch of an Or-node by computing the frequency of that branch, which is essentially a maximum likelihood estimation similar to Pei et al. (2013).
In detail, the learning procedure first initializes empty N , R, and Θ, then iterates through the following two steps until no further update can be made.
Step (1): search for new And-nodes.
This step searches for new And-node candidates from $\Sigma \cup N$, and updates $N$ and $R$ with the top-ranked candidates. Specifically, we denote an And-node candidate to be searched as $A = (s_l \to s_m \to s_r)$.
Here $s_l$ is the initial state of an existing node whose end state is $s_m$, and $s_r$ is the end state of another existing node whose initial state is $s_m$. Thus an And-node candidate always has two child nodes, and represents a pattern of sub-sequences that starts from state $s_l$, ends at $s_r$, and passes through $s_m$ somewhere in the middle.
Using the above notation, the heuristic function for ranking And-node candidates combines two components, $P_{state}(A)$ and $P_{label}(A)$, with a weight $\lambda$ on the linguistic component. $P_{state}(A)$ captures the prevalence of a particular And-node candidate based on the observed state change sequences:
$$P_{state}(A) = \frac{P_R(A) + P_L(A)}{2}$$
where $P_R(A)$ is the ratio between the number of times $(s_l \to s_m \to s_r)$ appears and the number of times $(s_l \to s_m)$ appears, and $P_L(A)$ is the ratio between the number of times $(s_l \to s_m \to s_r)$ appears and the number of times $(s_m \to s_r)$ appears.
The component $P_{label}(A)$ captures the prevalence of linguistic labels associated with the sequential state change patterns. It is computed as the ratio between the number of times $(s_l \to s_m \to s_r)$ co-occurs with a linguistic instruction and the total number of times $(s_l \to s_m \to s_r)$ appears.
We define two AoG learning settings based on the role that language plays:
• Tight language integration: incorporates heuristics on linguistic labels (i.e., $\lambda = 0.5$). In this setting, the learned AoG prefers And-nodes that not only happen frequently, but also can be described by a linguistic label.
• Loose language integration: does not incorporate the heuristics on linguistic labels ($\lambda = 0$). Each And-node is learned based only on the frequency of its state change pattern. The learned node can still acquire a linguistic label if there happens to be a co-occurring one, but the chance is lower than in the "tight" setting.
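The ranking of And-node candidates under the two settings can be illustrated with a small sketch. Two things here are assumptions rather than the paper's definitions: the way the two components are combined (a convex combination weighted by λ, chosen because λ = 0 must ignore the labels), and the restriction to contiguous triples of states, which corresponds to the first iteration where both children are terminal nodes. All function names and the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Triple = Tuple[int, int, int]


def count_patterns(sequences: List[List[int]]):
    """Count consecutive state pairs and triples over all symbolic sequences."""
    pair_counts: Counter = Counter()
    triple_counts: Counter = Counter()
    for seq in sequences:
        pair_counts.update(zip(seq, seq[1:]))
        triple_counts.update(zip(seq, seq[1:], seq[2:]))
    return pair_counts, triple_counts


def p_state(a: Triple, pair_counts: Counter, triple_counts: Counter) -> float:
    """Average of P_R and P_L as defined in the text."""
    s_l, s_m, s_r = a
    n = triple_counts[a]
    if n == 0:
        return 0.0
    p_r = n / pair_counts[(s_l, s_m)]
    p_l = n / pair_counts[(s_m, s_r)]
    return (p_r + p_l) / 2


def p_label(a: Triple, triple_counts: Counter, labeled_counts: Dict[Triple, int]) -> float:
    """Fraction of occurrences of the pattern that co-occur with an instruction."""
    n = triple_counts[a]
    return labeled_counts.get(a, 0) / n if n else 0.0


def heuristic(a, pair_counts, triple_counts, labeled_counts, lam: float) -> float:
    # ASSUMPTION: convex combination; the exact formula is not given in the text.
    return ((1 - lam) * p_state(a, pair_counts, triple_counts)
            + lam * p_label(a, triple_counts, labeled_counts))


sequences = [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 4]]
pair_counts, triple_counts = count_patterns(sequences)
labeled_counts = {(0, 1, 2): 2}  # e.g. both occurrences were described verbally

best = {}
for lam in (0.0, 0.5):
    ranked = sorted(
        triple_counts,
        key=lambda a: heuristic(a, pair_counts, triple_counts, labeled_counts, lam),
        reverse=True,
    )
    best[lam] = ranked[0]
    print(f"lambda={lam}: top candidate {ranked[0]}")
```

In this toy data, the loose setting prefers the most frequent pattern (1, 2, 3), while the tight setting prefers (0, 1, 2) because it is consistently accompanied by a linguistic label.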
Step (2): search for new Or-nodes or new branches of existing Or-nodes, then update Θ.
Once new And-nodes are added by the previous step, the next step is to search for Or-nodes that can be created or updated. An Or-node in the AoG essentially represents the set of all And-nodes that share the same initial and end states, denoted as $(s_l \to s_r)$, where $s_l$ and $s_r$ are the common initial and end states, respectively. If $(s_l \to s_m \to s_r)$ is a newly added And-node, it is assigned as a child of the Or-node $(s_l \to s_r)$. To further update $\Theta$, the branching probability is computed as the ratio between the number of times $(s_l \to s_m \to s_r)$ appears and the number of times $(s_l \to s_r)$ appears.
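The frequency-based update of Θ amounts to dividing counts, as the following sketch shows. The function name, the dictionary layout, and the toy counts are hypothetical; how the occurrence counts themselves are gathered is left outside the sketch.

```python
from typing import Dict, Tuple

Triple = Tuple[int, int, int]
Span = Tuple[int, int]


def branching_probabilities(
    and_counts: Dict[Triple, int],
    span_counts: Dict[Span, int],
) -> Dict[Triple, float]:
    """For each And-node (s_l -> s_m -> s_r), estimate p(And-node | Or-node (s_l -> s_r))
    as count(s_l -> s_m -> s_r) / count(s_l -> s_r)."""
    theta = {}
    for (s_l, s_m, s_r), n in and_counts.items():
        theta[(s_l, s_m, s_r)] = n / span_counts[(s_l, s_r)]
    return theta


# Toy counts: two alternative ways of getting from state 0 to state 3.
and_counts = {(0, 1, 3): 3, (0, 2, 3): 1}
span_counts = {(0, 3): 4}
theta = branching_probabilities(and_counts, span_counts)
print(theta)  # {(0, 1, 3): 0.75, (0, 2, 3): 0.25}
```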

Inference Using AoG
Once a task AoG is learned, it can be applied to interpret and explain new visual demonstrations using linguistic labels. Due to the noise and uncertainties from computer vision processing, one key challenge in interpreting the visual demonstration is to reliably identify the different states of the T-shirt.
To tackle this issue, we formulate a joint inference problem. Namely, given a task demonstration video, we first process it into a sequence of numeric vector states $D = \{v_1, v_2, \ldots, v_M\}$ as described in Section 5.1. Then the goal of inference is to find the most likely parse tree $T$ and sequence of "symbolic states" $D' = \{s_1, s_2, \ldots, s_M\}$ based on the AoG $G$ and the input $D$:
$$(T^*, {D'}^*) = \arg\max_{T, D'} P(T, D' \mid G, D)$$
We apply a chart parsing algorithm (Klein and Manning, 2001) to efficiently solve this problem. Furthermore, to accommodate the ambiguities in mapping a numeric vector state $v_m$ to a symbolic state, we take into consideration the top-$k$ hypotheses measured by the similarity between $v_m$ and a symbolic state $s_k$. For each state mapping hypothesis, we add a completed edge between indices $m$ and $m + 1$ in the chart, with $s_k$ as its symbol and a probability $p$ based on the similarity between $v_m$ and $s_k$. Based on the given AoG, the chart parsing algorithm then uses dynamic programming to search for the best parse tree that maximizes the joint probability $P(T, D' \mid G, D)$.
Figure 4 illustrates the input and output of our inference algorithm. As illustrated by this example, the parse tree represents a hierarchical structure underlying the observed task procedure, and the linguistic labels associated with the nodes can be used to describe the primitive and complex actions involved in the procedure.
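The step of seeding the chart with multiple state mapping hypotheses can be sketched as follows. This covers only the edge construction, not the parser itself. The function name and the input layout are hypothetical, and two assumptions are made: similarity scores are non-negative, and the edge probability is obtained by normalizing the scores of the top-k hypotheses at each position.

```python
import heapq
from typing import Dict, List, Tuple

Edge = Tuple[int, int, str, float]  # (start index, end index, symbol, probability)


def seed_chart(similarities: List[Dict[str, float]], k: int) -> List[Edge]:
    """For each observed state v_m, add one completed edge (m, m + 1) per
    top-k symbolic state hypothesis, weighted by normalized similarity."""
    edges: List[Edge] = []
    for m, scores in enumerate(similarities):
        top = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
        total = sum(score for _, score in top)
        if total <= 0:
            continue  # no usable hypothesis at this position
        for symbol, score in top:
            edges.append((m, m + 1, symbol, score / total))
    return edges


# Toy similarity scores between two observed states and the known clusters.
similarities = [
    {"s0": 0.9, "s1": 0.1},
    {"s1": 0.6, "s2": 0.3, "s0": 0.1},
]
edges = seed_chart(similarities, k=2)
for edge in edges:
    print(edge)
```

With k = 1 only the single best hypothesis per position is kept, which corresponds to committing to the nearest cluster before parsing.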

Evaluation
Using the setting as described in Section 3, we collected 45 T-shirt folding demonstrations from 6 people to evaluate our AoG learning and inference methods. More specifically, we conducted a 5-fold cross validation. In each fold, 36 demonstrations were used for training to learn a task AoG. Then the remaining 9 demonstrations were used for testing, in which the learned AoG is further applied to process each of the testing visual demonstrations.
Motivated by earlier work on plan/activity recognition using CFG-based models (Carberry, 1990;Pynadath and Wellman, 2000), we use an extrinsic task that automatically assigns linguistic labels to new demonstrations to evaluate the quality of the learned AoG and the effectiveness of the inference algorithm. This involves three steps: (1) parse the video using the learned AoG; (2) identify linguistic labels associated with terminal or nonterminal nodes in the parse tree; and (3) compare the identified linguistic labels with the manually annotated labels.
We conduct the evaluation at two levels:
• Primitive actions: use linguistic labels associated with terminal nodes to describe the primitive actions in each video. This level provides detailed descriptions of how the observed task procedure is performed step-by-step.
• Complex actions: use linguistic labels associated with nonterminal nodes to describe complex actions. This provides a high-level "summary" of the detailed low-level actions.
The capability to recognize fine-grained primitive actions as well as high-level complex actions in a task procedure and to communicate those in language is important for many real-world AI applications such as human-robot collaboration (Mohseni-Kabir et al., 2015) and visual question answering (Tu et al., 2014).

Primitive Actions
We first compare the performance of interpreting primitive actions using the learned AoG with a baseline. The baseline applies a memory-based (or similarity-based) approach. Given a testing video, it extracts all the different visual states and maps each state to the nearest cluster learned from the training data (see Section 5.2.1). It then pairs every two consecutive states as a change of state instance, and uses the linguistic label corresponding to the identical change of state found in the training data as the label of a primitive action.
We measure the primitive action recognition performance in terms of the normalized Minimal Edit Distance (MED). Namely, for each testing demonstration we calculate the MED between the ground-truth sequence of primitive action labels and the automatically generated sequence of labels, and divide the MED value by the length of the ground-truth sequence to produce a normalized score (a smaller score indicates better performance in recognizing the primitive actions). The performances of the baseline and our AoG-based approach are shown in Figure 5. For the AoG-based approach, Figure 5 also shows the performances of incorporating different numbers of state mapping hypotheses (i.e., k = 1, 5, 10, 15, 20) into the inference algorithm (Section 5.3). Here we only report the performance of using the AoG learned with the tight language integration (see Section 5.2.2), since there is no difference in performance between the tight and loose language integration settings in recognizing primitive actions⁵.
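For reference, the normalized MED can be computed with a standard dynamic-programming edit distance over label sequences. The sketch below assumes unit costs for insertion, deletion, and substitution, which the text does not specify; the function names and the example labels are illustrative.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two label sequences (unit costs assumed)."""
    prev: List[int] = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_med(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance divided by the length of the ground-truth sequence."""
    if not ref:
        raise ValueError("ground-truth sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


ground_truth = ["fold right sleeve", "fold left sleeve", "fold in half"]
generated = ["fold right sleeve", "fold in half"]
score = normalized_med(ground_truth, generated)
print(round(score, 3))  # 0.333
```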
As Figure 5 shows, the baseline performance is rather weak (i.e., high MED scores). This is largely due to the noise in state clustering and mapping from vision. After manually inspecting the collected demonstration videos, we found 18 unique statuses associated with folding a T-shirt. However, the computer vision based clustering on average produces more than 30 clusters when all 36 training examples are used. This makes it difficult to directly match the state changes as in the baseline. For our AoG-based method, when the inference algorithm only takes the single best state mapping hypothesis into consideration (i.e., k = 1), it yields a very weak performance because the observed state change sequence often cannot be parsed using the learned AoG.
However, the performance of the AoG-based method is significantly improved when multiple state mapping hypotheses are incorporated into the inference process. When the top-5 (k = 5) state mapping hypotheses are incorporated into the AoG-based inference, its MED score already outperforms the baseline by a 0.3 gap (p < 0.001 using the Wilcoxon signed-rank test). When k = 20, the MED score drops by more than 0.6 compared to k = 1 (p < 0.001).
⁵ Because the linguistic labels generated for primitive actions are all from terminal nodes, and the two different AoG learning settings only affect nonterminal nodes.
These results indicate that our AoG-based method is capable of learning useful task structure from small data. When multiple hypotheses of visual state mapping are incorporated, the learned AoG can compensate for the uncertainties in vision processing and identify highly reliable primitive actions from unseen demonstrations.

Complex Actions
We further evaluate the performance of interpreting complex actions using the learned AoG. The baseline for comparison is similar to the one used in the previous section. It first converts a test video into a sequence of "symbolic" states by mapping each detected visual state to its nearest cluster. It then enumerates all the possible segments that consist of more than two consecutive states and searches for the identical segments in the training data. If a matching segment is found, then the corresponding linguistic label (if any) is used as the label for a complex action. Complex actions correspond to nonterminal nodes in the parse tree generated by AoG-based inference; some of them may have linguistic labels while others may not. We use precision, recall, and F-score to measure how well the generated linguistic labels match the manually segmented and annotated complex actions in testing videos. Figure 6 shows the F-scores of recognizing complex actions using the AoG learned from the loose and the tight language integration, respectively. In this figure, results are based on k = 20 state mapping hypotheses incorporated into the inference algorithm. As shown here, performances from both settings are significantly better than the baseline (p < 0.001). The AoG learned based on the tight integration with language yields significantly better performance than the loose integration (over 0.2 gain on F-score, p < 0.001).
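The precision, recall, and F-score computation can be sketched as below. The matching criterion is an assumption: a generated complex action counts as correct only if its segment boundaries and its label both match an annotated one exactly. The text does not state the criterion actually used, and the tuples and labels in the example are hypothetical.

```python
from typing import Iterable, Tuple

LabeledSegment = Tuple[int, int, str]  # (start state index, end state index, label)


def precision_recall_f1(
    predicted: Iterable[LabeledSegment],
    gold: Iterable[LabeledSegment],
) -> Tuple[float, float, float]:
    """Score predicted labeled segments against annotated ones by exact match."""
    predicted_set, gold_set = set(predicted), set(gold)
    true_positives = len(predicted_set & gold_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [(0, 2, "fold the two sleeves"), (0, 4, "fold the t-shirt")]
predicted = [(0, 2, "fold the two sleeves"), (2, 4, "fold in half")]
precision, recall, f1 = precision_recall_f1(predicted, gold)
print(precision, recall, f1)  # 0.5 0.5 0.5
```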
This result indicates that the tight integration of language during AoG learning favors And-node patterns that are more likely to be described by natural language (or more consistent with human conceptualization of the task structure)⁶. Such an AoG representation can lead to recognition of video segments that can be better explained or summarized by human language. This capability of learning explicit and language-oriented task representations is important for linking language and vision to enable situated human-agent communication and collaboration. Table 6.2 further shows the results from different numbers of state mapping hypotheses that are incorporated into the inference algorithm. As shown here, the trend of performance improvement with the increase in k is again observed. When multiple state mapping hypotheses are incorporated in inference, the learned AoG is capable of compensating for uncertainties in vision processing and producing better parses for unseen visual demonstrations.

Conclusion and Future Work
This paper presents an approach to task learning where an agent can learn a grounded task model from human demonstrations and language instructions. A key innovation of this work is grounding language to a perceived structure of state changes based on the AoG representation. Once the task model is acquired, it can be used as a basis to support collaboration and communication between humans and agents/robots. Using cloth-folding as an example, our empirical results have demonstrated that tightly integrating language with vision can effectively produce task structures in AoG that generalize well to new demonstrations. Although we have only made an initial attempt on a small task, our approach can be naturally extended to more complex tasks such as assembling and cooking. Both the AoG representation and the task learning approach are general and applicable to different domains. What needs to be adapted is the representation of the visual states and the computer vision algorithms that detect these states for a specific task.
⁶ By further investigating the learned AoG under the two different settings, we found that the nonterminal nodes learned from the tight language integration setting are more likely to acquire a linguistic label (33%) than the nonterminal nodes learned from the loose setting (18%).
Grounding language to a structure of perceived state changes will provide an important stepping stone towards integrating language, perception, and action for human-robot communication and collaboration. Currently, our algorithms learn the task structures based on offline parallel data. Our future work will explore incremental learning through human-agent dialogue to acquire grounded task structures.