Unsupervised Online Grounding of Natural Language during Human-Robot Interactions

Allowing humans to communicate with robots through natural language requires connections between words and percepts. The process of creating these connections is called symbol grounding and has been studied for nearly three decades. Although many studies have been conducted, few have considered the grounding of synonyms, and the employed algorithms either work only offline or in a supervised manner. In this paper, a cross-situational learning based grounding framework is proposed that allows grounding of words and phrases through corresponding percepts online and without human supervision, i.e. it does not require any explicit training phase, but instead updates the obtained mappings for every newly encountered situation. The proposed framework is evaluated through an interaction experiment between a human tutor and a robot, and compared to an existing unsupervised grounding framework. The results show that the proposed framework is able to ground words through their corresponding percepts online and in an unsupervised manner, while outperforming the baseline framework.


Introduction
An increasing number of service robots are employed in complex human-centered environments and interact with humans on a regular basis. This creates a need for robots that are able to understand instructions provided in natural language, such as "bring a glass of water" or "pick up a box", and to execute them appropriately, thereby enabling efficient collaboration with humans. To this end, connections between words, i.e. abstract symbols, and their corresponding percepts, i.e. meanings, need to be created because, according to the "Symbol Grounding Problem" proposed by Harnad (1990), abstract knowledge and language only have meaning if they are linked to the physical world through mappings from words to corresponding percepts. Grounding approaches can in general be separated into supervised and unsupervised approaches. The former utilize the guidance of a human tutor, while the latter do not require any supervision and try to use co-occurrence information to identify through which percepts a word is grounded. Previous studies that investigated unsupervised grounding, such as (Kollar et al., 2010; Tellex et al., 2011), employed algorithms that only work offline, i.e. these algorithms need to be trained before deployment with perceptual data and words collected in advance, which prevents them from being used in real-time human-robot interactions. Additionally, most previous studies did not consider ambiguous words, although the sentences humans produce are often ambiguous due to homonymy, i.e. one word refers to several percepts, and synonymy, i.e. one percept can be referred to by several different words. The latter do not need to be true synonyms, i.e. words that refer to the exact same meaning. Instead, words only need to be synonyms as references to a percept in a particular set of situations, e.g. coca cola or lemonade instead of bottle.
In this paper, a recently proposed unsupervised online grounding framework (Roesler and Nowé, 2019) is extended to handle real percepts obtained during human-robot interactions. More specifically, the learning framework is extended to first convert obtained percepts through clustering to an abstract representation, which is then used to ground all non-auxiliary words of the encountered natural language instructions through cross-situational learning. To investigate the ability of the used frameworks to handle synonymous words, each shape, color, and action is referred to by at least two synonymous words, which need to be mapped to their corresponding geometric characteristics, color histograms, or kinematic features of the robot joints during action execution. The grounding performance of the proposed framework is evaluated by comparing it to the grounding performance of a Bayesian grounding framework that has been used in several previous studies.
The rest of this paper is structured as follows: Sections 2 and 3 provide a brief overview of cross-situational learning and related work. Afterwards, an overview of the proposed unsupervised online grounding framework as well as the unsupervised Bayesian baseline framework is given in Sections 4 and 5. The experimental design and obtained results are described in Sections 6 and 7. Finally, Section 8 concludes the paper.

Background
Cross-situational learning (CSL) refers to the process of learning the meaning of words across multiple exposures to handle referential uncertainty. The basic idea is that a set of candidate meanings, i.e. mappings from words to percepts, can be created for every situation or context a word is used in and that the correct meaning can be obtained by determining the intersection of the sets of candidate meanings (Pinker, 1989; Fisher et al., 1994). Thus, the correct mapping between a word and its corresponding percepts, i.e. its meaning, will reliably reoccur across situations (Blythe et al., 2010; Smith and Smith, 2012). A number of experimental studies have confirmed that humans use CSL for word learning when no prior knowledge of language is available (Akhtar and Montague, 1999; Gillette et al., 1999; Smith and Yu, 2008). Since CSL requires more than one exposure to learn a word, it belongs to the group of slow-mapping mechanisms through which most words are acquired (Carey, 1978). In contrast, fast-mapping allows words to be acquired through a single exposure, but it is only used for a limited number of words and can neither be explained nor achieved through CSL (Carey and Bartlett, 1978; Vogt, 2012). Many different algorithms have been proposed to simulate CSL in humans and enable artificial agents, such as robots, to learn the meaning of words by grounding them through percepts (Section 3).
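The intersection idea described above can be illustrated with a minimal sketch. The situations, word lists, and percept labels below are hypothetical toy data, and the function name `candidate_meanings` is introduced only for illustration; it is not part of the framework proposed in this paper.

```python
def candidate_meanings(situations, word):
    """Intersect the percept sets of all situations in which `word` occurs."""
    candidates = None
    for words, percepts in situations:
        if word in words:
            if candidates is None:
                candidates = set(percepts)   # first exposure: everything is a candidate
            else:
                candidates &= set(percepts)  # later exposures: keep only shared percepts
    return candidates if candidates is not None else set()


# Hypothetical toy situations: (words of the instruction, percepts of the scene).
situations = [
    ({"lift", "red", "bottle"}, {"ACTION_LIFT", "COLOR_RED", "SHAPE_BOTTLE"}),
    ({"push", "red", "box"}, {"ACTION_PUSH", "COLOR_RED", "SHAPE_BOX"}),
    ({"lift", "blue", "box"}, {"ACTION_LIFT", "COLOR_BLUE", "SHAPE_BOX"}),
]

for word in ("red", "lift", "box", "bottle"):
    print(word, candidate_meanings(situations, word))
```

A word that has only been encountered once, such as "bottle" in this toy data, keeps all percepts of that situation as candidates, which is why CSL is a slow-mapping mechanism.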

Related Work
Grounding is used to obtain the meaning of an abstract symbol, e.g. a word, by linking it to perceptual information, i.e. the "real" world (Harnad, 1990). There exist many different approaches for grounding. She et al. (2014) grounded higher level symbols through already grounded lower level symbols with the help of a dialog system. Since the system requires a sufficiently large set of grounded lower level symbols as well as a professional tutor to answer its questions, its usefulness is limited. The need for a human tutor who knows the correct mappings also limits the applicability of the Naming Game, which allows an agent to quickly learn word-percept mappings (Steels and Loetzsch, 2012). In contrast to the previous approaches, cross-situational learning (Section 2), which assumes that one word appears several times together with the same perceptual feature vector so that a corresponding mapping can be created, does not require a human tutor for grounding (Siskind, 1996; Smith et al., 2011). Previous studies investigated the use of cross-situational learning for grounding of objects, actions, and spatial concepts (Dawson et al., 2013). In most studies, grounding was conducted offline, i.e. perceptual data and words were collected in advance, which prevents these approaches from being used in real-time human-robot interactions. In contrast to these approaches, the framework used in this study learns the correct mappings from words to percepts online while interacting with humans and does not require separate training and test phases. Additionally, the majority of employed models were not able to handle ambiguous words, although the sentences humans produce are often ambiguous due to homonymy and synonymy. One recent study showed that grounding of known synonyms does not require semantic or syntactic information and that such information can even have a negative effect, depending on the characteristics of the used information and how it is applied.
Therefore, the online grounding mechanism employed in this study uses no additional semantic or syntactic information to ground synonyms.

Grounding Framework
The employed grounding framework consists of four parts: (1) 3D object segmentation component, which segments objects into point clouds to determine their geometric characteristics and colors, (2) Action recording component, which creates action feature vectors by recording the states of several joints while the robot is executing actions, (3) Percept clustering component, which obtains an abstract representation of percepts through clustering, and (4) Cross-situational learning based grounding component, which identifies auxiliary words and maps percepts to non-auxiliary words and phrases. The inputs and outputs of the individual parts are highlighted below, and described in detail in the following subsections.
1. 3D object segmentation:
• Output: Geometric characteristics and colors of objects.
2. Action recording:
• Input: Changes of the robot's joint states during action execution.
• Output: Action feature vectors representing the executed actions.
3. Percept clustering:
• Input: Percepts, i.e. object and action feature vectors.
• Output: Cluster numbers of percepts.
4. Cross-situational learning:
• Input: Natural language instructions and cluster numbers of percepts.
• Output: Word to percept mappings.

3D Object Features
In this study, an unsupervised model-based 3D point cloud segmentation approach is used to segment objects lying in a plane into separate point clouds because it is fast, reliable and does not need much prior knowledge, such as object models or the number of regions to process (Craye et al., 2016). The applied model uses the RANSAC algorithm (Fischler and Bolles, 1981) to detect the major plane in the environment, which is a tabletop in the conducted experiment, and keeps track of it in consecutive frames. If a plane is orthogonal to the major plane and touches at least one border of the image, it is defined as a wall plane. After filtering out points that belong to the main plane or wall planes, the remaining points are voxelized and clustered into blobs representing object candidates. Blobs that are neither extremely small nor extremely large are treated as objects. Point clouds of segmented objects are characterized through Viewpoint Feature Histogram (VFH) descriptors (Rusu et al., 2010), which represent the object geometries taking into consideration the viewpoints while ignoring scale variances, and color histograms, which represent the colors of the objects. Figure (1) provides an illustrative example of the obtained 3D point cloud information.

Action Features
Action feature vectors are used to represent the dynamic characteristics of actions during execution through teleoperation. Overall, five different characteristics, which represent possible subactions, are recorded through the sensors of the robot (Toyota Motor Corporation, 2017). The used characteristics are:
1. The distance from the actual to the lowest torso position in meters.
2. The angle of the arm flex joint in radians.
3. The angle of the wrist roll joint in radians.
4. The velocity of the base.
5. The binary state of the gripper.
They are then combined into the following vector:
$$a = (a_1, a_2, a_3, a_4, a_5)$$
where $a_1$ represents the difference of the distances from the lowest torso position in meters, while $a_2$ and $a_3$ represent the differences in the angles of the arm flex and wrist roll joints in radians, respectively. The differences are calculated by subtracting the values at the beginning of the subaction from the values at the end of the subaction. $a_4$ represents the mean velocity of the base (forward/backward), and $a_5$ represents the binary gripper state. Each action is characterized through six manually defined subactions. Therefore, if an action consists of fewer than six subactions, rows with zeros are added at the end, while the duration of a subaction is not fixed because it depends on the teleoperator.
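The construction of an action representation from subaction vectors, including the zero padding to six rows, can be sketched as follows. This is only an illustration under stated assumptions: the dictionary keys, the function names, the encoding of a closed gripper as 1.0, and all numeric values are hypothetical and not taken from the robot's actual interface.

```python
MAX_SUBACTIONS = 6          # each action is described by six subaction rows
FEATURES_PER_SUBACTION = 5  # a_1 ... a_5


def subaction_features(start, end, mean_base_velocity, gripper_closed):
    """Build one row (a_1, ..., a_5) from the states at the start and end of a subaction.

    `start` and `end` are dicts with the hypothetical keys
    'torso' (meters), 'arm_flex' (radians) and 'wrist_roll' (radians).
    """
    return [
        end["torso"] - start["torso"],            # a_1: change of torso height
        end["arm_flex"] - start["arm_flex"],      # a_2: change of arm flex angle
        end["wrist_roll"] - start["wrist_roll"],  # a_3: change of wrist roll angle
        mean_base_velocity,                       # a_4: mean base velocity
        1.0 if gripper_closed else 0.0,           # a_5: binary gripper state (assumed encoding)
    ]


def action_matrix(subactions):
    """Stack subaction rows and pad with zero rows up to MAX_SUBACTIONS."""
    if len(subactions) > MAX_SUBACTIONS:
        raise ValueError("an action may consist of at most six subactions")
    rows = [list(row) for row in subactions]
    rows += [[0.0] * FEATURES_PER_SUBACTION for _ in range(MAX_SUBACTIONS - len(rows))]
    return rows


# Hypothetical two-step action: lower the arm and close the gripper, then raise the torso.
grab = subaction_features(
    {"torso": 0.0, "arm_flex": 0.0, "wrist_roll": 0.0},
    {"torso": 0.0, "arm_flex": -0.5, "wrist_roll": 0.0},
    mean_base_velocity=0.0, gripper_closed=True)
lift = subaction_features(
    {"torso": 0.0, "arm_flex": -0.5, "wrist_roll": 0.0},
    {"torso": 0.25, "arm_flex": -0.5, "wrist_roll": 0.0},
    mean_base_velocity=0.0, gripper_closed=True)
matrix = action_matrix([grab, lift])
```

Padding to a fixed number of rows gives every action a representation of the same size, which is convenient when actions are later compared by a distance-based clustering algorithm.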

Clustering of percepts
The CSL algorithm (Section 4.4) requires percepts to be converted to an abstract representation that can then be used to ground natural language. The abstract representation is obtained through clustering as proposed in (Roesler, 2019). Since it cannot be assumed that the number of clusters, i.e. the number of different percepts, is known in advance, DBSCAN, which is a density-based clustering algorithm proposed by Ester et al. (1996), is used because it determines the number of clusters automatically, while only requiring two parameters, i.e. the radius and the threshold minSamples. In each iteration, DBSCAN determines a number of core points, which are points that have more than minSamples points within radius around them (Schubert et al., 2017). All the points within radius of a core point are assigned to the same cluster as the core point. Cluster numbers are calculated for every situation prior to grounding so that they can be provided to the CSL algorithm. Recalculating them for every situation is necessary to take into account the new percepts of that situation.
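To make the clustering step concrete, the following is a minimal pure-Python sketch of DBSCAN that is re-run on all percepts seen so far whenever a new percept arrives. It is not the implementation used in this study. The two-dimensional toy percepts and the parameter values are hypothetical, and the sketch assumes the common convention that a point is a core point if at least `min_samples` points, including the point itself, lie within `radius`.

```python
import math

NOISE = -1


def region_query(points, idx, radius):
    """Indices of all points within `radius` of points[idx] (including idx itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= radius]


def dbscan(points, radius, min_samples):
    """Return one cluster number per point; NOISE marks points without a cluster."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, radius)
        if len(neighbours) < min_samples:
            labels[i] = NOISE  # may later become a border point of a cluster
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, radius)
            if len(j_neighbours) >= min_samples:  # j is a core point: expand from it
                queue.extend(j_neighbours)
    return labels


# Hypothetical 2D percepts arriving one situation after the other.
percepts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
            (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
            (10.0, 10.0)]

# Cluster numbers are recomputed for every new situation.
for t in range(1, len(percepts) + 1):
    print(t, dbscan(percepts[:t], radius=0.5, min_samples=2))
```

Because the cluster numbers are recomputed from scratch each time, a percept that is initially treated as noise can receive a cluster number later, once similar percepts have been observed.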

Cross-Situational Learning
A variety of algorithms have been developed that realize CSL in different ways, e.g. through the use of probabilistic models, to ground words through percepts in artificial agents. This section describes an online CSL algorithm for grounding of words, which was first proposed by Roesler and Nowé (2018) and recently extended with auxiliary word and phrase detection (Roesler and Nowé, 2019). Since the sentences in this study are shorter, have a much simpler structure, and show less variation than the sentences used in (Roesler and Nowé, 2019), the previous auxiliary word and phrase detection algorithms do not work. Thus, a novel auxiliary word detection algorithm (Algorithm 2) is proposed to handle the simpler sentences employed in this study, while no phrase detection is used to ensure a fair comparison with the baseline framework (Section 5), which does not have any phrase detection capabilities. The rest of this section provides an overview of the employed grounding algorithm.
For each situation, all corresponding words and percepts are given to the grounding algorithm (Algorithm 1), while the sets of grounded words (GW) and percepts (GP) are initially empty. Before the actual grounding procedure, words that are part of known phrases are combined so that they can be grounded together, and auxiliary words are automatically detected and removed (Algorithm 2). Afterwards, all possible word-percept (WP) and percept-word (PW) pairs are created, i.e. for each word and percept a set containing all percepts and words they occurred with is created, and saved together with a number indicating how often the pair occurred. The word-percept pair that occurred most often is determined and saved to the set of grounded words (GW). All other word-percept pairs that the word or percept are part of will no longer be considered for the selection of the highest word-percept pair in future iterations. This restriction is applied until all percepts have been used once for grounding.
Afterwards, if some words have not been grounded, all percepts become available again for grounding until all words have been grounded, which allows grounding of synonyms. After all words have been grounded, the same process is repeated for percept-word pairs to assign synonymous percepts to the same word. Finally, the sets of grounded words and percepts are merged.
Algorithm 1: The grounding procedure takes as input all words (W) and percepts (P) of the current situation, the sets of all previously obtained word-percept (WP) and percept-word (PW) pairs, the set of auxiliary words (AW), and the set of permanent phrases (PP), and returns the union of the sets of grounded words (GW) and percepts (GP).
Algorithm 2: The auxiliary word detection procedure takes as input the sets of word and percept occurrences (WO and PO) and the set of detected auxiliary words (AW), and returns AW.
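The word-grounding part of the procedure described above can be approximated with the following simplified sketch. It is not Algorithm 1: auxiliary word detection, phrase handling, and the second pass over percept-word pairs are omitted, ties are broken alphabetically, and percepts are also made available again whenever no candidate pair is left. All function names, words, and percept identifiers are hypothetical.

```python
from collections import Counter


def update_counts(counts, words, percepts):
    """Increase the co-occurrence count of every word-percept pair of one situation."""
    for w in words:
        for p in percepts:
            counts[(w, p)] += 1


def ground_words(counts):
    """Greedily map each word to a percept, starting with the most frequent pair."""
    words = {w for w, _ in counts}
    percepts = {p for _, p in counts}
    grounded = {}
    available = set(percepts)
    while len(grounded) < len(words):
        candidates = [(n, w, p) for (w, p), n in counts.items()
                      if w not in grounded and p in available]
        if not candidates:
            available = set(percepts)  # simplification: reset if nothing is left to pick
            continue
        _, w, p = max(candidates)      # highest count; ties broken alphabetically
        grounded[w] = p
        available.discard(p)
        if not available:
            available = set(percepts)  # all percepts used once: allow synonyms
    return grounded


# Hypothetical situations: (words, cluster identifiers of the percepts).
situations = [
    (["lift", "red", "bottle"], ["A1", "C1", "S1"]),
    (["raise", "red", "box"], ["A1", "C1", "S2"]),
    (["push", "blue", "bottle"], ["A2", "C2", "S1"]),
    (["lift", "blue", "box"], ["A1", "C2", "S2"]),
    (["push", "red", "box"], ["A2", "C1", "S2"]),
    (["raise", "blue", "bottle"], ["A1", "C2", "S1"]),
]

counts = Counter()
mappings = {}
for words, percepts in situations:
    update_counts(counts, words, percepts)  # online: one situation at a time
    mappings = ground_words(counts)         # mappings can be inspected after every step
```

In this toy data the synonyms "lift" and "raise" are both mapped to the action cluster A1, because the percept becomes available again after all percepts have been used once.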

Baseline Framework
The baseline framework consists of three parts: (1) 3D object segmentation component as described in Section (4.1), (2) Action recording component as described in Section (4.2), and (3) Bayesian learning model, which identifies auxiliary words and grounds non-auxiliary words and phrases through corresponding percepts. Since the perceptual data extraction components are the same for both frameworks, any difference in grounding performance can only be due to the different grounding algorithms, i.e. components three and four of the proposed framework (Sections 4.3 and 4.4) and component three of the baseline framework, which is described in the remainder of this section. The probabilistic learning model described in this section is based on a previously used model, since the experimental setup employed in this study (Section 6) is also based on a previously used scenario. In general, the model has been chosen as a baseline because similar models have previously been employed in similar grounding scenarios by different researchers, e.g. (Kollar et al., 2010; Tellex et al., 2011). In the model, the observed state $w_i$ represents word indices, i.e. each individual word is represented by a different integer. The observed state $s$ represents the shape of objects, more specifically their geometric characteristics expressed through VFH descriptors (Section 4.1), $c$ represents the color of objects and $a$ represents actions.

| Parameter | Definition |
| --- | --- |
| $\lambda$ | Hyperparameter of the distribution $\pi_w$ |
| $\alpha_s, \alpha_c, \alpha_a$ | Hyperparameters of the distributions $\pi_s$, $\pi_c$ and $\pi_a$ |
| $m_i$ | Modality index of each word (modality index ∈ {Shape, Color, Action, AW}) |
| $Z_s, Z_c, Z_a$ | Indices of shape, color and action distributions |
| $w_i$ | Word indices |
| $s, c, a$ | Observed states representing shapes, colors and actions |
| $\gamma$ | Hyperparameter of the distribution $\theta_{m,Z}$ |
| $\beta_s, \beta_c, \beta_a$ | Hyperparameters of the distributions $\phi_s$, $\phi_c$ and $\phi_a$ |
| $\theta_{m,Z}$ | Word distribution over modalities |

The latent variables of the Bayesian learning model are inferred using the Gibbs sampling algorithm (Geman and Geman, 1984) (Algorithm 3), which repeatedly samples from and updates the posterior distributions (Equation 2). Distributions were sampled for 100 iterations, after which convergence had been achieved.

Experimental Setup
The experimental scenario used in this study is based on a previously used scenario. The main difference is the use of an additional modality, i.e. color, which leads to slightly different sentences. During the experiment, a human tutor and an HSR robot interact in front of a table with one of the following five objects: {BOTTLE, CUP, BOX, CAR, BOOK} (Figure 1). The Human Support Robot (HSR) from Toyota, which is used for the experiment, has an omnidirectionally movable, cylindrically shaped body with one arm and gripper. It is equipped with a variety of different sensors, such as stereo and wide-angle cameras, and has 11 degrees of freedom. Each interaction follows the procedure below:
1. The human tutor places an object on the table and the robot determines the object's geometric characteristics and color to create corresponding feature vectors (Section 4.1).
2. An instruction, which describes how to manipulate the object, is given to the robot by the human tutor, e.g. "please lift up the red soda".
3. The human tutor teleoperates the robot to execute the action provided through the instruction while several kinematic characteristics are recorded and converted into an action feature vector (Section 4.2).
A total of 125 interactions were performed to record perceptual information for all combinations of employed shapes, colors, and actions. Since instruction words were selected randomly for each situation, except that words had to fit the encountered percepts, their number of occurrences in the data varies, e.g. the word "coffee" only occurs once, while the word "brown" occurs 14 times. Grounding was then performed for ten different interaction sequences, i.e. the order of the recorded situations was randomly changed, to ensure that the performance is not due to the specific order in which situations are encountered. Figure (3) shows how often each word occurred on average in all interactions as well as the training and test interactions.

Algorithm 3: Inference of the model's latent variables. In this study, the number of iterations was set to 100.

Table 3: Explanations of the employed action percepts.

| Action | Explanation |
| --- | --- |
| Lift up | The object will be grabbed and lifted up. |
| Grab | The object will be grabbed, but not displaced. |
| Push | The object will be pushed with the closed gripper without being grabbed first. |
| Pull | The object will be grabbed and moved towards the robot. |
| Move | The object will be grabbed and moved away from the robot. |
Each sentence consists of one of the following structures: "action the color shape" or "please action the color shape", where action, color, and shape are substituted by one of their corresponding words (Table 2). Each action and color can be referred to by two different words, while each shape has five corresponding words. During training and testing the obtained situations are given to the proposed and baseline frameworks. The former framework gets the situations separately one after the other, as if it is processing the data in real-time during the interaction. It first clusters the percepts of the current situation together with all previously encountered percepts to obtain abstract representations of shapes, colors and actions (Section 4.3). Afterwards, the CSL based grounding algorithm is used to ground words through the obtained cluster numbers (Section 4.4). In contrast, the baseline framework does not allow online learning and requires all sentences and corresponding percepts of the training situations to be given at once to the learning model.
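The two sentence structures described above can be illustrated with a small generator. The vocabulary below is a hypothetical placeholder and does not reproduce Table 2; likewise, the probability of prepending "please" and the function name `make_sentence` are assumptions made only for this sketch.

```python
import random

# Placeholder vocabulary: percept label -> words that may refer to it.
# The word lists are illustrative assumptions, not the contents of Table 2.
VOCABULARY = {
    "action": {"LIFT_UP": ["lift up", "raise"]},
    "color": {"RED": ["red", "crimson"]},
    "shape": {"BOTTLE": ["bottle", "soda", "coca cola", "lemonade"]},
}


def make_sentence(action, color, shape, rng):
    """Create "action the color shape", optionally preceded by "please"."""
    parts = [
        rng.choice(VOCABULARY["action"][action]),
        "the",
        rng.choice(VOCABULARY["color"][color]),
        rng.choice(VOCABULARY["shape"][shape]),
    ]
    if rng.random() < 0.5:  # assumed probability of the polite form
        parts.insert(0, "please")
    return " ".join(parts)


rng = random.Random(0)
for _ in range(3):
    print(make_sentence("LIFT_UP", "RED", "BOTTLE", rng))
```

Because the word for each percept is drawn at random, some words can end up occurring only rarely in a recorded data set, which is relevant for the train/test analysis in the next section.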

Results and Discussion
The proposed cross-situational learning framework (Section 4) is evaluated through a human-robot interaction scenario (Section 6) and the obtained grounding results are compared to the groundings achieved by an unsupervised Bayesian grounding framework (Section 5). Figure (4) shows how the mean number of correct and false mappings changes when the proposed grounding framework encounters the employed situations one after the other. It also shows that all 45 correct mappings are obtained when all 125 situations are used for training, while on average only 43 correct mappings are obtained when only 60% of the situations are used for training. The figure also illustrates the online grounding capability of the model, i.e. that it updates its mappings with every newly encountered situation, as well as its transparency, because it allows checking at any time through which percept a word is grounded at that moment. Based on the collected co-occurrence information it would also be possible to calculate a confidence score for every mapping to understand how likely it is that a false mapping disappears or a correct mapping persists. The described transparency of the proposed framework can be helpful to understand and debug responses to instructions provided by a human when the framework is used to control an artificial agent interacting with a human, especially when the responses are incorrect or inappropriate.

Figure 5: Mean grounding accuracy results and corresponding standard deviations for all modalities and both models. Additionally, the percentage of sentences for which all words were correctly grounded is shown. (a) All situations used for training and testing. (b) 60% of the situations used for training and 40% for testing.
In contrast, the baseline model requires an explicit training phase so that no corresponding figure, illustrating the number of correct and false mappings, can be created. Thus, to allow a comparison between the two models, the mappings of the proposed model are extracted after 125 and 75 situations, depending on the used train/test split. Two different train/test splits are analyzed in this study. For the first split, all situations are used for training and testing to see how well the frameworks perform when all test situations have been encountered before. For the second split, only 60% of the used situations are provided for training, while the remaining situations are used for testing. In this case, it is possible that some words never occur during training or only a limited number of times, e.g. once or twice. If a word does not occur during training, the proposed model is not able to obtain a corresponding mapping, which leads to an accuracy of 0% as shown in Figure (6c) for the words coffee and sweets, which both occur only once in the dataset and are thus only present during training or testing, but not both. The word accuracies shown in Figure (6) were calculated by dividing the number of times a word was correctly grounded by the number of times the word was encountered during testing. Similar to the proposed model, the baseline model was also not able to ground the words coffee and sweets correctly when only 60% of the situations were used for training. However, the baseline model also seems to require in general a higher minimum number of occurrences to successfully ground words, since there are many words that achieved a mean accuracy of 0% when only 60% of the situations were used for training (Figure 6d). Figure (5a) shows that the proposed model achieves perfect grounding when the same situations are provided for training and testing, which confirms that it is able to obtain all correct mappings as shown in Figure (4).
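The per-word accuracy described above, i.e. the number of correct groundings of a word divided by the number of times it was encountered during testing, can be sketched as follows. The data structures, percept labels, and the function name `word_accuracies` are hypothetical and only serve to illustrate the calculation.

```python
from collections import defaultdict


def word_accuracies(test_situations, groundings):
    """Accuracy per word: correct groundings divided by occurrences during testing.

    `test_situations` is a list of dicts mapping each word of a sentence to its
    true percept; `groundings` maps words to the percept learned during training.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for word_to_true_percept in test_situations:
        for word, true_percept in word_to_true_percept.items():
            total[word] += 1
            # A word that never occurred during training has no mapping
            # and therefore always counts as incorrectly grounded.
            if groundings.get(word) == true_percept:
                correct[word] += 1
    return {word: correct[word] / total[word] for word in total}


# Hypothetical learned mappings and test situations.
groundings = {"lift up": "A1", "red": "C1", "soda": "S1"}
test_situations = [
    {"lift up": "A1", "red": "C1", "soda": "S1"},
    {"lift up": "A1", "red": "C1", "coffee": "S2"},  # "coffee" was never seen in training
]
accuracies = word_accuracies(test_situations, groundings)
```

In this toy example the word "coffee" obtains an accuracy of 0 because it has no learned mapping, mirroring the situation of words that occur only once in the data set.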
However, if only 60% of the situations are used for training and the remaining 40% unknown situations for testing, the grounding accuracy drops for both models. For the proposed model the largest accuracy decrease is seen for auxiliary words, while still more than 95% of the obtained shape, color and action groundings are correct. For the baseline framework the largest drop in accuracy is seen for shapes, from more than 95% to less than 2%. The reason might be that each shape is referred to by five different words; if the words were equally distributed among all situations, and specifically among the training and test sets, the decrease might not be as sharp. However, Figure 3 shows that the number of occurrences is not necessarily the reason for the drop because the words bmw and narnia occurred on average 7 and 2.5 times during training, respectively, and narnia achieved an accuracy of about 5%, while the accuracy of bmw was 0% (Figure 6d). In contrast, the proposed model shows a more stable performance, since it was able to ground all non-auxiliary words that occurred at least once during training with a mean accuracy of more than 70%, while only the auxiliary word please achieved a lower mean accuracy of 30%. Overall, the evaluation shows that the proposed model outperforms the baseline model based on its auxiliary word detection and grounding accuracy. Interestingly, the performance difference is larger when only 60% of the situations are used for training, although this scenario artificially harms the proposed model by preventing it from learning during testing, since it does not require explicit training. In addition to the better grounding performance, the proposed model is also more transparent, which becomes important when robots are interacting with humans in complex and unrestricted environments, especially if some actions of the robots can cause harm to humans.

Figure 6: (a) Proposed model using all situations for training and testing. (b) Baseline model using all situations for training and testing. (c) Proposed model using 60% of the situations for training and 40% for testing. (d) Baseline model using 60% of the situations for training and 40% for testing.

Conclusions and Future Work
This paper investigated a multimodal framework for grounding synonymous shape, color and action words through the visual perception and proprioception of a robot during its interaction with a human tutor. The cross-situational learning model was set up to learn the meaning of shape and color words of objects as well as action words using geometric characteristics and color information of objects obtained from point cloud information as well as kinematic features of the robot joints recorded during action execution. The proposed model allowed auxiliary word detection and online grounding of synonyms through real percepts in an unsupervised manner and without the use of any syntactic or semantic information. Additionally, it outperformed the baseline model based on the accuracy of the obtained groundings, its capability to process new situations online and its transparency. In future work, different mechanisms will be investigated to improve the sample efficiency of the algorithm, which will become relevant, if a larger number of words is used or words occur less often. Additionally, it will be verified whether the framework can handle homonyms. Finally, supervised grounding methods will be integrated so that the robot is able to use human feedback, but does not require it.