Commonsense inference in human-robot communication

Natural language communication between machines and humans is still constrained. This article addresses a gap in natural language understanding of actions, specifically the understanding of commands. We propose a new method for commonsense inference (grounding) of high-level natural language commands into specific action commands for further execution by a robotic system. The method allows building a knowledge base that consists of a large set of commonsense inferences. Preliminary results are presented.


Introduction
Significant progress has been made in the movement from early natural language understanding programs like SHRDLU (Winograd, 1972), with its deterministic actions in a virtual world, to modern cognitive robots operating in the physical world and mapping language to actions. Artificial agents are entering our lives, and the end users of such systems are not technical experts. The only way for them to communicate with AI is to use natural language. For example, humans can give a natural language command expecting a follow-up action by the agent.
Nowadays in robotics, in order to execute a natural language command, which is considered a high-level instruction, an agent needs to transform it into a sequence of lower-level primitive actions (Figure 1). For example, the industrial arm SCHUNK has three primitives: open-gripper, close-gripper, and move-to, and for this agent any high-level command must be transformed into a sequence of these three actions (Kress-Gazit et al., 2008). For smarter agents with more primitives, complicated commands like fill up the cup with water can be executed by transformation into a long sequence of lower-level actions: pick up the cup, move to your left, put the cup under the faucet, turn on the faucet, turn off the faucet, etc. In other words, natural language command decomposition is a necessary step for an agent to be capable of execution.
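This decomposition can be sketched as a simple lookup from high-level commands to primitive-action sequences. The table below is a hypothetical illustration, not the paper's system; only the three SCHUNK-style primitive names come from the text.

```python
# A minimal sketch of command decomposition, assuming an agent with the three
# SCHUNK-style primitives named in the text. The decomposition table itself
# is hypothetical and its arguments are omitted for brevity.
PRIMITIVES = {"open-gripper", "close-gripper", "move-to"}

DECOMPOSITIONS = {
    "pick up the cup": ["move-to", "open-gripper", "move-to", "close-gripper"],
}

def decompose(command: str) -> list[str]:
    """Return the primitive-action sequence for a known high-level command."""
    try:
        plan = DECOMPOSITIONS[command]
    except KeyError:
        raise ValueError(f"no decomposition known for: {command!r}")
    # every step of the plan must be a primitive the agent actually has
    assert all(p in PRIMITIVES for p in plan)
    return plan

print(decompose("pick up the cup"))
```

The point of the sketch is the shape of the mapping: a single high-level command unfolds into an ordered sequence drawn from a small, fixed primitive set.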
To make such transformations possible, previous works (Misra et al., 2015; She and Chai, 2016) explicitly model verbs with predicates describing the resulting states of actions. Their empirical evaluations have demonstrated how incorporating result states into verb representations can link language with underlying planning modules for robotic systems (Gao et al., 2016). Recent investigations use reinforcement learning to transform language commands into primitive actions (Misra et al., 2017) or representations of actions (Arumugam et al., 2017).
Current studies in human-robot communication (She and Chai, 2017; Chai et al., 2018) show that natural language understanding of commands is difficult for machines because commands in human-human communication are usually expressed through a desired change of state.

Problem Statement
As Rappaport Hovav and Levin (2010) pointed out, any action can be expressed in two different ways. Firstly, there are manner verbs that describe how actions are carried out, i.e. manners of doing: hit, stab, scrub, sweep, wipe, yell, etc. Secondly, there are verbs that describe the results of an action or a change of state: break, clean, crush, destroy, shatter, etc.
Further, we will use the term "action verb" as a synonym for a manner verb and the term "result verb" as a synonym for a verb that describes the result of an action or a change of state. For commands in human-human communication, people mostly use result verbs: we say open the door, not push the door; clean the table, not wipe the table.
It should be underlined that result verbs do not express any concrete action. For instance, the command open the door represents a particular kind of change of state in an entity but is silent about how this change comes about. The verb clean does not indicate whether the cleaning was done by sweeping, wiping, washing, or sucking; in the same way, the verb kill does not indicate how the killing was done 1 .
On the contrary, the action verbs in the commands pull the door, push the door, kick the door, etc. represent different kinds of action necessary to implement the change of state open the door.
An obvious question arises: if a command is expressed through a desired change of state, how do humans know which actions to perform? The point is that humans derive information about the concrete actions related to the desired change from shared background knowledge about the world; there is no need to represent it explicitly in human communication. It is commonsense knowledge that enables us to understand each other (Clark, 1996; Tomasello, 2008) and to know how to open the door or how to clean the table (see Figure 2).
AI systems, even new generations of cognitive agents, have significantly less knowledge about the world and are not able to ground result-verb commands into action-verb commands. A command with a result verb does not give AI any information about what actions should be performed to achieve the desired change of state. As a result, commands to robots are directly linked to the primitive actions implemented by the robot, without the intermediate step of identifying them with action verbs (see Figure 1).
The straightforward approach "command → primitive actions" fails on two significant points. First, a result verb applied to the same object can be executed by different action-verb commands. For instance, the command with the result verb fill up (the cup with water) can be executed by the action verb pour (water into the cup) or by the action verb scoop (water from the bucket).

1 The separation of verbs into action verbs and result verbs received further elaboration in cognitive science, where an event representation is considered to be based on a two-vector structure model: a force vector representing the cause of a change and a result vector representing a change in object properties (Gardenfors, 2017). It is argued that this framework gives a cognitive explanation for manner verbs as force vectors and for result verbs as result vectors.
Second, a result verb applied to different objects assumes different action verbs. The general problem of overcoming the gap in human-robot natural language understanding, applied to high-level natural language commands, can be formulated the following way: how can AI systems transform high-level natural language commands with result verbs into commands with action verbs 2 ?

Related Work
Although commonsense inference between action verbs and result verbs has been described in linguistic studies (Rappaport Hovav and Levin, 2010), there is still a lack of a detailed account of the potential causality that could be denoted by an action verb (Gao et al., 2016).
From the AI domain, there were investigations devoted to learning the physics of the world from videos (Fire and Zhu, 2016) and simulations (Wu et al., 2017). However, except for a few works that explored the physical properties of verbs (Forbes and Choi, 2017;Zellers and Choi, 2017), how verbs and their corresponding actions affect the state of the physical world is still largely underexplored.
Well-known knowledge bases such as Freebase, YAGO, or DBpedia, even when automatically populated by modern NLP methods, do not contain the commonsense inferences we aim to create.
Crowd-sourced resources such as ConceptNet suffer from incomplete coverage, which is their main drawback. A human contributor may not list all possible events related to a particular action verb or result verb. For example, the inference scrub → clean might be listed while others, such as mop → clean, suck → clean, or sweep → clean, might be missed.
Existing linguistic resources such as PropBank, FrameNet, and VerbNet provide important information about verb classification, arguments, and semantic roles, but they do not distinguish action verbs from result verbs. For instance, in the largest domain-independent computational verb lexicon, VerbNet (Kipper Schuler, 2005), which provides semantic role representations for 6394 verbs (version 3.2b), the action verb hit and the result verb break have the same structure: [Agent, Instrument, Patient, Result]. Even if the semantic representation for a verb indicates that a change of state is involved, it does not provide the specifics associated with the verb's meaning (e.g., to what attribute of its patient the changes might occur) (Gao et al., 2016).
WordNet, manually created by professional linguists, is, to the best of our knowledge, the only linguistic resource that partly provides information about causal links between action verbs and result verbs. As we show below, these links overlap with the hypernym-hyponym relations in WordNet.
Finally, the broad-coverage resource VerbOcean (Chklovski and Pantel, 2004) established a semantic relation "enablement" between verbs using the following four patterns: "Xed * by Ying the"; "Xed * by Ying or"; "to X * by Ying the"; and "to X * by Ying or", where "X" and "Y" are verbs and (*) matches any single word. These patterns are similar to the one we are going to use. The only significant difference is that none of them includes a noun after the verb "X". As mentioned in the Problem Statement (2nd point), a result verb applied to different objects assumes different action verbs.
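The four VerbOcean surface patterns can be rendered as simple regular expressions. The sketch below is our own rendering of the pattern shapes as quoted above, not VerbOcean's actual implementation; gerund formation is naively done by appending "ing".

```python
import re

def enablement_patterns(x: str, y: str) -> list:
    """The four 'enablement' surface patterns, with X and Y as verb stems.
    The (*) wildcard is modeled as a single word; inflection is naive (+ed/+ing)."""
    w = r"\w+"  # (*): matches any single word
    return [re.compile(p, re.IGNORECASE) for p in (
        rf"\b{x}ed {w} by {y}ing the\b",
        rf"\b{x}ed {w} by {y}ing or\b",
        rf"\bto {x} {w} by {y}ing the\b",
        rf"\bto {x} {w} by {y}ing or\b",
    )]

pats = enablement_patterns("open", "press")
print(any(p.search("He opened it by pressing the button.") for p in pats))
```

Note that, exactly as the text observes, nothing in these patterns constrains the noun after "X", so open the door and open the jar would be conflated.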

Proposed Approach
We consider the transformation formulated in the Problem Statement as a process of grounding, where a high-level command representing a desired change of state is grounded into an action command (or commands).
The following two assumptions will be made to formalize the process of grounding.
1) Commands in human-robot interaction can occur in various forms and patterns, some of them rather complicated. Our work addresses the simplest case, where a command is represented by the structure V+NP, where V is a verb and NP is a noun phrase.
2) The grounding of a result-verb command into an action-verb command is represented as Vr+NP1+by+Va+NP2, where Vr is a result verb and Va is an action verb 3 .
Since a result verb applied to the same object can be executed by different action-verb commands, the schema in Figure 2 unfolds into one-to-multi relations between a result-verb command and action-verb commands (see Figure 3).
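The grounding structure Vr+NP1+by+Va+NP2 and its one-to-multi unfolding can be sketched as a small data type; the field names below are our own, and the example groundings are illustrative.

```python
from dataclasses import dataclass

# A sketch of the grounding structure Vr+NP1+by+Va+NP2 from the text.
@dataclass(frozen=True)
class Grounding:
    result_verb: str   # Vr
    np1: str           # NP1
    action_verb: str   # Va
    np2: str           # NP2

    def __str__(self) -> str:
        # naive gerund formation: action verb + "ing"
        return f"{self.result_verb} {self.np1} by {self.action_verb}ing {self.np2}"

# One-to-multi relations: one result-verb command, several action-verb groundings.
groundings = [
    Grounding("open", "the door", "press", "the button"),
    Grounding("open", "the door", "pull", "the handle"),
]
for g in groundings:
    print(g)
```

The same (Vr, NP1) pair appearing with several (Va, NP2) pairs is exactly the one-to-multi relation that the extraction method below targets.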
The key point here is how to extract these one-to-multi relations. In reality, these relations are commonsense inferences that allow humans to easily transform result-verb commands into action-verb commands. These commonsense inferences are so obvious and so well known to everybody that they are very rarely expressed anywhere in written form, which makes them hard to find and extract from any source of information. As a consequence, we cannot apply deep learning techniques to the extraction of the above-mentioned one-to-multi relations. Deep learning has proved incredibly powerful and effective for many practical tasks, from perceptual classification to self-driving cars, but we have to acknowledge the data-hungry nature of systems based on it. The side effect is a long tail of low-frequency data that cannot be treated the same way; our research deals with such data.
The method suggested for one-to-multi relation extraction is based on three unrelated approaches and accordingly includes three steps:

1. Getting two sets of verbs: a set of result verbs {Vr} and a set of action verbs {Va};
2. Getting a set of the most frequent pairs {Vr+NP};
3. Getting a set of commonsense inferences {Vr+NP1+by+Va+NP2}.
In the first step, result verbs (Vr) and action verbs (Va) are separated. The separation is based on an analysis of WordNet; this is a domain-independent step that aims to cover, in general, the result and action verbs representing the physical world. In the second step, the set of the most frequent pairs {Vr+NP} is extracted using an N-gram approach to form result-verb commands: clean the floor, cool the beer, etc. In the third step, we use a search engine to check across the web whether there is a commonsense inference between an action-verb command and a result-verb command (open the door by pressing the button). If a commonsense inference exists on the web, it is considered validated and added to the set.
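The three steps can be sketched as a small pipeline. The sketch below uses toy data and stubs the web validation of step 3 with a precomputed set; function and variable names are our own.

```python
# A runnable sketch of the three-step method on toy data. Step 1's output
# (the verb sets) is given directly; step 3's web validation is stubbed.
def step2_frequent_pairs(ngram_counts, result_verbs, threshold=100):
    """Step 2: keep the most frequent Vr+NP pairs (first word is a result verb)."""
    return [(v, np) for (v, np), freq in ngram_counts.items()
            if v in result_verbs and freq >= threshold]

def step3_inferences(pairs, action_verbs, validated):
    """Step 3: Cartesian product of {Vr+NP} x {Va}, kept only if web-validated."""
    return [(v, np, a) for (v, np) in pairs for a in sorted(action_verbs)
            if (v, np, a) in validated]

result_verbs = {"open", "clean"}          # {Vr}, from step 1
action_verbs = {"press", "sweep"}         # {Va}, from step 1
ngram_counts = {("open", "the door"): 500, ("clean", "the floor"): 300,
                ("open", "the jar"): 20}
validated = {("open", "the door", "press"), ("clean", "the floor", "sweep")}

pairs = step2_frequent_pairs(ngram_counts, result_verbs)
print(step3_inferences(pairs, action_verbs, validated))
```

Each step narrows the candidate space: the frequency threshold prunes rare Vr+NP pairs before the expensive Cartesian product, and validation prunes the product down to attested inferences.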

Step #1: Getting Two Sets of Verbs
The output of step #1 is two sets of verbs: a set of action verbs {Va} and a set of result verbs {Vr}. The separation is based on an analysis of the entire set of verbs in Princeton WordNet (WN) (Fellbaum, 1998), which is widely used in a variety of tasks related to the extraction of semantic relations. The verb part of WN contains 11529 unique verbs (version WN 3.0) 4 . They are organized in verb synsets ordered mainly by troponym-hypernym hierarchical relations (Fellbaum and Miller, 1990). By definition, a hypernym is a verb with a more generalized meaning, while a troponym replaces the hypernym by indicating a manner of doing something. The closer a verb is to the bottom of a verb tree, the more specific the manners expressed by its troponyms: communicate - talk - whisper 5 .
Meanwhile, action verbs are hidden in the WN verb structure, since troponyms are not always action verbs. In some troponym-hypernym relations the verbs are in fact action verbs, as in {kill}-{drown}. However, there are no explicit ways to extract them yet.
The idea is that action verbs can be extracted from WN if at least one of four conditions applied to a verb holds 6 :

A verb in WN is an action verb if its hypernym is an action verb. In other words, once a verb is an action verb, all branches located below it consist of action verbs as well, regardless of their glosses.
The procedure of applying conditions 1-4 goes from the top verbs down to the bottom verbs. For example, we start from the top synset {change, alter, modify} (gloss: cause to change; make different; cause a transformation). It does not satisfy the 1st or the 2nd condition, so we go down one level and examine one of its troponyms: {clean, make clean} (make clean by removing dirt, filth, or unwanted substances from). It is still not an action verb synset: in the pattern from the 1st condition, "V + by [...]ing", the verb make clean is not a hypernym. On the next level there are synsets with glosses that satisfy either the 1st or the 2nd condition:
• {sweep} (clean by sweeping);
• {brush} (clean with a brush);
• {steam, steam clean} (clean by means of steaming).

4 https://wordnet.princeton.edu/documentation/wnstats7wn. The paper (McCrae et al., 2019) outlines a roadmap for adding new entries to WordNet, so the number of verbs is not fixed but increasing over time.
5 Note that these are defined on verb senses, not verbs. For example, the verb see "perceive: I see the picture" will behave differently from the verb see "understand: I see the problem".
6 These 4 conditions elaborate the approach developed in (Huminski and Zhang, 2018a,b).
So, the verbs sweep, brush, steam, and steam clean are action verbs. Applying the 3rd condition to them, one can state that all synsets located below these 3 synsets (if any) are action verb synsets as well. This framework is the basis of the procedure for action verb extraction. We implemented the procedure following conditions 1-4 and obtained the following results:
3. 1408 verb synsets have been extracted from the motion lexicographer file;
4. a total of 3063 verb synsets have been extracted, including all verb synsets located under hypernyms that are action verbs; the 3063 extracted verb synsets contain 3294 unique action verbs.
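The gloss-based conditions can be sketched on a toy fragment of the WN verb hierarchy. The sketch below checks two gloss cues from the example above ("hypernym by ...ing" and "hypernym with a ..."); a real implementation would walk the full WordNet hierarchy (e.g. via NLTK), and the toy dictionary and cue regexes are our own simplification of the paper's four conditions.

```python
import re

# Toy fragment of the WN verb hierarchy: verb -> (hypernym, gloss).
TOY_WN = {
    "clean": ("change", "make clean by removing dirt, filth, or unwanted substances from"),
    "sweep": ("clean", "clean by sweeping"),
    "brush": ("clean", "clean with a brush"),
    "steam": ("clean", "clean by means of steaming"),
}

def is_action_verb(verb: str) -> bool:
    hypernym, gloss = TOY_WN[verb]
    # cue 1 (from condition 1): the gloss contains "<hypernym> ... by ...ing"
    if re.search(rf"\b{hypernym}\b.*\bby\b.*\w+ing\b", gloss):
        return True
    # cue 2: "<hypernym> with a <instrument>", as in "clean with a brush"
    if re.search(rf"\b{hypernym}\b with a\b", gloss):
        return True
    return False

print([v for v in TOY_WN if is_action_verb(v)])
```

On this fragment, clean is correctly rejected (its hypernym change does not appear in its gloss), while sweep, brush, and steam are accepted; the hyponym-closure condition would then mark everything below them as action verbs without further gloss checks.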
All other verbs are potentially result verbs. Some restrictions also need to be applied in order to consider only the result and action verbs that are represented in the physical world and necessary for robot actions.
We will evaluate the results both intrinsically (a linguist will judge the validity) and extrinsically: for English verbs also found in Levin's English Verb Classes and Alternations (1992), we will compare our results to her classes. For example, class 10.3 "clear" verbs (clean, clear, drain, empty) are result verbs, while class 10.4.1 "wipe" verbs (bail, buff, dab, distill, dust, erase, expunge, flush, leach, lick, ...) are action verbs.

Step #2: Getting a Set of Pairs {Vr+NP}
The output of step #2 is a set of the most frequent (commonly used) pairs {Vr+NP}. This step is motivated by the observation that a result verb applied to different objects assumes different action-verb commands.
To generate the set {Vr+NP}, we use N-grams (contiguous sequences of n items from a given text) extracted from the largest publicly available, genre-balanced corpus of English: the Corpus of Contemporary American English 7 , about 430 million words in size. From these N-gram data (2-, 3-, 4-, and 5-word sequences with their frequencies), the subset of N-grams whose first word is a result verb in any grammatical form is extracted. A threshold was set on the frequency.
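The filtering of step #2 can be sketched on toy n-gram counts. The inflection table and counts below are invented for illustration; a real implementation would lemmatize properly and read the COCA n-gram files.

```python
from collections import Counter

# Toy inflection table mapping surface forms to result-verb lemmas.
RESULT_FORMS = {"clean": "clean", "cleans": "clean", "cleaned": "clean",
                "cleaning": "clean", "cool": "cool", "cooled": "cool"}

def frequent_pairs(trigram_counts: Counter, threshold: int) -> Counter:
    """Keep 3-grams whose first word is a result verb (in any form) and whose
    frequency clears the threshold; aggregate counts per (lemma, NP) pair."""
    pairs = Counter()
    for (w1, w2, w3), freq in trigram_counts.items():
        lemma = RESULT_FORMS.get(w1)
        if lemma and freq >= threshold:
            pairs[(lemma, f"{w2} {w3}")] += freq
    return pairs

toy = Counter({("cleaned", "the", "floor"): 120, ("cool", "the", "beer"): 80,
               ("clean", "the", "floor"): 200, ("cleaned", "a", "window"): 5})
print(frequent_pairs(toy, threshold=50))
```

Aggregating over grammatical forms is what lets clean the floor and cleaned the floor count as the same Vr+NP pair.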

Step #3: Getting a Set of Commonsense Inferences
The output of the final step #3 is a set of commonsense inferences between action-verb commands and result-verb commands, validated via a web search engine. The search engine is used to check (validate) whether a commonsense inference exists on the web. Each commonsense inference to be checked has the structure Vr+NP1+by+Va+NP2 (open the door by pressing the button).
The procedure is the following:
1. take the Cartesian product of the pairs {Vr+NP} and the action verbs {Va}: {(Vr+NP), Va};
2. create a sequence for each element from step 1: Vr+NP+by+Va (fill the cup by pouring);
3. run each sequence from step 2 through the search engine, looking for the sequence Vr+NP1+by+Va+NP2 (concrete object) on the web, and estimate its frequency (or record that there is no result).
4. If we do not find sufficient action-verb templates Va+NP2 (concrete object), we will use the learned combinations to learn new templates, extending the approach of Snow et al. (2006) to learning WordNet relations.
All validated commonsense inferences will be added to the set with frequencies and stored.
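The query construction and snippet validation of step #3 can be sketched as follows. The search itself is stubbed with canned snippets (a real run queries a web search engine and inspects the returned pages), and the gerund formation and regex are our own simplifications.

```python
import re

def make_query(vr: str, np: str, va: str) -> str:
    """Build the search phrase Vr+NP+by+Va. Gerund formation here is naive
    (verb + "ing"); a real system would inflect properly."""
    return f'"{vr} {np} by {va}ing"'

def validate(vr: str, np: str, va: str, snippets: list) -> int:
    """Count snippets matching the template Vr+NP+by+Va+NP2/Pronoun,
    i.e. requiring at least one word (the concrete object) after the gerund."""
    pat = re.compile(rf"\b{vr}\w* {np} by {va}ing \w+", re.IGNORECASE)
    return sum(1 for s in snippets if pat.search(s))

snippets = [
    "You can open the door by pressing the button on the wall.",
    "Open the door by pressing it gently.",
]
print(make_query("open", "the door", "press"))
print(validate("open", "the door", "press", snippets))
```

Requiring a word after the gerund is what distinguishes this template from the VerbOcean patterns discussed in Related Work, which stop at the verb "Y".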

Implementation and Preliminary Results
The flowchart (Fig. 4) shows the general approach to causal relation extraction from text. The three modules at the bottom, in grey, represent the three steps of the Proposed Approach. The details of the approach are given below.

Raw data. WordNet is used as raw data.

Algorithm of separation. To obtain preliminary results, commonly used result verbs and action verbs were taken from the linguistic literature. We extracted 12 result verbs and 50 action verbs. Result verbs: break, clean, clear, close, raise, cut, fill, heat, kill, lift, open, remove. Action verbs: blow, brush, chip, chop, clip, comb, compress, drown, flap, grab, grasp, grind, grip, hack, hammer, hit, kick, knead, lever, mow, pound, pour, press, pull, push, rinse, rub, saw, scoop, scrape, scratch, scribble, scrub, shake, shave, shoot, shovel, slap, slash, smear, soap, splash, sponge, squeeze, stab, steam, sweep, touch, wash, wipe.

N-gram approach. For each of the 12 result verbs, we extracted five 3-grams Vr+NP; each 3-gram contains the most frequent noun phrase for the corresponding verb. In total, 60 3-grams were extracted (see Table 1 for details).
Web search. The Cartesian product of the 60 3-grams and 50 action verbs produces 3000 combinations "Vr+NP by Va". We used the Bing search engine to run the template "Vr+NP by Va..."; accordingly, 3000 searches were made. The results from the first 10 web pages returned were taken and analyzed. We were looking for results corresponding to the template "Vr+NP by Va+NP/Pronoun".
Results. As a result, we obtained 497 causal relations. A sample of 20 extracted causal relations is given in Table 2.
Examples of causal relations for the 3-gram open the window are given in Table 3.

Evaluation
The evaluation was based on a sample of 100 causal relations randomly taken from the 497 extracted ones.
Due to the restrictions applied to events and to the causal relation between events, we cannot evaluate the recall of the extraction.
The precision (validity) of the extracted causal relations was evaluated by five human judges. They were given instructions to rate the causal relations by marking each relation with a number from 1 (very bad) to 5 (very good). Examples of invalid (break the ice by seeing it) and valid (opened the box by pulling on the handle) causal extractions were provided.

Simple Average
After the 5 judges gave their marks, the simple average was calculated by dividing the sum of all marks by 500 (100 relations × 5 judges). We got 3.1.

Extraction of valid causal relations
We calculated the per-relation average across judges and extracted 62 causal relations (out of the 100 sampled) with an average score greater than or equal to 3.

Analysis of invalid causal relations
The 38 causal relations with an average score lower than 3 were preliminarily analyzed to detect the reasons. We found the following:
a) bad parsing or bad POS tagging (kill the bacteria by pouring a half cup; fill the hole by pushing thousands; open the window by grabbing the opening);
b) unusual causal relations that require a context (heat the oil by pressing the palms; cut the engine by pulling both paddles);
c) meaningless causal relations (break the ice by seeing it; killing each other by slashing the rate).

Conclusion and Further Work
Commonsense inferences allow us to equip and empower cognitive robots with the ability to understand high-level natural language commands (or instructions). We presented a method for acquiring the knowledge needed to transform high-level result-verb commands into action-verb commands for further implementation as primitive actions.
In the future, to improve the results and increase the quality of retrieved actions, we plan to:
• improve the instructions for judges to decrease the deviation in evaluation;
• use better NLP tools for POS tagging and parsing;
• develop a more elaborate procedure for commonsense inferences, for example, to exclude search results with negation ("don't open the window by throwing the stone") that produce wrong commonsense inferences;
• use metrics to calculate the consistency (reliability) of the results (for example, Krippendorff's alpha coefficient);
• enlarge the set of verbs used for commonsense inferences using resources such as WordNets.