Verb Physics: Relative Physical Knowledge of Actions and Objects

Learning commonsense knowledge from natural language text is nontrivial due to reporting bias: people rarely state the obvious, e.g., “My house is bigger than me.” However, while rarely stated explicitly, this trivial everyday knowledge does influence the way people talk about the world, which provides indirect clues to reason about the world. For example, a statement like, “Tyler entered his house” implies that his house is bigger than Tyler. In this paper, we present an approach to infer relative physical knowledge of actions and objects along five dimensions (e.g., size, weight, and strength) from unstructured natural language text. We frame knowledge acquisition as joint inference over two closely related problems: learning (1) relative physical knowledge of object pairs and (2) physical implications of actions when applied to those object pairs. Empirical results demonstrate that it is possible to extract knowledge of actions and objects from language and that joint inference over different types of knowledge improves performance.


Introduction
Reading and reasoning about natural language text often requires trivial knowledge about everyday physical actions and objects. For example, given a sentence "Shanice could fit the trophy into the suitcase," we can trivially infer that the trophy must be smaller than the suitcase even though it is not stated explicitly. This reasoning requires knowledge about the action "fit," in particular the typical preconditions that need to be satisfied in order to perform the action. In addition, reasoning about the applicability of various physical actions in a given situation often requires background knowledge about objects in the world, for example, that people are usually smaller than houses, that cars generally move faster than humans walk, or that a brick probably is heavier than a feather.
In fact, the potential use of such knowledge about everyday actions and objects can go beyond language understanding and reasoning. Many open challenges in computer vision and robotics may also benefit from such knowledge, as shown in recent work that requires visual reasoning and entailment (Izadinia et al., 2015; Zhu et al., 2014). Ideally, an AI system should acquire such knowledge through direct physical interactions with the world. However, such a physically interactive system does not seem feasible in the foreseeable future.
In this paper, we present an approach to acquire trivial physical knowledge from unstructured natural language text as an alternative knowledge source. In particular, we focus on acquiring relative physical knowledge of actions and objects organized along five dimensions: size, weight, strength, rigidness, and speed. Figure 1 illustrates example knowledge of (1) relative physical relations of object pairs and (2) physical implications of actions when applied to those object pairs. While natural language text is a rich source to obtain broad knowledge about the world, compiling trivial commonsense knowledge from unstructured text is a nontrivial feat. The central challenge lies in reporting bias: people rarely state the obvious (Van Durme, 2010; Sorower et al., 2011; Gordon and Van Durme, 2013; Misra et al., 2016; Zhang et al., 2017), since it goes against Grice's conversational maxim on the quantity of information (Grice, 1975).
In this work, we demonstrate that it is possible to overcome reporting bias and still extract the unspoken knowledge from language. The key insight is this: there is consistency in the way people describe how they interact with the world, which provides vital clues to reverse engineer the common knowledge shared among people. More concretely, we frame knowledge acquisition as joint inference over two closely related puzzles: inferring relative physical knowledge about object pairs while simultaneously reasoning about physical implications of actions.
Importantly, four of the five dimensions of knowledge in our study (weight, strength, rigidness, and speed) are either not visual or not easily recognizable by image recognition using currently available computer vision techniques. Thus, our work provides unique value to complement recent attempts to acquire commonsense knowledge from web images (Izadinia et al., 2015; Bagherinezhad et al., 2016).
In sum, our contributions are threefold:
• We introduce a new task in the domain of commonsense knowledge extraction from language, focusing on the physical implications of actions and the relative physical relations among objects, organized along five dimensions.
• We propose a model that can infer relations over grounded object pairs together with first order relations implied by physical verbs.
• We develop a new dataset, VERBPHYSICS, that compiles crowdsourced knowledge of actions and objects.
The rest of the paper is organized as follows. We first provide the formal definition of knowledge we aim to learn in Section 2. We then describe our data collection in Section 3 and present our inference model in Section 4. Empirical results are given in Section 5 and discussed in Section 6. We review related work in Section 7 and conclude in Section 8.

Knowledge Dimensions
We consider five dimensions of relative physical knowledge in this work: size, weight, strength, rigidness, and speed. "Strength" in our work refers to the physical durability of an object (e.g., "diamond" is stronger than "glass"), while "rigidness" refers to the physical flexibility of an object (e.g., "glass" is more rigid than a "wire"). When considered in verb implications, size, weight, strength, and rigidness concern individual-level semantics; the relative properties implied by verbs in these dimensions are true in general. On the other hand, speed concerns stage-level semantics; its implied relations hold only during a window surrounding the verb (Carlson, 1977).

Relative physical knowledge
Let us first consider the problem of representing relative physical knowledge between two objects. We can write a single piece of knowledge like "A person is larger than a basketball" as
\[ \text{person} >^{size} \text{basketball} \]
Any propositional statement can have exceptions and counterexamples. Moreover, we need to cope with uncertainties involved in knowledge acquisition. Therefore, we assume each piece of knowledge is associated with a probability distribution.
[Figure: example action frames and the relations they imply. "x threw y": x is larger than y, x is heavier than y, x is slower than y. "We walked into the house" (x walked into y; agent, goal): x is smaller than y, x is lighter than y, x is faster than y. "I squashed the bug with my boot" (squashed x with y; theme, goal): x is smaller than y, x is lighter than y, x is weaker than y, x is less rigid than y, x is slower than y.]
More formally, given objects x and y, we define a random variable $O^a_{x,y}$ whose range is {>, <, ≃} with respect to a knowledge dimension a ∈ {SIZE, WEIGHT, STRENGTH, RIGIDNESS, SPEED} so that:
\[ P(O^a_{x,y} = r) := P(x \; r^a \; y), \quad r \in \{>, <, \simeq\} \]
This immediately provides two simple properties:
\[ P(O^a_{x,y} = >) = P(O^a_{y,x} = <) \]
\[ P(O^a_{x,y} = \simeq) = P(O^a_{y,x} = \simeq) \]
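As a minimal sketch of this representation, one might store each piece of relative knowledge as a distribution over the three relations. The class name `PairBelief`, the ASCII stand-in "~" for the "roughly equal" relation, and the toy probabilities are all assumptions made for illustration; the sketch also assumes that the two properties mentioned above are the symmetry relations obtained by swapping x and y.

```python
from dataclasses import dataclass

RELATIONS = (">", "<", "~")  # "~" is an ASCII stand-in for "roughly equal"


@dataclass(frozen=True)
class PairBelief:
    """Distribution over the relation between objects x and y on one attribute."""
    x: str
    y: str
    attribute: str
    probs: tuple  # probabilities aligned with RELATIONS

    def __post_init__(self):
        if len(self.probs) != len(RELATIONS):
            raise ValueError("need one probability per relation")
        if abs(sum(self.probs) - 1.0) > 1e-9:
            raise ValueError("probabilities must sum to 1")

    def prob(self, relation: str) -> float:
        return self.probs[RELATIONS.index(relation)]

    def flipped(self) -> "PairBelief":
        """Belief about (y, x): swap the > and < mass, keep the ~ mass."""
        gt, lt, eq = self.probs
        return PairBelief(self.y, self.x, self.attribute, (lt, gt, eq))


# Toy numbers, not values from the dataset.
belief = PairBelief("person", "basketball", "size", (0.9, 0.05, 0.05))
```

Calling `belief.flipped()` gives the belief for (basketball, person), so only one ordering of each pair needs to be stored.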

Physical Implications of Verbs
Next we consider representing relative physical implications of actions applied over two objects. For example, consider an action frame "x threw y." In general, the following implications are likely to be true:
\[ \text{x threw y} \Rightarrow x >^{size} y \]
\[ \text{x threw y} \Rightarrow x >^{weight} y \]
Again, in order to cope with exceptions and uncertainties, we assume a probability distribution associated with each implication. More formally, we define a random variable $F^a_v$ to denote the implication of the action verb v when applied over its arguments x and y with respect to a knowledge dimension a so that:
\[ P(F^{size}_{threw} = r) := P(x \; r^{size} \; y \mid \text{x threw y}) \]
where the range of $F^{size}_{threw}$ is {>, <, ≃}. Intuitively, $F^{size}_{threw}$ represents the likely first order relation implied by "throw" over ungrounded (i.e., variable) object pairs.
The above definition assumes that there is only a single implication relation for any given verb with respect to a specific knowledge dimension. This is generally not true, since a verb, especially a common action verb, can often invoke a number of different frames according to frame semantics (Fillmore, 1976). Thus, given a number of different frame relations $v_1, \ldots, v_T$ associated with a verb v, we define random variables F with respect to a specific frame relation $v_t$, i.e., $F^a_{v_t}$. We use this notation going forward.
Frame Perspective on Verb Implications: Figure 2 illustrates the frame-centric view of physical implication knowledge we aim to learn. Importantly, the key insight of our work is inspired by Fillmore's original manuscript on frame semantics (Fillmore, 1976). Fillmore has argued that "frames" (the contexts in which utterances are situated) should be considered as a third primitive of describing a language, along with a grammar and lexicon. While existing frame annotations such as FrameNet (Baker et al., 1998), PropBank (Palmer et al., 2005), and VerbNet (Kipper et al., 2000) provide rich frame knowledge associated with a predicate, none of them provide the exact kind of physical implications we consider in our paper. Thus, our work can potentially contribute to these resources by investigating new approaches to automatically recover richer frame knowledge from language. In addition, our work is motivated by the formal semantics of Dowty (1991), as the task of learning verb implications is essentially that of extracting lexical entailments for verbs.

Data and Crowdsourced Knowledge
Action Verbs: We pick 50 classes of Levin verbs from both "alternation classes" and "verb classes" (Levin, 1993), which correspond to about 1100 unique verbs. We sort this list by frequency of occurrence in our frame patterns in the Google Syntax Ngrams corpus (Goldberg and Orwant, 2013) and pick the top 100 verbs.
Action Frames: Figure 2 illustrates examples of action frame relations. Because we consider implications over pairwise argument relations for each frame, there are sometimes multiple frame relations we consider for a single frame. To enumerate action frame relations for each verb, we use syntactic patterns based on dependency parses: we extract the core components (subject, verb, direct object, prepositional object) of an action, then map the subject to an agent, the direct object to a theme, and the prepositional object to a goal. For those frames that involve an argument in a prepositional phrase, we create a separate frame for each preposition based on the statistics observed in the Google Syntax Ngram corpus.
Because the Syntax Ngram corpus provides only tree snippets without context, this way of enumerating potential frame patterns tends to overgenerate. Thus we refine our prepositions for each frame by taking either the intersection or union with the top 5 Google Surface Ngrams (Michel et al., 2011), depending on whether the frame was under- or over-generating. We also add a crowdsourcing step where we ask crowd workers to judge whether a frame pattern with a particular verb and preposition could plausibly be found in a sentence. This process results in 813 frame templates, an average of 8.13 per verb.
Table 1 (caption; table body not recovered): Frames are partitioned by verb. Counts are shown for usable data, which includes only ≥ 2/3 agreement and removes all with "no relation." Each prediction task (frames or object pairs) is given 5% of that domain's data as seed. We compare models using either 5% or 20% of the other domain's data as seed.
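The role mapping and pairwise enumeration described above can be sketched as follows. This is only an illustration: the dependency labels (`nsubj`, `dobj`, `pobj`), the function name `frame_relations`, and the tuple layout are assumptions, not the authors' extraction code.

```python
from itertools import combinations

# Assumed mapping from dependency labels to frame roles, following the text.
ROLE_OF = {"nsubj": "agent", "dobj": "theme", "pobj": "goal"}


def frame_relations(verb, deps, preposition=None):
    """Return the pairwise frame relations for one extracted action.

    deps maps (hypothetical) dependency labels to argument placeholders.
    Each result is ((verb, preposition), (role_a, role_b)).
    """
    roles = [ROLE_OF[label] for label in ("nsubj", "dobj", "pobj") if label in deps]
    # The preposition only matters when a prepositional object is present.
    template = (verb, preposition) if "pobj" in deps else (verb, None)
    return [(template, pair) for pair in combinations(roles, 2)]


# "I squashed the bug with my boot" has three arguments, hence three relations.
relations = frame_relations("squashed", {"nsubj": "x", "dobj": "y", "pobj": "z"}, "with")
```

A frame with only a subject and a prepositional object, such as "x walked into y", would yield the single agent-goal relation under the same scheme.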
Object Pairs: To provide a source of ground truth relations between objects, we select the object pairs that occur in the 813 frame templates with positive pointwise mutual information (PMI) across the Google Syntax Ngram corpus. After replacing a small set of "human" nouns with a generic HUMAN object, filtering out nouns labeled as abstract by WordNet (Miller, 1995), and distilling all surface forms to their lemmas (also with WordNet), the result is 3656 object pairs.
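To make the PMI filter concrete, here is a small sketch that keeps only pairs with positive PMI. It assumes PMI is estimated from raw co-occurrence counts of the two objects; how counts are actually gathered from the Google Syntax Ngram corpus is not specified here, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter


def positive_pmi_pairs(cooccurrences):
    """Keep object pairs whose PMI is positive.

    cooccurrences: iterable of (x, y) tuples, one per observed co-occurrence.
    """
    pair_counts = Counter(cooccurrences)
    total = sum(pair_counts.values())
    x_counts, y_counts = Counter(), Counter()
    for (x, y), n in pair_counts.items():
        x_counts[x] += n
        y_counts[y] += n

    kept = {}
    for (x, y), n in pair_counts.items():
        # PMI = log( p(x, y) / (p(x) * p(y)) ), with probabilities from counts.
        pmi = math.log((n * total) / (x_counts[x] * y_counts[y]))
        if pmi > 0:
            kept[(x, y)] = pmi
    return kept


# Toy counts, not corpus statistics.
observations = (
    [("person", "ball")] * 4 + [("person", "house")]
    + [("dog", "house")] * 4 + [("dog", "ball")]
)
kept = positive_pmi_pairs(observations)
```

In this toy data, only the pairs that co-occur more often than chance would predict survive the filter.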

Crowdsourcing Knowledge
We collect human judgements of the frame knowledge implications to use as a small set of seed knowledge (5%), a development set (45%), and a test set (50%). Crowd workers are given a frame template such as "x threw y," and then asked to list a few plausible objects (including people and animals) for the missing slots (e.g., x and y). We then ask them to rate the general relationship that the arguments of the frame exhibit with respect to all knowledge dimensions (size, weight, etc.). For each knowledge dimension, or attribute, a, workers select an answer from (1) $x >^a y$, (2) $x <^a y$, (3) $x \simeq^a y$, or (4) no general relation. We conduct a similar crowdsourcing step for the set of object pairs. We ask crowd workers to compare each of the 3656 object pairs along the five knowledge dimensions we consider, selecting an answer from the same options as with frames. We reserve 50% of the data as a test set, and split the remainder up either 5% / 45% or 20% / 30% (seed / development) to investigate the effects of different seed knowledge sizes on the model.
Statistics for the dataset are provided in Table 1. About 90% of the frames as well as object pairs had 2/3 agreement between workers. After removing frame/attribute combinations and object pairs that received less than 2/3 agreement, or were selected by at least 2/3 workers to have no relation, we end up with roughly 400-600 usable frames and 2100-2500 usable object pairs per attribute.
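The filtering rule above can be sketched in a few lines. The label strings (with "none" standing for "no general relation" and "~" for the "roughly equal" option) and the function name are assumptions made for this illustration.

```python
from collections import Counter


def usable_label(votes, min_agreement=2 / 3):
    """Return the agreed label, or None if the item should be dropped.

    votes: non-empty list of worker answers for one frame or object pair
    on one attribute.
    """
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) < min_agreement:
        return None  # workers disagree too much
    if label == "none":
        return None  # workers agree that there is no general relation
    return label
```

For example, three votes of `[">", ">", "<"]` keep the label `">"`, while three different votes, or a majority of "none", drop the item.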

Model
We model knowledge acquisition as probabilistic inference over a factor graph of knowledge. As shown in Figure 3, the graph consists of multiple substrates (page-wide boxes) corresponding to different knowledge dimensions (only three of them, strength, size, and weight, are shown for brevity). Each substrate consists of two types of sub-graphs: verb subgraphs and object subgraphs, which are connected through factors that quantify action-object compatibilities. Connecting across substrates are factors that model inter-dependencies across different knowledge dimensions. In what follows, we describe each graph component.

Nodes
The factor graph contains two types of nodes in order to capture two classes of knowledge. The first type of nodes are object pair nodes. Each object pair node is a random variable $O^a_{x,y}$ which captures the relative strength of an attribute a between objects x and y.
The second type of nodes are frame nodes. Each frame node is a random variable $F^a_{v_t}$. This corresponds to the verb v used in a particular type of frame t, and captures the implied knowledge the frame $v_t$ holds along an attribute a.
All random variables take on the values {>, <, ≃}. For an object pair node $O^a_{x,y}$, the value represents the belief about the relation between x and y along the attribute a. For a frame node $F^a_{v_t}$, the value represents the belief about the relation along the attribute a between any two objects that might be used in the frame $v_t$.
We denote the sets of all object pair and frame random variables O and F, respectively.

Action-Object Compatibility
The key aspect of our work is to reason about two types of knowledge simultaneously: relative knowledge of grounded object pairs, and implications of actions related to those objects. Thus we connect the verb subgraphs and object subgraphs through selectional preference factors $\psi_s$ between two such nodes $O^a_{x,y}$ and $F^a_{v_t}$ if we find evidence from text that suggests objects x and y are used in the frame $v_t$. These factors encourage both random variables to agree on the same value.
As an example, consider a node $O^{size}_{p,b}$ which represents the relative size of a person and a basketball, and a node $F^{size}_{threw_{dobj}}$ which represents the relative size implied by an "x threw y" frame. If we find significant evidence in text that "[person] threw [basketball]" occurs, we would add a selectional preference factor to connect $O^{size}_{p,b}$ with $F^{size}_{threw_{dobj}}$ and encourage them towards the same value. This means that if it is discovered that people are larger than basketballs (the value >), then we would expect the frame "x threw y" to entail $x >^{size} y$ (also the value >).
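One way to picture how these connections are chosen is the sketch below, which links an object pair node and a frame node on every attribute whenever a text-evidence score exceeds a threshold. The input format, the node tuples, and the use of a single score per (frame, object pair) are assumptions for illustration only.

```python
ATTRIBUTES = ("size", "weight", "strength", "rigidness", "speed")


def selectional_preference_edges(evidence, threshold, attributes=ATTRIBUTES):
    """Return (object_node, frame_node) pairs to join with a factor.

    evidence maps (frame, (x, y)) to a score such as PMI (hypothetical input).
    """
    edges = []
    for (frame, (x, y)), score in sorted(evidence.items()):
        if score <= threshold:
            continue  # not enough evidence that x and y fill this frame
        for a in attributes:
            edges.append((("O", a, x, y), ("F", a, frame)))
    return edges


# Toy scores, not corpus statistics.
scores = {
    ("threw_dobj", ("person", "basketball")): 2.1,
    ("threw_dobj", ("person", "idea")): -0.3,
}
edges = selectional_preference_edges(scores, threshold=0.0)
```

Each returned edge would then carry an agreement potential that nudges the two variables toward the same value.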

Semantic Similarities
Some frames have relatively sparse textual evidence to support their corresponding knowledge acquisition. Thus, we include several types of factors based on semantic similarities as described below.
Cross-Verb Frame Similarity: We add a group of factors $\psi_v$ between two verbs v and u (to connect a specific frame of v with a corresponding frame of u) based on the verb-level similarities.
Within-Verb Frame Similarity: Within each verb v, which consists of a set of frame relations $v_1, \ldots, v_T$, we also include frame-level similarity factors $\psi_f$ between $v_i$ and $v_j$. This gives us more evidence over a broader range of frames when textual evidence might be sparse.
Figure 3 (caption, beginning not recovered): … (left), is improved by modeling their interplay (orange). Unary seed ($\psi_{seed}$) and embedding ($\psi_{emb}$) factors are omitted for clarity.
Object Similarity: As with verbs, we add factors $\psi_o$ that encourage similar pairs of objects to take the same value. Given that each node represents a pair of objects, finding that x and y are similar yields two main cases in how to add factors (aside from the trivial case where the variable $O^a_{x,y}$ is given a unary factor to encourage the value ≃).

Cross-Knowledge Correlation
Some knowledge dimensions, such as size and weight, have a significant correlation in their implied relations. For two such attributes a and b, if the same frame $v_i$ has nodes $F^a_{v_i}$ and $F^b_{v_i}$ in both graphs, we add a factor $\psi_a$ between them to push them towards taking the same value.

Seed Knowledge
In order to kick off learning, we provide a small set of seed knowledge among the random variables in {O, F} with seed factors $\psi_{seed}$. Each unary seed factor pushes the belief for its associated random variable strongly towards the seed label.

Potential Functions
Unary Factors: For all frame and object pair random variables in the training set, we train a maximum entropy classifier to predict the value of the variable. We then use the probabilities of the classifier as potentials for seed factors given to all random variables in their class (frame or object pair). Each log-linear classifier is trained separately per attribute on a featurized vector of the variable. The feature function f is defined differently according to the node type:
\[ f(O^a_{p,q}) := [g(p), g(q)] \]
\[ f(F^a_{v_t}) := [h(t), g(v), g(t)] \]
Here g(x) is the GloVe word embedding (Pennington et al., 2014) for the word x (t is the frame relation's preposition, and g(t) is simply set to the zero vector if there is no preposition) and h(t) is a one-hot vector of the frame relation type. We use GloVe vectors of 100 dimensions for verbs and 50 dimensions for objects and prepositions (dimensions chosen based on the development set).
Table 2: Accuracy of baselines and our model on both tasks. Top: frame prediction task; bottom: object pair prediction task. In both tasks 5% of in-domain data (frames or object pairs, respectively) are available as seed data. We compare providing the other type of data (object pairs or frames, respectively) as seed knowledge, trying 5% (OUR MODEL (A)) and 20% (OUR MODEL (B)).
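The feature functions can be illustrated with a small sketch that concatenates embeddings and scores them with a log-linear (softmax) model. Everything concrete here is an assumption: the tiny toy vectors stand in for GloVe embeddings, the frame type inventory is hypothetical, and the softmax form is the standard maximum entropy parameterization rather than a detail confirmed by the text.

```python
import math


def features_object_pair(g, p, q):
    """f(O_{p,q}) = [g(p), g(q)]: concatenate the two object embeddings."""
    return g[p] + g[q]


def features_frame(g, frame_types, frame_type, verb, preposition, prep_dim):
    """f(F_{v_t}) = [h(t), g(v), g(t)] with a zero vector when no preposition."""
    one_hot = [1.0 if t == frame_type else 0.0 for t in frame_types]
    prep_vec = g[preposition] if preposition is not None else [0.0] * prep_dim
    return one_hot + g[verb] + prep_vec


def maxent_probs(weights, feats):
    """Softmax over one linear score per relation value (one weight row each)."""
    scores = [sum(w * f for w, f in zip(row, feats)) for row in weights]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings; real ones would be 50- or 100-dimensional GloVe vectors.
g = {"person": [1.0, 0.0], "ball": [0.0, 1.0], "threw": [0.5, 0.5, 0.5]}
pair_feats = features_object_pair(g, "person", "ball")
frame_feats = features_frame(
    g, ["agent_theme", "agent_goal", "theme_goal"], "agent_theme", "threw", None, 2
)
probs = maxent_probs([[0.0] * len(pair_feats)] * 3, pair_feats)
```

With all-zero weights the classifier is uninformative and returns a uniform distribution; trained weights would shift mass toward one of the three values.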

Binary Factors: In the case of all other factors, we use a "soft 1" agreement matrix with strong signal down the diagonal.
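A "soft 1" agreement matrix of this kind can be sketched as below. The specific values are hypothetical (the diagonal weight of 0.7 is chosen only for illustration); the point is that matching values receive most of the mass while mismatches keep a small, nonzero potential.

```python
def soft_agreement_matrix(n_values=3, diagonal=0.7):
    """Build an n x n potential with `diagonal` on matching values.

    The remaining mass is spread evenly over mismatching values, so each
    row sums to 1. The default diagonal weight is an illustrative guess.
    """
    off = (1.0 - diagonal) / (n_values - 1)
    return [
        [diagonal if i == j else off for j in range(n_values)]
        for i in range(n_values)
    ]


agreement = soft_agreement_matrix()
```

Because no entry is zero, strong evidence elsewhere in the graph can still overrule a single agreement factor.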

Inference
After our full graph is constructed, we use belief propagation to infer the assignments of frames and object pairs not in our training data. Each message µ is a vector where each element is the probability that a random variable takes on each value x ∈ {>, <, ≃}. A message passed from a random variable v to a neighboring factor f about the value x is the product of the messages from its other neighboring factors about x:
\[ \mu_{v \to f}(x) = \prod_{f' \in N(v) \setminus \{f\}} \mu_{f' \to v}(x) \]
where N(·) denotes the neighbors of a node. A message passed from a factor f with potential ψ to a random variable v about its value x is a marginalized belief about v taking value x from the other neighboring random variables combined with its potential:
\[ \mu_{f \to v}(x) = \sum_{\mathbf{x}' : x'_v = x} \psi(\mathbf{x}') \prod_{v' \in N(f) \setminus \{v\}} \mu_{v' \to f}(x'_{v'}) \]
After stopping belief propagation, the marginals for a node can be computed and used as a decision for that random variable. The marginal for v taking value x is the product of its surrounding factors' messages:
\[ P(v = x) \propto \prod_{f \in N(v)} \mu_{f \to v}(x) \]
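The message passing described above can be sketched for the special case of unary and pairwise factors. This is a generic loopy sum-product routine with a synchronous update schedule and a fixed iteration count, both of which are assumptions; the node names, potentials, and the ASCII "~" for the "roughly equal" value are toy choices for illustration.

```python
VALUES = (">", "<", "~")  # "~" is an ASCII stand-in for "roughly equal"


def normalize(vec):
    total = sum(vec)
    return [v / total for v in vec]


def run_bp(unary, pairwise, iterations=10):
    """Loopy sum-product on a graph with unary and pairwise factors.

    unary: {node: [potential per value]}; every node must appear here.
    pairwise: {(u, v): m} with m[i][j] the potential for u=i and v=j.
    Returns the normalized marginal of every node.
    """
    k = len(VALUES)
    messages = {}
    neighbours = {n: [] for n in unary}
    for (u, v) in pairwise:
        messages[(u, v)] = [1.0 / k] * k
        messages[(v, u)] = [1.0 / k] * k
        neighbours[u].append(v)
        neighbours[v].append(u)

    def potential(u, v, i, j):
        if (u, v) in pairwise:
            return pairwise[(u, v)][i][j]
        return pairwise[(v, u)][j][i]

    for _ in range(iterations):
        updated = {}
        for (u, v) in messages:
            # Variable to factor: product of u's unary potential and the
            # messages from all of u's neighbours except v.
            incoming = list(unary[u])
            for w in neighbours[u]:
                if w != v:
                    incoming = [a * b for a, b in zip(incoming, messages[(w, u)])]
            # Factor to variable: sum u out through the pairwise potential.
            out = [
                sum(potential(u, v, i, j) * incoming[i] for i in range(k))
                for j in range(k)
            ]
            updated[(u, v)] = normalize(out)
        messages = updated

    marginals = {}
    for n in unary:
        belief = list(unary[n])
        for w in neighbours[n]:
            belief = [a * b for a, b in zip(belief, messages[(w, n)])]
        marginals[n] = normalize(belief)
    return marginals


agree = [[0.7, 0.15, 0.15], [0.15, 0.7, 0.15], [0.15, 0.15, 0.7]]
marginals = run_bp(
    unary={"O_size_person_ball": [0.7, 0.15, 0.15], "F_size_threw": [1.0, 1.0, 1.0]},
    pairwise={("O_size_person_ball", "F_size_threw"): agree},
)
```

In this two-node toy graph, the seeded object pair node pulls the unseeded frame node toward ">", which is the behavior the selectional preference factors are meant to produce.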

Experimental Results
Factor Graph Construction: We first need to pick a set of frames and objects to determine our set of random variables. The frames are simply the subset of the frames that were crowdsourced in the given configuration (e.g., seed + dev), with "soft 1" unary seed factors (the gold label indexed row of the binary factor matrix) given only to those in the seed set. The same selection criteria and seed factors are applied to the crowdsourced object pairs.
For lexical similarity factors ($\psi_v$, $\psi_o$), we pick connections based on the cosine similarity scores of GloVe vectors thresholded above a value chosen based on development set performance. Attribute similarity factors ($\psi_a$) are chosen based on sets of frames that reach largely the same decisions on the seed data (95%). Frame similarity factors ($\psi_f$) are added to pairs of frames with linguistically similar constructions. Finally, selectional preference factors ($\psi_s$) are picked by using a threshold (also tuned on the development set) of pointwise mutual information (PMI) between the frames and the object pairs' occurrences in the Google Syntax Ngram corpus.
Figure 4: Example model predictions on dev set frames. The model's confidence is shown by the bars on the right. The correct relation is highlighted in orange (6-10 are failure cases for the model). If there are two blanks, the relation is between them. If there is only one blank, the relation is between PERSON and the blank. Note that ≃ receives minuscule weight because it is never the correct value for frames in the seed set.
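The thresholded cosine-similarity selection used for the lexical similarity factors can be sketched as follows. The two-dimensional toy vectors and the threshold of 0.8 are made up for illustration; in the setup described above, the vectors would be GloVe embeddings and the threshold would be tuned on the development set.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_pairs(embeddings, threshold):
    """Return word pairs whose cosine similarity exceeds the threshold."""
    words = sorted(embeddings)
    return [
        (a, b)
        for i, a in enumerate(words)
        for b in words[i + 1:]
        if cosine(embeddings[a], embeddings[b]) > threshold
    ]


# Toy vectors, not real embeddings.
toy = {"threw": [1.0, 0.1], "tossed": [0.9, 0.2], "ate": [-0.2, 1.0]}
pairs = similar_pairs(toy, threshold=0.8)
```

Each selected pair would receive a similarity factor; raising the threshold trades coverage for fewer spurious connections.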
For each task, we consider the set of factors to include in each model a hyperparameter, which is also tuned on the development set.
Baselines: Baselines include making a RANDOM choice (picking among >, <, and ≃), picking the MAJORITY label, and a maximum entropy classifier based on the embedding representations (EMB-MAXENT) defined in Section 4.6.
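The two simplest baselines are easy to sketch. The assumption here is that the majority label is estimated from the seed labels; the toy labels below are invented and only show why a skewed label distribution favors the majority baseline.

```python
import random
from collections import Counter

VALUES = (">", "<", "~")  # "~" is an ASCII stand-in for "roughly equal"


def random_baseline(n_items, rng):
    """Pick one of the three values uniformly at random for every item."""
    return [rng.choice(VALUES) for _ in range(n_items)]


def majority_baseline(seed_labels, n_items):
    """Predict the most frequent seed label for every item."""
    label = Counter(seed_labels).most_common(1)[0][0]
    return [label] * n_items


def accuracy(predicted, gold):
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


seed = ["<", "<", "<", ">"]       # toy, skewed seed labels
gold = ["<", "<", ">", "<", "<"]  # toy test labels
majority_acc = accuracy(majority_baseline(seed, len(gold)), gold)
random_acc = accuracy(random_baseline(len(gold), random.Random(0)), gold)
```

On these toy labels the majority baseline is right four times out of five, while uniform random guessing over three values would be right about a third of the time in expectation.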
Inferring Knowledge of Actions: Our first experiment is to predict knowledge implied by new frames. In this task, 5% of the frames are available as seed knowledge. We experiment with two different sets of seed knowledge for the object pair data: OUR MODEL (A) uses only 5% of the object pair data as seed, and OUR MODEL (B) uses 20%.
The full results for the baseline methods and our model are given in the upper half of Table 2. Our model outperforms the baselines on all attributes except for speed, whose highly skewed label distribution allows the majority baseline to perform well. Ablations are given in Table 3, and sample correct predictions from the development set are shown in examples 1-5 of Figure 4.
Table 3: Ablation results on size attribute for the frame prediction task on the development dataset for OUR MODEL (A) (5% of the object pairs as seed data). We find that different graph configurations improve performance for different tasks and data amounts. In this setting, frame and attribute similarity factors hindered performance.
Inferring Knowledge of Objects: Our second experiment is to predict the correct relations of new object pairs. The data for this task is the inverse of before: 5% of the object pairs are available as seed knowledge, and we experiment with both 5% (OUR MODEL (A)) and 20% (OUR MODEL (B)) frames given as seed data. Again, both are independently tuned on the development data. Results for this task are presented in the lower half of Table 2. While OUR MODEL (A) is competitive with the strongest baseline, introducing the additional frame data allows OUR MODEL (B) to reach the highest accuracy.

Discussion
Metaphorical Language: While our frame patterns are intended to capture action verbs, our templates also match senses of those verbs that can be used with abstract or metaphorical arguments, rather than directly physical ones. One example from the development set is "x contained y." While x and y can be real objects, more abstract senses of "contained" could involve y as a "forest fire" or even a "revolution." In these instances, $x >^{size} y$ is plausible as an abstract notion: if some entity can contain a revolution, we might think of that entity as "larger" or "stronger" than the revolution.
Error analysis: Examples 6-10 in Figure 4 highlight failure cases for the model. Example 6 shows a case where the comparison is nonsensical because "for" would naturally be followed by a purpose ("He drove the car for work.") or a duration ("She drove the car for hours.") rather than a concrete object whose size is measurable. Example 7 highlights an underspecified frame. One crowd worker provided the example, "PERSON stopped the fly with {the jar / a swatter}," where $\text{fly} <^{weight} \{\text{jar, swatter}\}$. However, two crowd workers provided examples like "PERSON stopped their car with the brake," where clearly $\text{car} >^{weight} \text{brake}$. This example illustrates complex underlying physics we do not model: a brake (the pedal itself) is used to stop a car, but it does so by applying significant force through a separate system. The next two examples are cases where the model nearly predicts correctly (8, e.g., "She lived at the office.") and is just clearly wrong (9, e.g., "He snipped off a locket of hair"). Example 10 demonstrates a case of polysemy where the model picks the wrong side. In the phrase, "She caught the runner in first," it is correct that $\text{she} >^{speed} \text{runner}$. However, the sense chosen by the crowd workers is that of, "She caught the baseball," where indeed $\text{she} <^{speed} \text{baseball}$.

Related work
Several works straddle the gap between IE, knowledge base completion, and learning commonsense knowledge from text. Earlier works in these areas use large amounts of text to try to extract general statements like "A THING CAN BE READABLE" (Gordon et al., 2010) and frequencies of events (Gordon and Schubert, 2012). Our work focuses on specific domains of knowledge rather than general statements or occurrence statistics, and develops a frame-centric approach to circumvent reporting bias. Other work uses a knowledge base and scores unseen tuples based on similarity to existing ones (Angeli and Manning, 2013; Li et al., 2016), or extends it by inferring new facts from unstructured text using natural language inference (Angeli and Manning, 2014). Zhang et al. (2017) predict the likelihood of entailed commonsense statements extracted from a large text corpus. In contrast to the above, our work seeks to induce several novel types of graded physical knowledge which lack existing databases.
A number of recent works combine multimodal input to learn visual attributes (Bruni et al., 2012; Silberer et al., 2013), extract commonsense knowledge from web images (Tandon et al., 2016), and overcome reporting bias (Misra et al., 2016). In contrast, we focus on natural language evidence to reason about attributes that are both in (size) and out (weight, rigidness, etc.) of the scope of computer vision. Yet other works mine numerical attributes of objects (Narisawa et al., 2013; Takamura and Tsujii, 2015; Davidov and Rappoport, 2010) and comparative knowledge from the web (Tandon et al., 2014). Our work uniquely learns verb-centric lexical entailment knowledge.
A handful of works have attempted to learn the types of knowledge we address in this work. One recent work tried to directly predict several binary attributes (such as "is large" and "is yellow") from off-the-shelf word embeddings, noting that accuracy was very low (Rubinstein et al., 2015). Another line of work addressed grounding verbs in the context of robotic tasks. One paper in this line acquires verb meanings by observing state changes in the environment (She and Chai, 2016). Another work in this line does a deep investigation of eleven verbs, modeling their physical effect via annotated images along eighteen attributes (Gao et al., 2016). These works are encouraging investigations into multimodal groundings of a small set of verbs. Our work instead grounds into a fixed set of attributes but leverages language on a broader scale to learn about more verbs in a more diverse set of frames. In this, our work can be seen as exploring predicate lexical semantics in the vein of semantic proto-roles (Dowty, 1991; Kako, 2006; Reisinger et al., 2015), but instead affording pairwise, physical relations between roles.

Conclusion
We presented a novel take on verb-centric frame semantics to learn implied physical knowledge latent in verbs. Empirical results confirm that by modeling changes in physical attributes entailed by verbs together with objects that exhibit these properties, we are able to better infer new knowledge in both domains.