Stating the Obvious: Extracting Visual Common Sense Knowledge

Obtaining common sense knowledge using current information extraction techniques is extremely challenging. In this work, we instead propose to derive simple common sense statements from fully annotated object detection corpora such as the Microsoft Common Objects in Context dataset. We show that many thousands of common sense facts can be extracted from such corpora at high quality. Furthermore, using WordNet and a novel submodular k-coverage formulation, we are able to generalize our initial set of common sense assertions to unseen objects and uncover over 400k potentially useful facts.


Introduction
How can we discover that bowls can hold broccoli, that if a knife touches a cake then a person is probably cutting cake, or that cutlery can be on dining tables? We propose to leverage the effort of computer vision researchers in creating large scale datasets for object detection and use these resources instead to extract symbolic representations of visual common sense. The knowledge we compile is physical, not commonly covered in text and more exhaustive than what people can usually produce.
Our focus is particularly on visual common sense, defined as information about spatial and functional properties of entities in the world. We propose to extract three types of knowledge from the Microsoft Common Objects in Context (MS-COCO) dataset (Lin et al., 2014), consisting of 300,000 images covering 80 object categories, with object segmentations and natural language captions. First, we find spatial relations, e.g. holds(bed, dog), from the outlines of co-occurring objects. Next, we construct entailment rules such as holds(bed, dog) ⇒ laying-on(dog, bed) by associating spatial relations with text in captions. Finally, we uncover general facts such as holds(furniture, domestic animal), applicable to object types not present in MS-COCO, by using WordNet (Miller, 1995) and a novel submodular k-coverage formulation.
Evaluations using crowdsourcing show our methods can discover many thousands of high quality explicit statements of visual common sense. While some of this knowledge could potentially be extracted from text (Vanderwende, 2005), we found that of our top 100 extracted spatial relations, e.g. holds(bed, dog), only 4 are present in some form in the AtLocation relations of the popular ConceptNet knowledge base (Speer and Havasi, 2013). This shows that the knowledge we derive provides complementary information to other more general knowledge bases. Such common sense facts have proved useful for query expansion (Kotov and Zhai, 2012; Bouchoucha et al., 2013) and could benefit entailment (Dagan et al., 2010), grounded entailment (Bowman et al., 2015), and visual recognition tasks (Zhu et al., 2014).

Related Work
Common sense knowledge has been predominantly created directly from human input or extracted from text (Lenat et al., 1990; Liu and Singh, 2004; Carlson et al., 2010). In contrast, our work is focused on visual common sense extracted from annotated images.

Figure 1: Definitions of our spatial relations between annotated objects a and b.
a touches b / b touches a: less than 10% of the pixels of a overlap with b.
a besides b / b besides a: the angle between the centroid of a and the centroid of b lies between 315° and 45°, or 135° and 225°.
a above b / b under a: the angle between the centroid of a and the centroid of b lies between 225° and 315°, or 45° and 135°.
b inside a: the entire extents of object b are inside the extents of object a.
a holds b: more than 50% of the pixels of object b overlap with the extents of object a.
Additional relations: b on a; a disconnected from b / b disconnected from a.

There has also been recent interest in the vision community in building databases of visual common sense knowledge. Efforts have focused on a small set of relations, such as similar-to or part-of (Chen et al., 2013). Webly supervised techniques (Divvala et al., 2014; Chen et al., 2013) have also been used to test whether a particular object-relation-object triplet occurs in images (Sadeghi et al., 2015). In contrast, we use seven spatial relations and allow natural language relations that represent a larger array of higher level semantics. We also leverage existing efforts on annotating large scale image datasets instead of relying on the noisy outputs of a computer vision system.

On a technical level, our methods for extracting common sense facts from images rely on Pointwise Mutual Information (PMI), analogous to rule extraction systems based on text (Lin and Pantel, 2001; Schoenmackers et al., 2010). We view objects as an analogy for words, images as documents, and object-object configurations as typed bigrams. Our methods for generalizing relations are inspired by work that predicts a class label for an image given a hierarchy of concepts (Deng et al., 2012; Ordonez et al., 2013; Ordonez et al., 2015). Yet our work is the first to deal with visual relations between pairs of concepts in the hierarchy, using a submodular formulation that maximizes coverage of subordinate categories while avoiding contradictions with the initial set of discovered common sense assertions.

Methods
We assume the availability of an object-level annotated image dataset D containing a set of images with textual descriptions. Each image is annotated with: (1) a mask or polygon outlining the extents of each object, (2) the category of each object, drawn from a set of categories V, and (3) an overall textual description of the image.
We produce three types of common sense facts, each with an associated scoring function: (1) object-object relationships implicitly encoded in the relative configurations between objects in the annotated image data, e.g. holds(bed, dog); (2) entailment relations encoded in the relationships between object-object configurations and textual descriptions, e.g. holds(bed, dog) ⇒ laying-on(dog, bed); and (3) generalized relations induced using the semantic hierarchy of concepts in WordNet, e.g. holds(furniture, domestic animal).

Mining Object-Object Relations
Our objective in this section is to score and rank a set of relations S1 = {r(o1, o2)}, where r is an object-object relation and o1, o2 ∈ V, using a function γ1 : S1 → R. First, we define a vocabulary R of object-object relations between pairs of annotated objects. Our relations are inspired by Region Connection Calculus (Randell et al., 1992) and Visual Dependency Grammar (Elliott et al., 2014; Elliott and de Vries, 2015); their definitions are given in Figure 1.
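As a concrete illustration, the relation definitions in Figure 1 can be sketched as a small classifier over object extents. This is only a sketch under our own assumptions: it uses bounding boxes rather than pixel masks, treats inside and holds as mutually exclusive with the other relations, and picks one convention for mapping the two vertical angle sectors to above/under; only the thresholds (10%, 50%, and the 45° sectors) come from Figure 1.

```python
import math

def overlap_fraction(a, b):
    # Fraction of box a's area overlapping box b; boxes are (x1, y1, x2, y2).
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    area = (a[2] - a[0]) * (a[3] - a[1])
    return ix * iy / area if area else 0.0

def contains(a, b):
    # True when the entire extents of b lie inside the extents of a.
    return a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2] and a[3] >= b[3]

def relations(a, b):
    # Relations of a to b, following the thresholds in Figure 1.
    fa, fb = overlap_fraction(a, b), overlap_fraction(b, a)
    if fa == 0 and fb == 0:
        return {"disconnected-from(a, b)"}
    if contains(a, b):
        return {"inside(b, a)"}
    if fb > 0.5:                        # >50% of b's pixels overlap a
        return {"holds(a, b)"}
    rels = set()
    if fa < 0.1:                        # <10% of a's pixels overlap b
        rels.add("touches(a, b)")
    # Angle from a's centroid to b's centroid, in [0, 360). In image
    # coordinates y grows downward; which vertical sector means "above"
    # vs. "under" is our assumption about that convention.
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    ang = math.degrees(math.atan2(by - ay, bx - ax)) % 360
    if ang < 45 or ang >= 315 or 135 <= ang < 225:
        rels.add("besides(a, b)")
    elif 45 <= ang < 135:
        rels.add("above(a, b)")
    else:
        rels.add("under(a, b)")
    return rels
```

A pixel-mask version would replace the box intersection with a count of overlapping mask pixels, but the thresholds remain the same.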
For every image, we record the instances of each of these object-object relations r(o1, o2) between all co-occurring objects in D. We use Pointwise Mutual Information (PMI) to estimate the evidence for each relationship triplet:

γ1(r(o1, o2)) = PMI(r; o1, o2) = log [ p(r, o1, o2) / ( p(o1, o2) · p(r) ) ]

We estimate these probabilities by counting object-object-relation co-occurrences using existential quantifiers over images: every image can contribute at most one count to r(o1, o2), so the results are not skewed by images containing many identical object types or taken from unusual viewpoints. In Figure 2, we provide examples of our extracted object-object relations.

Figure 2: Examples of our extracted object-object relations. The first column contains the overall 3 best and worst relations ranked by PMI; the following columns show similar results for the queries: what does a person hold?, what holds a person?, and what interacts with a frisbee?
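The existential counting and PMI ranking can be sketched as follows; the input format (one set of relation triplets per image) and the function name are our own illustration, not a released implementation.

```python
import math
from collections import Counter

def pmi_rank(images):
    # `images` holds one *set* of (relation, o1, o2) triplets per image,
    # so each image contributes at most one count per distinct triplet
    # (the existential-quantifier counting described in the text).
    n = len(images)
    triplet_c, pair_c, rel_c = Counter(), Counter(), Counter()
    for triplets in images:
        for t in triplets:
            triplet_c[t] += 1
        for pair in {(o1, o2) for _, o1, o2 in triplets}:
            pair_c[pair] += 1
        for r in {r for r, _, _ in triplets}:
            rel_c[r] += 1

    def pmi(t):
        r, o1, o2 = t
        # PMI = log p(r, o1, o2) / (p(o1, o2) * p(r))
        return math.log((triplet_c[t] / n) /
                        ((pair_c[(o1, o2)] / n) * (rel_c[r] / n)))

    return sorted(triplet_c, key=pmi, reverse=True)
```

Triplets whose objects co-occur often without that relation receive negative PMI and fall to the bottom of the ranking.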

Mining Entailment Relations
In this section we combine the relation-based tuples mined from visual annotations in the previous section with the more than 400k textual descriptions included in MS-COCO. We generate a set of entailments S2 = {r(o1, o2) ⇒ z}, where r(o1, o2) is an element of S1 and z is a consequent obtained from the textual descriptions. As in the previous section, we rank the relations in S2 using a function γ2 : S2 → R.
We start by generating an exhaustive list of candidate consequents z. We first pre-process the image captions with the part-of-speech tagger and lemmatizer from the Stanford Core NLP toolkit (Manning et al., 2014), and remove stop words. Then we generate a list of n-length skipgrams for each caption. The set of n-skipgrams is filtered using predefined lexical patterns (⟨noun, verb, noun⟩, ⟨noun, *, verb, *, noun⟩, ⟨noun, *, preposition, *, noun⟩, and ⟨noun, *, verb, preposition, *, noun⟩), and redundancies are removed. Skipgrams z are then paired with co-occurring relations r(o1, o2), removing pairs with the disconnected-from spatial relation (see Figure 1). Pairs are scored with the conditional probability:

γ2(r(o1, o2) ⇒ z) = p(z | r, o1, o2)

The consequent z can take the form q, q(o1), q(o2), or q(o1, o2), obtained by a simple alignment with the arguments in the antecedent. We perform this alignment by mapping the object categories in the antecedent r(o1, o2) to WordNet synsets, and matching any word in z to any word in the gloss set of the predicate arguments o1 and o2. The unmatched words in z form the relation, whereas matched words form arguments. We produce the form q if there are no matches, q(o1) or q(o2) when one argument word matches, and q(o1, o2) when both match. Examples of discovered entailments are shown in Figure 3.
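A minimal sketch of the skipgram generation and the count-based conditional-probability score, assuming a length-n skipgram may skip at most max_skip tokens in total; the function names and the plain-dictionary count format are ours.

```python
from itertools import combinations

def skipgrams(tokens, n, max_skip):
    # All length-n ordered subsequences of `tokens` that skip at most
    # `max_skip` intermediate tokens in total.
    out = set()
    for idxs in combinations(range(len(tokens)), n):
        if idxs[-1] - idxs[0] + 1 - n <= max_skip:
            out.add(tuple(tokens[i] for i in idxs))
    return out

def entailment_score(pair_counts, antecedent_counts, rel, z):
    # gamma_2(r(o1, o2) => z) = p(z | r, o1, o2), estimated as the
    # co-occurrence count of (relation, skipgram) over the relation count.
    return pair_counts[(rel, z)] / antecedent_counts[rel]
```

In the setup described later, n ranges from 2 to 6 with at most 6 skips, and candidates occurring fewer than 5 times are dropped before scoring.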

Generalizing Relations using WordNet
In this section we present an approach to generalize an initial set of relations, S, to objects not found in the original vocabulary V. Using WordNet we construct a superset G containing all possible parent relations for the relations in S by replacing their arguments o1, o2 with all of their possible hypernyms. Our objective is to select a subset T from G that contains high quality and diverse generalized relations. Note that elements in G can be too general and contradict statements in S, while others could be correct but add little new knowledge. To balance these concerns, we formulate the selection as an optimization problem that maximizes a fitness function L:

max_T L(T) = ψ(T) + λ Σ_{t ∈ T} φ(t),  subject to |T| ≤ k

where ψ is a coverage term that computes the total number of facts implied through hyponym relationships by the elements in T. The second term φ is a consistency term that measures the compatibility of a generalized relation t with the relations in S. We assume that if a relation is missing from S, then it is false (a closed world assumption over the domain of S). Thus, φ is the ratio of the scores of relations in S consistent with relation t (i.e. evidence for t based on S) to a value proportional to the number of relations implied by t that are missing from S (i.e. the amount of counter-evidence). More concretely:

φ(t) = ( Σ_{s ∈ S, t ⊨ s} γ(s) ) / ( µ · m_t · d )

where µ is a constant, m_t is the number of relations implied by t that are missing from S, and d is the product of the WordNet distances of the synsets involved in t to their nearest synsets in S. This penalizes relations that are far away from categories in S. This optimization is an instance of the submodular k-coverage problem. We use a greedy algorithm that repeatedly adds to T the element that maximizes L; due to the submodularity of the objective, this approximates the optimal solution up to a constant factor.
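The greedy selection can be sketched as follows, assuming each candidate's coverage set (facts it implies through hyponymy) and its precomputed φ value are given as plain dictionaries; this is the textbook greedy for monotone submodular maximization, not the authors' implementation.

```python
def greedy_select(candidates, coverage, phi, lam, k):
    # Greedily maximize L(T) = psi(T) + lam * sum(phi(t) for t in T)
    # subject to |T| <= k, where psi(T) is the size of the union of the
    # facts covered by the elements of T. Because L is monotone
    # submodular, the greedy solution is a constant-factor approximation.
    T, covered = [], set()
    for _ in range(k):
        best, best_gain = None, 0.0
        for t in candidates:
            if t in T:
                continue
            # Marginal gain: newly covered facts plus the consistency term.
            gain = len(coverage[t] - covered) + lam * phi[t]
            if gain > best_gain:
                best, best_gain = t, gain
        if best is None:        # no candidate adds positive value
            break
        T.append(best)
        covered |= coverage[best]
    return T
```

With overlapping coverage sets, a broad but redundant generalization loses its marginal value once its hyponym facts are already covered, which is what keeps the selected set diverse.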

Experimental Setup
Object-Object Relations: We filter out candidate relations that occur fewer than 20 times. We extract more than 3.1k unique statements (6k including symmetric spatial relations).

Entailment Relations: We use skipgrams of length 2-6, allowing at most 6 skips, filter candidates so that they occur at least 5 times, and return the top 10 most likely entailments per spatial relation. Overall, 6.3k unique statements are extracted (10k including symmetric relations).

Generalized Relations: We optimize the generalization objective only for object-object relations, because the closed world assumption makes counts for implications sparse. The parameter µ is set to the average of the scores, λ = 0.05, and k = 200.

Evaluation
We evaluated the quality of the common sense facts we derive on Amazon Mechanical Turk. Annotators are presented with possible facts and asked to grade statements on a five point scale. Each fact was evaluated by 10 workers, and we normalize their average responses to a scale from 0 to 1. Figure 4 shows plots of quality vs. coverage, where coverage refers to the top fraction of relations sorted by our predicted quality scores.
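The normalization of ratings can be made explicit: a mean over the ten 5-point responses, linearly rescaled to [0, 1]. The exact rescaling used in the paper is our assumption; a simple linear map is sketched below.

```python
def normalized_quality(ratings):
    # Map 5-point ratings (1..5) from several workers to a single
    # score in [0, 1]: average them, then rescale linearly.
    mean = sum(ratings) / len(ratings)
    return (mean - 1) / 4
```

Under this map, all-1 responses give 0.0, all-5 responses give 1.0, and a neutral 3 gives 0.5.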
Object-Object Relations: As a baseline, 1000 randomly sampled relations have a quality of 0.225. Figure 4a shows that our PMI measure ranks many high quality facts at the top, with the top quintile of the ranking rated above 0.63 in quality. Facts about persons are of higher quality, likely because this category appears in over 50% of the images in MS-COCO.

Entailment Relations: Turkers were instructed to assign the lowest score when they could not understand the consequent of an entailment. As a baseline, 1000 randomly sampled implications that match our patterns have a quality of 0.33. Figure 4b shows that extracting high quality entailments is harder than extracting object-object relations, likely because the antecedent and consequent need to coordinate. Relations involving furniture are rated higher; manual inspection revealed that many relations about furniture imply stative verbs or spatial terms.
Generalized Relations: To evaluate generalizations (Figure 4c), we also present users with definitions of the generalized categories. As a baseline, 200 randomly sampled generalizations of our 3k object-object relations have a quality of 0.53. The generalizations we find are high quality and cover over 400k object facts not present in MS-COCO. Examples from the 200 we derive include: holds(dining-table, cutlery), holds(bowl, edible fruit), and on(domestic animal, bed).

Conclusion
In this work, we use an existing object detection dataset to extract 16k common sense statements about the annotated categories. We also show how to generalize these statements using WordNet, inducing hundreds of thousands of facts about unseen objects. The information we extract is visual, large scale, and of high quality, and has the potential to be useful for both visual recognition and entailment applications.