Automatic Annotation of Structured Facts in Images

Motivated by the application of fact-level image understanding, we present an automatic method for data collection of structured visual facts from images with captions. Example structured facts include attributed objects, actions, interactions, and positional information. The collected annotations are in the form of fact-image pairs (a structured fact and an image region containing this fact). With a language approach, the proposed method is able to collect hundreds of thousands of visual fact annotations with an accuracy of 83% according to human judgment. Our method automatically collected more than 380,000 visual fact annotations and more than 110,000 unique visual facts from images with captions and localized them in images in less than one day of processing time on standard CPU platforms.


Introduction
People generally acquire visual knowledge by exposure to both visual facts and semantic or language-based representations of these facts, e.g., by seeing an image of "a person petting dog" and observing this visual fact associated with its language representation. In this work, we focus on methods for collecting structured facts, which we define as structures that provide attributes about an object and/or the actions and interactions this object may have with other objects. We introduce the idea of automatically collecting annotations for second-order and third-order visual facts, where second-order facts <S,P> are attributed objects (e.g., <S: car, P: red>) and single-frame actions (e.g., <S: person, P: jumping>), and third-order facts specify interactions (e.g., <boy, petting, dog>). This structure is helpful for designing machine learning algorithms that learn deeper image semantics from caption data, and it allows us to model the relationships between facts. To enable such a setting, we need to collect these structured fact annotations in the form of (language view, visual view) pairs (e.g., <baby, sitting on, chair> as the language view and an image with this fact as the visual view) to train models.
Chen et al. (2013) showed that visual concepts from a predefined ontology can be learned by querying the web about these concepts using image-web search engines. More recently, Divvala et al. (2014) presented an approach to learn concepts related to a particular object by querying the web with Google N-gram data that contains the concept name. There are three limitations to these approaches. (1) It is difficult to define the space of visual knowledge and then search for it. It is further restricting to define it based on a predefined ontology, as in (Chen et al., 2013), or on a particular object, as in (Divvala et al., 2014). (2) Image search is not reliable for collecting data for concepts with few images on the web. These methods assume that the top examples retrieved by image-web search are positive examples and that images annotated with the searched concept are available. (3) These concepts/facts are not structured, and hence the annotations lack information such as that "jumping" is the action part in <person, jumping>, or that "man" and "horse" are interacting in <person, riding, horse>. This structure is important for a deeper understanding of visual data, which is one of the main motivations of this work.
The problems in the prior work motivate us to propose a method to automatically annotate structured facts by processing image caption data, since facts in image captions are highly likely to be located in the associated images. We show that a large quantity of high-quality structured visual facts can be extracted from caption datasets using natural language processing methods. Caption writing is free-form and an easier task for crowd-sourcing workers than labeling second- and third-order tasks, and such free-form descriptions are readily available in existing image caption datasets. We focused on collecting facts from the MS COCO image caption dataset (Lin et al., 2014) and the newly collected Flickr30K Entities dataset (Plummer et al., 2015). We automatically collected more than 380,000 high-quality structured fact annotations from the 120,000 MS COCO scenes and the 30,000 Flickr30K scenes.
Figure 1: Structured Fact Automatic Annotation
The main contribution of this paper is an accurate, automatic, and efficient method for the extraction of structured fact visual annotations from image-caption datasets, as illustrated in Fig. 1. Our approach (1) extracts facts from captions associated with images and then (2) localizes the extracted facts in the image. For fact extraction from captions, we propose a new method called SedonaNLP to fill gaps in existing methods for fact extraction from sentences, such as Clausie (Del Corro and Gemulla, 2013). SedonaNLP produces more facts than Clausie, especially <subject, attribute> facts, and thus enables collecting more visual annotations than using Clausie alone. The final set of automatic annotations is the set of facts successfully localized in the associated images. We show that these facts are extracted with more than 80% accuracy according to human judgment.

Motivation
Our goal in proposing this automatic method is to generate language and vision annotations at the fact level, to help study language and vision for the sake of structured understanding of visual facts. Existing systems already work on relating captions directly to the whole image, such as (Karpathy et al., 2014; Kiros et al., 2015; Vinyals et al., 2015; Xu et al., 2015; Mao et al., 2015; Antol et al., 2015; Malinowski et al., 2015; Ren et al., 2015). This gives rise to a key question about our work: why is it useful to collect such a large quantity of structured facts compared to caption-level systems?
We illustrate the difference between caption-level learning and fact-level learning, which motivates this work, by the example in Fig. 1. (1) From the language view, the annotations we generate are precise in that they list a particular fact (e.g., <bicycle, parked between, parking posts>). (2) From the visual view, they provide the bounding box of this fact; see Fig. 1. (3) A third unique aspect of our annotations is the structure: e.g., <bicycle, parked between, parking posts> instead of "a bicycle parked between parking posts".
Our collected data has been used to develop methods that learn hundreds of thousands of image facts, as we introduced and studied in (Mohamed Elhoseiny, 2016). The results show that fact-level learning is superior to caption-level learning such as (Kiros et al., 2015), as shown in Table 4 in (Mohamed Elhoseiny, 2016) (16.39% accuracy versus 3.48% for (Kiros et al., 2015)). That table further shows the value of the associated structure (16.39% accuracy versus 8.1%). Similar results are also shown on a smaller scale in Table 3 in (Mohamed Elhoseiny, 2016).
We propose a two-step automatic annotation of structured facts: (i) extraction of structured facts from captions, and (ii) localization of these facts in images. First, the captions associated with the given image are analyzed to extract sets of clauses that are considered as candidate <S,P> and <S,P,O> facts.
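As a minimal sketch of how the candidate <S,P> and <S,P,O> facts described above could be represented in code, the following Python class stores the parts of a fact and reports its order. The class name `Fact` and its field names are illustrative assumptions and are not taken from the original system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fact:
    """A structured fact: <S,P> (second-order) or <S,P,O> (third-order)."""

    subject: str
    predicate: str
    obj: Optional[str] = None  # None for second-order facts

    @property
    def order(self) -> int:
        # Two parts for <S,P>, three parts for <S,P,O>.
        return 2 if self.obj is None else 3

    def __str__(self) -> str:
        parts = [self.subject, self.predicate]
        if self.obj is not None:
            parts.append(self.obj)
        return "<" + ", ".join(parts) + ">"


red_car = Fact("car", "red")
petting = Fact("boy", "petting", "dog")
print(red_car, red_car.order)
print(petting, petting.order)
```

Keeping the object optional lets the same type carry both fact orders, which mirrors how the two kinds of facts are treated side by side in the rest of the pipeline.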
Captions can provide a tremendous amount of information to image understanding systems. However, developing NLP systems that accurately and completely extract structured knowledge from free-form text is an open problem. We extract structured facts using two methods: Clausie (Del Corro and Gemulla, 2013) and Sedona (detailed later in Sec. 4); also see Fig. 1. We found that Clausie (Del Corro and Gemulla, 2013) missed many visual facts in the captions, which motivated us to develop Sedona to fill this gap, as detailed in Sec. 4.
Second, we localize these facts within the image (see Fig. 1). The facts successfully located in the images are saved as fact-image annotations that can be used to train visual perception models to learn attributed objects, actions, and interactions. We managed to collect 380,409 high-quality second- and third-order fact annotations (146,515 from Flickr30K Entities, 157,122 from the MS COCO training set, and 76,772 from the MS COCO validation set). We present statistics of the automatically collected facts in the Experiments section. Note that the process of localizing facts in an image is constrained by the information in the dataset.
For MS COCO, the dataset contains object annotations for about 80 different objects, as provided by the training and validation sets. Although this provides abstract information about the objects in each image (e.g., "person"), an object is usually mentioned in different ways in the caption. For the "person" object, "man", "girl", "kid", or "child" could appear in the caption instead. In order to locate second- and third-order facts in images, we started by defining visual entities. For the MS COCO dataset (Lin et al., 2014), we define a visual entity as any noun that is either (1) one of the MS COCO dataset objects, (2) a noun in the WordNet ontology (Miller, 1995; Leacock and Chodorow, 1998) that is an immediate or indirect hyponym of one of the MS COCO objects (since WordNet is searchable by sense and not by word, we perform word sense disambiguation on the sentences using a state-of-the-art method (Zhong and Ng, 2010)), or (3) one of the scenes in the SUN dataset (Xiao et al., 2010) (e.g., a "restaurant"). We expect visual entities to appear in either the S or the O part (if it exists) of a candidate fact. This allows us to then localize facts for images in the MS COCO dataset. Given a candidate third-order fact, we first try to assign each S and O to one of the visual entities. If the S and O elements are not visual entities, then the fact is ignored. Otherwise, the facts are processed by several heuristics, detailed in Sec. 5. For instance, our method takes into account that grounding the plural "men" in the fact <S: men, P: chasing, O: soccer ball> may require the union of multiple "man" bounding boxes.
In the Flickr30K Entities dataset (Plummer et al., 2015), the bounding box annotations are presented as phrase labels for sentences (one for each phrase in a caption that refers to an entity in the scene). A visual entity is considered to be a phrase with a bounding box annotation or one of the SUN scenes. Several heuristics were developed and applied to collect these fact annotations, e.g., grounding a fact about a scene to the entire image; these are detailed in Sec. 5.

Fact Extraction from Captions
We extract facts from captions using Clausie (Del Corro and Gemulla, 2013) and our proposed SedonaNLP system. In contrast to Clausie, we address several challenging linguistic issues by evolving our NLP pipeline to: (1) correct many common spelling and punctuation mistakes, (2) resolve word sense ambiguity within clauses, and (3) learn a common spatial preposition lexicon (e.g., "next to", "on top of", "in front of") that consists of over 110 such terms, as well as a lexicon of over two dozen collection phrase adjectives (e.g., "group of", "bunch of", "crowd of", "herd of"). For our purpose, these strategies allowed us to extract more interesting structured facts that Clausie fails at, which include (1) more discrimination between singular and plural terms and (2) positional facts (e.g., next to). Additionally, SedonaNLP produces attribute facts that we denote as <S, A>.
Varying degrees of success have been achieved in extracting and representing structured triples from sentences using <subject, predicate, object> triples. For instance, Rusu et al. (2007) describe a basic set of methods based on traversing the parse graphs generated by various commonly available parsers. Larger-scale text mining methods for learning structured facts for question answering have been developed in the IBM Watson PRISMATIC framework (Fan et al., 2010). While parsers such as CoreNLP (Manning et al., 2014) are available to generate comprehensive dependency graphs, these have historically required significant processing time for each sentence or have traded accuracy for performance. In contrast, SedonaNLP currently employs a shallow dependency parsing method that in some cases runs 8-9x faster than the earlier cited methods on identical hardware. We choose a shallow approach with high, medium, and low confidence cutoffs after observing that roughly 80% [...] below to reduce higher-occurrence errors due to systematic parsing errors: (i) mapping past participles to adjectives (e.g., stained glass), (ii) de-nesting existential facts (e.g., "this is a picture of a cat watching a tv"), and (iii) identifying auxiliary verbs (e.g., "do" verb forms).
In Fig. 4, we show an example of extracted <S,P,O> structured facts useful for image annotation for a small sample of MS COCO captions. Our initial experiments empirically confirmed the findings of the IBM Watson PRISMATIC researchers, who indicated that big complex parse trees tend to have more wrong parses. By limiting a frame to be only a small subset of a complex parse tree, we reduce the chance of a parse error in each frame (Fan et al., 2010). In practice, we observed many correctly extracted structured facts for the more complex sentences (i.e., sentences with multiple VX verb expressions and multiple spatial prepositional expressions); these facts contained useful information that could have been used in our joint learning model but were conservatively filtered to help ensure the overall accuracy of the facts being presented to our system. As improvements are made to semantic triple extraction and confidence evaluation systems, we see potential in several areas to exploit more structured facts and to filter less information. Our full <S,P,O> triple and related tuple extractions for the MS COCO and Flickr30K datasets are available in the supplemental material.

Locating facts in the Image
In this section, we present details about the second step of our automatic annotation process introduced in Sec. 3. After the candidate facts are extracted from the sentences, we end up with a set $F_s = \{f_l^i\}$, $i = 1:N_s$, for statement $s$, where $N_s$ is the number of candidate facts $f_l^i$ extracted from the statement $s$ using either Clausie (Del Corro and Gemulla, 2013) or Sedona-3.0. The localization step is further divided into two steps. The mapping step maps nouns in the facts to candidate boxes in the image. The grounding step processes each fact associated with the candidate boxes and outputs a final bounding box if localization is successful. The two steps are detailed in the following subsections.

Mapping
The mapping step starts with a pre-processing step that filters out a non-useful subset of $F_s$ and produces a more useful set $F_s^*$ that we try to locate/ground in the image. We perform this step by applying word sense disambiguation using the state-of-the-art method of (Zhong and Ng, 2010). The word sense disambiguation method provides each word in the statement with a word sense in the WordNet ontology (Leacock and Chodorow, 1998). It also assigns each word a part-of-speech tag. Hence, for each extracted candidate fact in $F_s$, we can verify whether it follows the expected parts of speech according to (Zhong and Ng, 2010). For instance, all S should be nouns, all P should be either verbs or adjectives, and all O should be nouns. This results in a filtered set of facts $F_s^*$. Then, each S is associated with a set of candidate boxes in the image for second- and third-order facts, and each O is associated with a set of candidate boxes in the image for third-order facts only. Since entities in the MS COCO and Flickr30K datasets are annotated differently, we present how the candidate boxes are determined in each of these datasets.
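The part-of-speech check described above can be sketched as follows. This is an illustrative sketch only: it assumes that a tagger has already produced a coarse tag for each fact part, and the tag names (`NOUN`, `VERB`, `ADJ`) and the function name `follows_expected_pos` are hypothetical rather than part of the original pipeline.

```python
def follows_expected_pos(fact, pos_tags):
    """Return True if a candidate fact has the expected parts of speech.

    fact: tuple (S, P) or (S, P, O).
    pos_tags: dict mapping each fact part to a coarse tag (assumed given).
    """
    subject, predicate = fact[0], fact[1]
    if pos_tags.get(subject) != "NOUN":
        return False  # S must be a noun
    if pos_tags.get(predicate) not in {"VERB", "ADJ"}:
        return False  # P must be a verb or an adjective
    if len(fact) == 3 and pos_tags.get(fact[2]) != "NOUN":
        return False  # O (third-order facts only) must be a noun
    return True


# Toy tags standing in for the output of a real tagger.
tags = {"man": "NOUN", "riding": "VERB", "horse": "NOUN",
        "red": "ADJ", "quickly": "ADV"}
candidates = [("man", "riding", "horse"), ("horse", "red"),
              ("quickly", "riding", "horse")]
filtered = [f for f in candidates if follows_expected_pos(f, tags)]
print(filtered)
```

In this toy run the third candidate is dropped because its subject is not a noun, which corresponds to the filtering of $F_s$ into $F_s^*$.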
MS COCO Mapping: Mapping to candidate boxes for MS COCO reduces to assigning the S for second- and third-order facts, and the O for third-order facts. Each S or O is assigned to one of the MS COCO object classes or SUN scene classes. Given the word sense of the given part (S or O), we check whether this sense is a descendant of one of the MS COCO object senses in the WordNet ontology. If it is, the given part (S or O) is associated with the set of candidate bounding boxes that belong to that object (e.g., all boxes that contain the "person" MS COCO object, when the word, such as "man" or "girl", falls under the "person" WordNet node). If the given part (S or O) is not an MS COCO object or one of its descendants under WordNet, we further check whether the given part is one of the SUN dataset scenes. If this condition holds, the given part is associated with a bounding box covering the whole image.
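The MS COCO mapping rule can be illustrated with a small self-contained sketch. It replaces WordNet with a toy hypernym dictionary and works on words rather than disambiguated senses, so it is a simplification under stated assumptions; the names `HYPERNYM`, `COCO_OBJECTS`, `SUN_SCENES`, `coco_category`, and `candidate_boxes` are hypothetical, and boxes are assumed to be `(x1, y1, x2, y2)` tuples.

```python
# Toy stand-ins (assumptions): a tiny hypernym chain instead of WordNet,
# and small subsets of the MS COCO object and SUN scene vocabularies.
HYPERNYM = {"man": "person", "girl": "person", "child": "person",
            "kid": "child", "puppy": "dog"}
COCO_OBJECTS = {"person", "dog"}
SUN_SCENES = {"restaurant"}


def coco_category(word):
    """Walk up the toy hypernym chain until an MS COCO object is reached."""
    seen = set()
    while word is not None and word not in seen:
        if word in COCO_OBJECTS:
            return word
        seen.add(word)
        word = HYPERNYM.get(word)
    return None


def candidate_boxes(word, annotations, image_size):
    """Return candidate boxes for an S or O part of a fact.

    annotations: dict mapping an MS COCO category to its boxes in the image.
    image_size: (width, height) of the image.
    """
    category = coco_category(word)
    if category is not None:
        return list(annotations.get(category, []))
    if word in SUN_SCENES:
        width, height = image_size
        return [(0, 0, width, height)]  # a scene maps to the whole image
    return []  # not a visual entity


annotations = {"person": [(10, 20, 50, 90), (60, 15, 95, 85)]}
print(candidate_boxes("kid", annotations, (100, 100)))
print(candidate_boxes("restaurant", annotations, (100, 100)))
print(candidate_boxes("sky", annotations, (100, 100)))
```

A real implementation would query WordNet with the sense chosen by word sense disambiguation; the dictionary lookup here only mimics the "descendant of an MS COCO object" test.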
Flickr30K Mapping: In contrast to the MS COCO dataset, in the Flickr30K dataset a bounding box annotation is provided for each entity in each statement. Hence, we compute the candidate bounding box annotations for each candidate fact by searching the entities in the same statement from which the clause is extracted. Candidate boxes are those that have the same name. Similarly, this process assigns S for second-order facts and assigns S and O for third-order facts.
Having finished the mapping process, whether for MS COCO or Flickr30K, each candidate fact $f_l^i \in F_s^*$ is associated with candidate boxes depending on its type, as follows.
<S,P>: Each $f_l^i \in F_s^*$ of second-order type is associated with one set of bounding boxes $b_S^i$, which are the candidate boxes for the S part. $b_O^i$ can be assumed to always be an empty set for second-order facts.
<S,P,O>: Each $f_l^i \in F_s^*$ of third-order type is associated with two sets of bounding boxes, $b_S^i$ and $b_O^i$, as candidate boxes for the S and O parts, respectively.

Grounding
The grounding process associates each $f_l^i \in F_s^*$ with an image region $f_v^i$ by assigning $f_l^i$ to a bounding box in the given image scene, given the $b_S^i$ and $b_O^i$ candidate boxes. The grounding process differs somewhat between the two datasets because of the difference in the entity annotations.

Grounding: MS COCO dataset (Training and Validation sets)
In the MS COCO dataset, one challenging aspect is that the S or O can be singular, plural, or a reference to the scene. This means that one S could map to multiple boxes in the image. For example, "people" maps to multiple boxes of "person". Furthermore, this case could exist for both the S and the O. In cases where either S or O is plural, the bounding box assigned is the union of all candidate bounding boxes in $b_S^i$ or $b_O^i$, respectively. The grounding then proceeds as follows.
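<S,P> facts: (1) If the computed $b_S^i = \emptyset$ for the given $f_l^i$, then $f_l^i$ fails to ground and is discarded. (2) If S is singular, $f_v^i$ is the image region with the largest candidate bounding box in $b_S^i$. (3) If S is plural, $f_v^i$ is the image region given by the union of the candidate bounding boxes in $b_S^i$.
<S,P,O> facts: (1) If $b_S^i = \emptyset$ and $b_O^i = \emptyset$, $f_l^i$ fails to ground and is ignored. (2) If $b_S^i \neq \emptyset$ and $b_O^i \neq \emptyset$, then bounding boxes are assigned to S and O such that the distance between them is minimized (though if S or O is plural, the assigned bounding box is the union of all bounding boxes in $b_S^i$ or $b_O^i$, respectively), and the grounding is assigned the union of the bounding boxes assigned to S and O. (3) If exactly one of $b_S^i$ and $b_O^i$ is empty, then a bounding box is assigned to the present object (the largest bounding box if singular, or the union of all bounding boxes if plural). If the area of this region compared to the area of the whole scene is greater than a threshold $th = 0.3$, then $f_v^i$ is associated with the whole image of the scene. Otherwise, $f_l^i$ fails to ground and is ignored.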
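The MS COCO grounding rules can be sketched in a few lines of Python. This is a hedged illustration, not the original implementation: boxes are assumed to be `(x1, y1, x2, y2)` tuples, the distance between an S box and an O box is assumed to be the distance between box centers (the text does not specify the distance measure), and all function names are hypothetical.

```python
import math
from itertools import product


def area(box):
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)


def union(boxes):
    """Smallest box enclosing all given boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def center_distance(a, b):
    # Assumption: distance between boxes is measured between their centers.
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return math.hypot(ax - bx, ay - by)


def ground_sp(s_boxes, s_plural):
    """<S,P>: fail if empty, union if plural, largest box if singular."""
    if not s_boxes:
        return None
    return union(s_boxes) if s_plural else max(s_boxes, key=area)


def ground_spo(s_boxes, o_boxes, s_plural, o_plural, image_size, th=0.3):
    """<S,P,O>: returns the grounded region, or None if grounding fails."""
    width, height = image_size
    if not s_boxes and not o_boxes:
        return None
    if s_boxes and o_boxes:
        # Plural parts contribute a single candidate: the union of their boxes.
        s_cands = [union(s_boxes)] if s_plural else s_boxes
        o_cands = [union(o_boxes)] if o_plural else o_boxes
        s_box, o_box = min(product(s_cands, o_cands),
                           key=lambda pair: center_distance(*pair))
        return union([s_box, o_box])
    # Exactly one part has candidates.
    present, plural = (s_boxes, s_plural) if s_boxes else (o_boxes, o_plural)
    region = union(present) if plural else max(present, key=area)
    if area(region) / (width * height) > th:
        return (0, 0, width, height)  # associate with the whole image
    return None


s_boxes = [(0, 0, 10, 10), (50, 50, 70, 70)]
o_boxes = [(12, 0, 20, 10)]
print(ground_sp(s_boxes, s_plural=False))
print(ground_spo(s_boxes, o_boxes, False, False, (100, 100)))
```

In the usage example, the singular subject is paired with the object box whose center is closest, and the fact is grounded to the union of the two selected boxes.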

Grounding: Flickr30K dataset
The main difference in Flickr30K is that for each entity phrase in a sentence, there is a box in the image. This means there is no need to handle singular and plural cases separately, since the word "men" in the sentence will be associated with the set of boxes referred to by "men" in the sentence. We take the union of these boxes for plural words as one candidate box for "men". We can also use the information that the object box has to refer to a word that comes after the subject word, since the subject usually occurs earlier in the sentence than the object.
<S,P> facts: If the computed $b_S^i = \emptyset$ for the given $f_l^i$, then $f_l^i$ fails to ground and is discarded. Otherwise, the fact is assigned to the largest candidate box in $b_S^i$ if there are multiple boxes.
<S,P,O> facts: <S,P,O> facts are handled very similarly to the MS COCO dataset, with two main differences.
a) The candidate boxes are computed as described above for the Flickr30K dataset.
b) All cases are handled as the singular case, since even plural words are assigned one box based on the nature of the annotations in this dataset.

Human Subject Evaluation
We propose three questions to evaluate each annotation. (Q1) Is the extracted fact correct (Yes/No)? The purpose of this question is to evaluate errors introduced by the first step, which extracts facts using Sedona or Clausie. (Q2) Is the fact located in the image (Yes/No)? In some cases, a fact mentioned in the caption might not exist in the image and is mistakenly considered as an annotation. (Q3) How accurate is the box assigned to a given fact (a to g)? a (about right), b (a bit big), c (a bit small), d (too small), e (too big), f (totally wrong box), g (fact does not exist or other). Our instructions to the participants on these questions can be found at this anonymous URL (Eval, 2016).
We evaluate these three questions for the facts that were successfully assigned a box in the image, because the main purpose of this evaluation is to measure the usability of the collected annotations as training data for our model. We created an Amazon Mechanical Turk form to ask these three questions. So far, we have collected a total of 10,786 evaluation responses, which evaluate 3,595 ($f_v$, $f_l$) pairs (3 responses per pair). Table 1 shows the evaluation results, which indicate that the data is useful for training, since ≈83.1% of them are correct facts with boxes that are either about right, a bit big, or a bit small (a, b, c). We further report evaluation responses that we collected from volunteer researchers in Table 2, which show similar results.
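A minimal sketch of how such responses could be aggregated into a usable-annotation rate is given below. It assumes that a response counts as usable when Q1 and Q2 are both answered Yes and Q3 is a, b, or c; this combination rule is an assumption about how the reported figure is computed, and the responses in the example are toy values, not the collected data.

```python
ACCEPTABLE_BOXES = {"a", "b", "c"}  # about right, a bit big, a bit small


def usable_fraction(responses):
    """Fraction of responses judged usable for training.

    responses: list of (q1_correct, q2_in_image, q3_box_grade) tuples,
    where the first two entries are booleans and the third is 'a'..'g'.
    """
    if not responses:
        return 0.0
    usable = sum(1 for q1, q2, q3 in responses
                 if q1 and q2 and q3 in ACCEPTABLE_BOXES)
    return usable / len(responses)


# Toy responses for illustration only.
toy_responses = [(True, True, "a"), (True, True, "e"),
                 (True, False, "g"), (True, True, "c")]
print(usable_fraction(toy_responses))
```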
Fig. 6 shows some successful qualitative results that include four structured facts extracted from the MS COCO dataset (e.g., <person, using, phone>, <person, standing>, etc.). Fig. 7 also shows a negative example where there is a wrong fact among the extracted facts (i.e., <house, ski>). The main reason for this failure case is that "how" is mistyped as "house"; see Fig. 7. The supplementary materials include all the captions of these examples as well as additional qualitative examples.

Hardness Evaluation of the collected data
In order to study how the method behaves on both easy and hard examples, this section presents statistics of the successfully extracted facts and relates them to the hardness of extracting these facts. We start by defining the hardness of an extracted fact in our case and its dependency on the fact type. Our method collects both second- and third-order facts. We refer to candidate subjects as all instances of the entity in the image that match the subject type of either a second-order fact <S,P> or a third-order fact <S,P,O>. We refer to candidate objects as all instances in the image that match the object type of a third-order fact <S,P,O>. The selection of the candidate subjects and candidate objects is a part of our method that we detailed in Sec. 5. We define the hardness of a second-order fact as the number of candidate subjects, and the hardness of a third-order fact as the number of candidate subjects multiplied by the number of candidate objects.
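The hardness definition above amounts to a small computation, sketched here together with a histogram over toy counts. The function name `hardness` and the example counts are illustrative assumptions, not values from the collected data.

```python
from collections import Counter


def hardness(num_candidate_subjects, num_candidate_objects=None):
    """Hardness of grounding a fact.

    Second-order <S,P>: number of candidate subjects.
    Third-order <S,P,O>: candidate subjects times candidate objects.
    """
    if num_candidate_objects is None:
        return num_candidate_subjects
    return num_candidate_subjects * num_candidate_objects


# Toy (subjects, objects) counts; None marks a second-order fact.
toy_facts = [(1, None), (3, None), (2, 4), (3, 1)]
histogram = Counter(hardness(s, o) for s, o in toy_facts)
print(sorted(histogram.items()))
```

Each histogram bin counts how many facts share a given number of grounding possibilities, which is the quantity plotted on the X axis of the hardness figures.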
In Figs. 16 and 17, the Y axis is the number of facts for each bin. The X axis shows the bins that correspond to the hardness that we defined for both second- and third-order facts. Figure 16 shows a histogram of the difficulties for all MTurk-evaluated examples, including both the successful and the failure cases. Figure 17 shows a similar histogram but for the subset of facts verified by the Turkers with Q3 as (about right). The figures show that the method is able to handle difficult cases, even with more than 150 possibilities for grounding. We show these results broken out for the MS COCO and Flickr30K Entities datasets and for each fact type in the supplementary materials.
Figure 6: Several facts successfully extracted by our method from two MS COCO scenes
Figure 7: An example where one of the extracted facts is not correct due to a spelling mistake

Conclusion
We present a new method whose main purpose is to collect visual fact annotations by a language approach. The collected data help train visual systems at the fact level, with the diversity of facts captured by any fact described by an image caption. We showed the effectiveness of the proposed methodology by extracting hundreds of thousands of fact-level annotations from the MS COCO and Flickr30K datasets. We verified and analyzed the collected data and showed that more than 80% of the collected data are good for training visual systems.

Supplementary Materials
This document includes several qualitative examples of high-order facts automatically collected from the MS COCO and Flickr30K datasets. It contains:
1) Qualitative examples
2) Statistics on the collected data
3) Extracted facts, and statistics on them, using Sedona for each of the MS COCO train, MS COCO validation, and Flickr30K datasets. There is an attached zip file for each dataset.

SAFA successful cases
On the left is the input, and on the right is the output.
In order to study how the method behaves on both easy and hard examples, this section presents statistics of the successfully extracted facts and relates them to the hardness of extracting these facts. We start by defining the hardness of an extracted fact in our case and its dependency on the fact type. Our method collects both second- and third-order facts. We refer to candidate subjects as all instances of the entity in the image that match the subject type of either a second-order fact <S,P> or a third-order fact <S,P,O>. We refer to candidate objects as all instances in the image that match the object type of a third-order fact <S,P,O>. The selection of the candidate subjects and candidate objects is a part of our method that we detailed in Sec. 4 in the paper. We define the hardness of a second-order fact as the number of candidate subjects, and the hardness of a third-order fact as the number of candidate subjects multiplied by the number of candidate objects. In all the following figures, the Y axis is the number of facts for each bin. The X axes are bins that correspond to:
(1) the number of candidate subjects for second- and third-order facts;
(2) the number of candidate objects for third-order facts;
(3) the number of candidate objects multiplied by the number of candidate subjects for third-order facts (which is all possible pairs of entities in an image that match the given <S,P,O> fact).
Here the X axis shows the number of possible candidates as the level of difficulty/hardness, and the Y axis shows the number of facts in each case.

Statistics on all MTurk data compared to the subset where Q3 = about right for each dataset
The following figures show statistics on the facts verified by MTurkers (from the union of all datasets). Figure 16 shows a histogram of the difficulties for all MTurk-evaluated examples. Figure 17 shows a similar histogram but for the subset of facts verified by the Turkers with Q3 as (about right). The figures show that the method is able to handle difficult cases, even with more than 150 possibilities for grounding.
This section shows the number of candidate subject and object statistics for all successfully grounded facts for the MS COCO (union of training and validation subsets) and Flickr30K datasets. SAFA collects second- and third-order facts. We refer to candidate subjects as all instances of the entity that match the subject of either a second-order fact <S,P> or a third-order fact <S,P,O>. We refer to candidate objects as all instances of the entity that match the object of a third-order fact <S,P,O>. The selection of the candidate subjects and candidate objects is a part of our method that we detailed in this paper.
Our method was designed to achieve high precision so that the grounded facts are as accurate as possible, as we showed in our experiments. In all the following figures, the Y axis is the number of facts for each bin. The X axes are bins that correspond to:
(1) the number of candidate subjects for second- and third-order facts;
(2) the number of candidate objects for third-order facts;
(3) the number of candidate objects multiplied by the number of candidate subjects for third-order facts (which is all possible pairs of entities in an image that match the given <S,P,O> fact).

Second-Order Facts <S,P>: Candidate Subjects

Caption-level learning systems correlate captions like those at the top of Fig. 1 (top-left) to the whole image that includes all objects. Structured fact-level learning systems are instead fed with localized annotations for each fact extracted from the image caption; see Fig. 1 (right) and Figs. 6 and 7 in Sec. 6. Fact-level annotations are less confusing training data than sentences because they provide more precise information for both the language and the visual views.
Figure 3: Accumulative Percentage of SP and SPO facts in COCO 2014 captions as number of verbs increases

Figure 4: Examples of caption processing and <S,P,O> and <S,P> structured fact extractions.

Figure 2: SedonaNLP Pipeline for Structured Fact Extraction from Captions
MS COCO grounding rules. <S,P> facts: (2) If S is singular, $f_v^i$ is the image region with the largest candidate bounding box in $b_S^i$. (3) If S is plural, $f_v^i$ is the image region given by the union of the candidate bounding boxes in $b_S^i$. <S,P,O> facts: (1) If $b_S^i = \emptyset$ and $b_O^i = \emptyset$, $f_l^i$ fails to ground and is ignored. (2) If $b_S^i \neq \emptyset$ and $b_O^i \neq \emptyset$, then bounding boxes are assigned to S and O such that the distance between them is minimized (though if S or O is plural, the assigned bounding box is the union of all bounding boxes in $b_S^i$ or $b_O^i$, respectively), and the grounding is assigned the union of the bounding boxes assigned to S and O.

Figure 8: (All MTurk Data) Hardness histogram after candidate box selection using our method

Figure 12: Example 3 from MS COCO dataset

Figure 16: (All MTurk Data) Number of Possibilities for grounding each item after Natural Language Processing

Figure 18: (Flickr30K Dataset) Hardness of automatically collected perfect examples (Q3 = about right). (Bottom: the same figure restricted to a hardness of 2 candidates or more.)

Table 1: Human Subject Evaluation by MTurk workers (%)

Table 2: Human Subject Evaluation by Volunteers (%). This is another set of annotations, different from those evaluated by MTurkers.