Generating Descriptions of Spatial Relations between Objects in Images

We investigate the task of predicting prepositions that can be used to describe the spatial relationships between pairs of objects depicted in images. We explore the extent to which such spatial prepositions can be predicted from (a) language information, (b) visual information, and (c) combinations of the two. In this paper we describe the dataset of object pairs and prepositions we have created, and report first results for predicting prepositions for object pairs, using a Naive Bayes framework. The features we use include object class labels and geometrical features computed from object bounding boxes. We evaluate the results in terms of accuracy against human-selected prepositions.


Introduction
The task we investigate is predicting the prepositions that can be used to describe the spatial relationships between pairs of objects in images. This is not the same as inferring the actual 3-D real-world spatial relationships between objects, although the two tasks have some similarities. Preposition prediction is an important subtask in automatic image description (which matters not just for assistive technology, but also for applications such as text-based querying of image databases), yet it is rarely addressed as a subtask in its own right. If an image description method produces spatial prepositions, it tends to be as a side-effect of the overall method (Mitchell et al., 2012; Kulkarni et al., 2013), or else the relationships are not between objects but, e.g., between objects and the 'scene' (Yang et al., 2011). An example of preposition selection as a separate subtask is Elliott & Keller (2013), where the mapping is hard-wired manually.
Our main data source is a corpus of images (Everingham et al., 2010) in which objects have been annotated with rectangular bounding boxes and object class labels. For a subset of 1,000 of the images we also have five human-created descriptions of the whole image (Rashtchian et al., 2010).
We collected additional annotations for the images (Section 2.3) which list, for each object pair, the set of prepositions that human annotators selected as correctly describing the spatial relationship between the two objects.
The aim is to create models for the mapping from image, bounding boxes and labels to spatial prepositions, as indicated in Figure 1. To this end we represent object pairs with a range of features computed from image, bounding boxes and labels. We investigate the predictive power of different types of features within a Naive Bayes framework (Section 3), and report first results in terms of two measures of accuracy (Section 4).

VOC'08
The PASCAL VOC 2008 Shared Task Competition (VOC'08) data consists of 8,776 images and 20,739 objects in 20 object classes (Everingham et al., 2010). In each image, every object belonging to one of the 20 VOC'08 object classes is annotated with its object class label and a bounding box (among other annotations):
50 images in each of the 20 VOC'08 object classes. Turkers had to have high hit rates and pass a language competence test before creating descriptions, leading to relatively high quality.
We obtained a set of candidate prepositions from the VOC'08 1K dataset as follows. We parsed the 5,000 descriptions with the Stanford Parser version 3.5.2 (footnote 1) with the PCFG model, extracted the nmod:prep prepositional modifier relations, and manually removed the non-spatial ones. This gave us the following set of 38 prepositions:
V = {about, above, across, against, along, alongside, around, at, atop, behind, below, beneath, beside, beyond, by, close to, far from, in, in front of, inside, inside of, near, next to, on, on top of, opposite, outside, outside of, over, past, through, toward, towards, under, underneath, up, upon, within}

Human-Selected Spatial Prepositions
We are in the process of extending the VOC'08 annotations with human-selected spatial prepositions associated with pairs of objects in images. So far we have collected spatial prepositions for object pairs in images that have exactly two objects annotated (1,020). Annotators were presented with images from the dataset; in each presentation the two objects, $Obj_1$ and $Obj_2$, were shown with their bounding boxes and labels. If there was more than one object of the same class, the labels were shown with subscript indices (objects being numbered in order of decreasing bounding-box area).
Next to the image we showed the template sentence "The $Obj_1$ is ___ the $Obj_2$" and the list of possible prepositions extracted from VOC'08 1K (see preceding section). The option 'NONE' was also available in case none of the prepositions was suitable (participants were discouraged from using it).
Each template sentence was presented twice, with the objects once in each order: "The $Obj_1$ is ___ the $Obj_2$" and "The $Obj_2$ is ___ the $Obj_1$". Participants were asked to select all correct prepositions for each pair.
Footnote 1: http://nlp.stanford.edu/software/lex-parser.shtml
The following table shows occurrence counts for the 10 most frequent object labels:

Predicting Prepositions
When looking at a 2-D image, people infer all kinds of information not present in the pixel grid, on the basis of their practice in mapping 2-D information to 3-D spaces and their real-world knowledge about the properties of different types of objects. In our research we are interested in the extent to which prepositions can be predicted without any real-world knowledge, using just features that can be computed from the objects' bounding boxes and labels. In this section we explore the predictive power of language and visual features within a Naive Bayes framework:
$$P(v_j \mid F) \propto P(v_j)\, P(F \mid v_j)$$
where $v_j \in V$ are the possible prepositions, and $F$ is the feature vector. Below we look at the predictive power of the prior model and the likelihood model as well as the complete model.

Prior Model
The prior model captures the probabilities of prepositions given ordered pairs of object labels $(L_s, L_o)$, where the normalised probabilities are obtained through a frequency count on the training set, using add-one smoothing. We then simply construe the model as a classifier to give us the most likely preposition $v_{OL}$:
$$v_{OL} = \arg\max_{v_j \in V} P(v_j \mid L_s, L_o)$$
where $v_j$ is a preposition in the set of prepositions $V$, and $L_s$ and $L_o$ are the object class labels of the first and second objects.
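As an illustration of the prior model, the following is a minimal Python sketch of add-one-smoothed frequency counts over ordered label pairs. The function names `train_prior` and `most_likely_preposition` and the toy data are hypothetical and not taken from our implementation; the sketch also assumes that smoothing is applied over the full preposition set for every label pair, including pairs never seen in training.

```python
from collections import Counter, defaultdict


def train_prior(instances, prepositions):
    """Estimate P(v | L_s, L_o) from (L_s, L_o, v) training instances.

    Assumption: add-one smoothing is applied over the whole preposition
    set, so an unseen label pair receives a uniform distribution.
    """
    counts = defaultdict(Counter)
    for subj_label, obj_label, prep in instances:
        counts[(subj_label, obj_label)][prep] += 1

    def prior(subj_label, obj_label):
        pair_counts = counts.get((subj_label, obj_label), Counter())
        # Each preposition gets one pseudo-count, hence + |V| in the total.
        total = sum(pair_counts.values()) + len(prepositions)
        return {v: (pair_counts[v] + 1) / total for v in prepositions}

    return prior


def most_likely_preposition(prior, subj_label, obj_label):
    """Return the preposition with the highest prior probability."""
    dist = prior(subj_label, obj_label)
    return max(dist, key=dist.get)


if __name__ == "__main__":
    toy_instances = [
        ("dog", "person", "near"),
        ("dog", "person", "near"),
        ("dog", "person", "in front of"),
    ]
    toy_prepositions = ["in front of", "near", "on"]
    toy_prior = train_prior(toy_instances, toy_prepositions)
    print(most_likely_preposition(toy_prior, "dog", "person"))
```

Because the label pair is ordered, ("dog", "person") and ("person", "dog") are counted separately, which matches the two template orders used during annotation.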

Likelihood Model
The likelihood model is based on a set of six geometric features computed from the image size and bounding boxes:
$F_1$: Area of $Obj_1$ (Bounding Box 1) normalized by image size.
$F_2$: Area of $Obj_2$ (Bounding Box 2) normalized by image size.
$F_3$: Ratio of area of $Obj_1$ to area of $Obj_2$.
$F_4$: Distance between bounding box centroids normalized by object sizes.
$F_5$: Area of overlap of bounding boxes normalized by the smaller bounding box.
$F_6$: Position of $Obj_1$ relative to $Obj_2$.
$F_1$ to $F_5$ are real-valued features, whereas $F_6$ is a categorical variable over four values (N, S, E, W). For each preposition, the probability distribution of each feature is estimated from the training set. The distributions for $F_1$ to $F_4$ are modelled with a Gaussian function, $F_5$ with a clipped polynomial function, and $F_6$ with a discrete distribution. The maximum likelihood model, which can also be derived from the naive Bayes model described in the next section by choosing a uniform $P(v)$ function, is given by:
$$v_{ML} = \arg\max_{v_j \in V} \prod_{i=1}^{6} P(F_i \mid v_j)$$
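To make the six geometric features concrete, the following Python sketch computes them for one ordered object pair. Several details are assumptions made for illustration only, since they are not specified above: boxes are taken to be (xmin, ymin, xmax, ymax) tuples with non-zero area, the centroid distance in F4 is normalized by the square root of the summed box areas, and the N/S/E/W value of F6 is taken from the dominant axis of the centroid offset with image y-coordinates growing downwards. The function name `geometric_features` is hypothetical.

```python
import math


def geometric_features(box1, box2, image_size):
    """Compute F1..F6 for Obj1 (box1) relative to Obj2 (box2).

    Boxes are (xmin, ymin, xmax, ymax); image_size is (width, height).
    Both boxes are assumed to have non-zero area.
    """
    def area(b):
        return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

    def centroid(b):
        return ((b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0)

    a1, a2 = area(box1), area(box2)
    image_area = image_size[0] * image_size[1]
    (cx1, cy1), (cx2, cy2) = centroid(box1), centroid(box2)
    dx, dy = cx1 - cx2, cy1 - cy2

    # Width and height of the intersection rectangle (zero if disjoint).
    ow = max(0.0, min(box1[2], box2[2]) - max(box1[0], box2[0]))
    oh = max(0.0, min(box1[3], box2[3]) - max(box1[1], box2[1]))

    # ASSUMPTION: "normalized by object sizes" is read as sqrt(a1 + a2).
    f4 = math.hypot(dx, dy) / math.sqrt(a1 + a2)

    # ASSUMPTION: F6 follows the dominant axis of the centroid offset;
    # y grows downwards, so a smaller y for Obj1 means "N" (above).
    if abs(dx) >= abs(dy):
        f6 = "E" if dx > 0 else "W"
    else:
        f6 = "S" if dy > 0 else "N"

    return {
        "F1": a1 / image_area,
        "F2": a2 / image_area,
        "F3": a1 / a2,
        "F4": f4,
        "F5": (ow * oh) / min(a1, a2),
        "F6": f6,
    }


if __name__ == "__main__":
    print(geometric_features((0, 0, 10, 10), (5, 5, 15, 15), (100, 100)))
```

Other readings of F4 and F6 are equally compatible with the feature descriptions; the sketch only fixes one of them so that the code runs.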

Naive Bayes Model
The naive Bayes classifier is derived from the maximum-a-posteriori Bayesian model, with the assumption that the features are conditionally independent given the preposition. A direct application of Bayes' rule gives the classifier based on the posterior probability distribution as follows:
$$v_{NB} = \arg\max_{v_j \in V} P(v_j \mid L_s, L_o) \prod_{i=1}^{6} P(F_i \mid v_j)$$
Intuitively, $P(v_j \mid L_s, L_o)$ weights the likelihood with the prior or state of nature probabilities.
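The combination of prior and likelihood can be sketched in Python as below. This is an illustrative sketch rather than our implementation: scores are accumulated in log space to avoid numerical underflow (a standard practice, not something stated above), only Gaussian feature models are shown, and the names `gaussian_logpdf` and `rank_prepositions` as well as the toy parameters are hypothetical.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log density of a univariate Gaussian with the given mean and std."""
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (x - mean) ** 2 / (2.0 * std ** 2)


def rank_prepositions(prior, feature_logliks, features):
    """Rank prepositions by log P(v | L_s, L_o) + sum_i log P(F_i | v).

    prior: dict mapping preposition -> P(v | L_s, L_o), all values > 0
    feature_logliks: dict mapping preposition -> list of callables,
        one per feature, each returning log P(F_i | v)
    features: list of feature values, in the same order as the callables
    """
    scores = {}
    for v, p in prior.items():
        log_likelihood = sum(f(x) for f, x in zip(feature_logliks[v], features))
        scores[v] = math.log(p) + log_likelihood
    # Highest score first; the first element plays the role of v_NB.
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    toy_prior = {"near": 0.5, "on": 0.5}
    toy_logliks = {
        "near": [lambda x: gaussian_logpdf(x, 0.0, 1.0)],
        "on": [lambda x: gaussian_logpdf(x, 5.0, 1.0)],
    }
    print(rank_prepositions(toy_prior, toy_logliks, [4.5]))
```

Passing a uniform prior reduces the ranking to the maximum likelihood model, and passing an empty list of feature models for every preposition reduces it to the prior model.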

Results
The current data set comprises 1,000 images, each labelled with one or more prepositions. The average number of prepositions per image over the whole dataset is 2.01. For training purposes, we create a separate training instance $(Obj_s, Obj_o, v)$ for each preposition $v$ selected by our human annotators for the given object pair $(Obj_s, Obj_o)$. The models are evaluated with leave-one-out cross-validation, and with two methods ($Acc_A$ and $Acc_B$) of calculating accuracy (the percentage of instances for which a correct output is returned). The notation $Acc_A(1..n)$, for example, indicates that in this version of the evaluation method at least one of the top $n$ most likely outputs (prepositions) returned by the model needs to match the (set of) human-selected reference preposition(s) for the model output to count as correct.

Accuracy method A
$Acc_A(1..n)$ returns the proportion of times that at least one of the top $n$ prepositions returned by a model for an ordered object pair is in the complete set of human-selected prepositions for the same object pair. $Acc_A$ can be seen as a system-level Precision measure. The table below shows $Acc_A(1)$ and $Acc_A(1..2)$ results for the three models:
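The definition of accuracy method A can be written as a short Python sketch. The function name `acc_a` and the toy data are hypothetical; the sketch assumes that model outputs are given as ranked lists (most likely preposition first) and references as sets, one per ordered object pair.

```python
def acc_a(ranked_outputs, references, n=1):
    """Proportion of object pairs for which at least one of the top-n
    predicted prepositions is in the human-selected reference set.

    ranked_outputs: list of ranked preposition lists, best first
    references: list of sets of human-selected prepositions
    """
    hits = sum(
        1
        for ranked, ref in zip(ranked_outputs, references)
        if set(ranked[:n]) & set(ref)
    )
    return hits / len(references)


if __name__ == "__main__":
    toy_outputs = [["near", "on"], ["on", "under"], ["in", "at"]]
    toy_refs = [{"near"}, {"under"}, {"by"}]
    print(acc_a(toy_outputs, toy_refs, n=1))
    print(acc_a(toy_outputs, toy_refs, n=2))
```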

Accuracy method B
$Acc_B(1..n)$ computes the mean of preposition-level accuracies. Accuracy for each preposition $v$ is the proportion of times that $v$ is returned as one of the top $n$ prepositions out of those cases when $v$ is in the human-selected set of reference prepositions. $Acc_B$ can be seen as a preposition-level Recall measure. Table 1 lists the $Acc_B(1..n)$ values for the $v_{NB}$ model for each $n$ up to 4; values are shown for the 13 most frequent prepositions (in order of frequency) and for the mean of all preposition-level accuracies. The last row shows the means for a version of $Acc_B$ that takes synonyms into account as described in the last section.
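Accuracy method B can be sketched in Python along the same lines. The function name `acc_b` and the toy data are hypothetical, and the sketch assumes that the mean is taken over those prepositions that occur at least once in the reference sets; the synonym-aware variant is not covered.

```python
from collections import defaultdict


def acc_b(ranked_outputs, references, n=1):
    """Per-preposition accuracy and its unweighted mean.

    For each preposition v, accuracy is the fraction of object pairs
    with v in the reference set for which v is among the top-n outputs.

    ASSUMPTION: the mean runs over prepositions that appear in at
    least one reference set.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for ranked, ref in zip(ranked_outputs, references):
        top_n = set(ranked[:n])
        for v in ref:
            totals[v] += 1
            if v in top_n:
                hits[v] += 1
    per_preposition = {v: hits[v] / totals[v] for v in totals}
    mean_accuracy = sum(per_preposition.values()) / len(per_preposition)
    return per_preposition, mean_accuracy


if __name__ == "__main__":
    toy_outputs = [["near", "on"], ["on", "near"], ["near", "by"]]
    toy_refs = [{"near", "by"}, {"near"}, {"by"}]
    print(acc_b(toy_outputs, toy_refs, n=1))
    print(acc_b(toy_outputs, toy_refs, n=2))
```

Unlike method A, this measure gives every preposition the same weight regardless of frequency, so rare prepositions that are never predicted pull the mean down.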

Discussion
Looking at the naive Bayes results in Table 1, accuracy for some prepositions (e.g. close to) improves dramatically from $Acc_B(1)$ to $Acc_B(1..4)$. This implies that where the target preposition is not ranked first, it is often ranked second, third or fourth. There are synonym effects at work, as shown by the $Acc^{Syn}$ results; but there is also competition between prepositions that are not near synonyms, as shown by the fact that $Acc_A(1..2)$ results are better than $Acc^{Syn}_A(1)$ results. For some prepositions, accuracy remains low even at $n=4$. This may reflect the general issue that human annotators use two different perspectives in selecting prepositions: (i) that of a viewer looking at the image, and (ii) that of one or both of the objects involved in the spatial relationship being described. Regarding (i), e.g. in the image in Figure 1, the dog is 'in front of' the person because it is between the viewer and the person. Regarding (ii), in other examples, a person can be 'in front of' a monitor, or one chair 'opposite' another, even when the viewer sees them both from the side.
The naive Bayes framework we have investigated here is a simple approach which is likely to be outperformed by more sophisticated ML methods. For example, in calculating the likelihood term $P(F \mid v)$, our approach assumes the features to be independent; feature weighting per preposition was not carried out; and the data set is small relative to what we are using it for.

Conclusion
We have described (i) a dataset we are developing in which object pairs are annotated with prepositions that describe their spatial relationship, and (ii) methods for automatically predicting such prepositions on the basis of features computed from image and object geometry and object class labels. We have found that on the basis of language information (object class labels) alone we can predict prepositions with 34.4% accuracy, rising to 43.9% if we count near synonyms as correct. Using both language and visual information we can predict prepositions with 51% accuracy, rising to 57.2% with near synonyms. We have also found that where the target preposition is not ranked top, it is often ranked very near the top, as can be seen from the $Acc_B$ results.
The next step in this research will be to enlarge our dataset and to apply machine learning methods such as support vector machines and neural networks to our learning task.