Describing Spatial Relationships between Objects in Images in English and French

The context for the work we report here is the automatic description of spatial relationships between pairs of objects in images. We investigate the task of selecting prepositions for such spatial relationships. We describe the two datasets of object pairs and prepositions we have created for English and French, and report results for predicting prepositions for object pairs in both of these languages, using two methods: (a) an existing approach which manually fixes the mapping from geometrical features to prepositions, and (b) a Naive Bayes classifier trained on the English and French datasets. For the latter we use features based on object class labels and geometrical measurements of object bounding boxes. We evaluate the automatically generated prepositions on unseen data in terms of accuracy against the human-selected prepositions.


Introduction
Automatic image description is important not just for assistive technology, but also for applications such as text-based querying of image databases. A good image description will, among other things, refer to the main objects in the image and the relationships between them. Two of the most important types of relationships for image description are activities (e.g. a child riding a bike), and spatial relationships (e.g. a dog in a car).
The task we investigate is predicting the prepositions that can be used to describe spatial relationships between pairs of objects in images. This is an important subtask in image description, but it is rarely addressed as a subtask in its own right. If an image description method produces spatial prepositions it tends to be as a side-effect of the overall method (Mitchell et al., 2012; Kulkarni et al., 2013), or else relationships are not between objects, but e.g. between objects and the 'scene' (Yang et al., 2011). An example of preposition selection as a separate subtask is Elliott and Keller (2013), where the mapping is rule-based.
Spatial relations also play a role in referring expression generation (Viethen and Dale, 2008; Golland et al., 2010), where the problem is, however, often framed as a content selection problem from known abstract representations of the objects and scene, and the aim is to enable unique identification of the object referred to.
Our main data source is a corpus of images (Everingham et al., 2010) in which objects have been annotated with rectangular bounding boxes and object class labels. For a subset of 1,000 of the images we also have five human-created descriptions of the whole image (Rashtchian et al., 2010).
We collected additional annotations for the images listing, for each object pair, a set of prepositions that have been selected by human annotators as correctly describing the spatial relationship between the given object pair (Section 2.3). We did this in separate experiments for both English and French.
The overall aim is to create models for the mapping from image, bounding boxes and labels to spatial prepositions as indicated in Figure 1. We compare two approaches to modelling the mapping. One is taken from previous work (Elliott and Keller, 2013) and defines manually constructed rules to implement the mapping from image geometries to prepositions (Section 3.1). The other is a Naive Bayes classifier trained on a range of features to represent object pairs, computed from image, bounding boxes and labels (Section 3.2). We report results for English and French, in terms of two measures of accuracy (Section 5).
[Figure 1: an annotated image mapped to spatial relations: beside(person($Obj_1$), person($Obj_2$)); beside(person($Obj_2$), dog($Obj_3$)); in front of(dog($Obj_3$), person($Obj_1$)).]

Data

VOC'08
The PASCAL VOC 2008 Shared Task Competition (VOC'08) data consists of 8,776 images and 20,739 objects in 20 object classes (Everingham et al., 2010). In each image, every object in one of the 20 VOC'08 object classes is annotated with six types of information, of which we use three: the object class label, the bounding box, and the 'occluded' flag. Examples of all six types of annotation can be seen in Figure 2. We use the object class labels in predicting prepositions, and for the French experiments we translated them as follows (in the same order as the English labels): l'avion, l'oiseau, le vélo, le bateau, la bouteille, le bus, la voiture, le chat, la chaise, la vache, la table, le chien, le cheval, la moto, la personne, la plante, le mouton, le canapé, le train, l'écran.
[Figure 2: example VOC'08 image with its annotations and its five VOC'08 1K descriptions: "A main holds two bikes near a beach." "A young man wearing a striped shirt is holding two bicycles." "Man with two bicycles at the beach, looking perplexed." "Red haired man holding two bicycles." "Young redheaded man holding two bicycles near beach." (BB = bounding box; image reproduced from http://lear.inrialpes.fr/RecogWorkshop08/documents/everingham.pdf.)]

VOC'08 1K
Using Mechanical Turk, Rashtchian et al. (2010) collected five descriptions each for 1,000 VOC'08 images selected randomly but ensuring even distribution over the VOC'08 object classes. Turkers had to have high hit rates and pass a language competence test before creating descriptions, leading to relatively high quality.
We obtained a set of candidate prepositions from the VOC'08 1K dataset as follows. We parsed the 5,000 descriptions with the Stanford Parser version 3.5.2 (http://nlp.stanford.edu/software/lex-parser.shtml) with the PCFG model, extracted the nmod:prep prepositional modifier relations, and manually removed the non-spatial ones. This gave us the following set of 38 prepositions:

$V_E$ = {about, above, across, against, along, alongside, around, at, atop, behind, below, beneath, beside, beyond, by, close to, far from, in, in front of, inside, inside of, near, next to, on, on top of, opposite, outside, outside of, over, past, through, toward, towards, under, underneath, up, upon, within}

For the list of French prepositions we started by compiling the list of possible translations of the English prepositions, after which we checked the list against 200 example images, which resulted in a few additions and deletions. The final list for French has the following 21 prepositions (note that there is no 1-to-1 correspondence with the English prepositions):

$V_F$ = {à côté de, à l'intérieur de, à l'extérieur de, au dessus de, au niveau de, autour de, contre, dans, derrière, devant, en dessous de, en face de, en haut de, en travers de, le long de, loin de, par delà, parmi, près de, sous, sur}

Human-Selected Spatial Prepositions
We are in the process of extending the VOC'08 annotations with human-selected spatial prepositions associated with pairs of objects in images. So far we have collected spatial prepositions for object pairs in images that have exactly two objects annotated (1,020). Annotators were presented with images from the dataset where in each image presentation the two objects, $Obj_1$ and $Obj_2$, were shown with their bounding boxes and labels. If there was more than one object of the same class, then the labels were shown with indices (numbered in order of decreasing size of bounding box).

English data
Next to the image was shown the template sentence "The $Obj_1$ is ___ the $Obj_2$", and the list of possible prepositions extracted from VOC'08 1K (see previous section). The option 'NONE' was also available in case none of the prepositions was suitable (but participants were discouraged from using it). Table 1 shows occurrence counts for the 20 object class labels, while the two columns on the left of Table 2 show how many times each preposition was selected by the annotators in the English version of the experiment. The average number of prepositions per object pair chosen by the English annotators was 2.01.
Each pair of objects was presented twice, the template incorporating the objects once in each order, "The $Obj_1$ is ___ the $Obj_2$" and "The $Obj_2$ is ___ the $Obj_1$" (see footnote 2). Participants were asked to select all correct prepositions for each pair.

French Data
The experimental design and setup were the same as for the English. The template sentence for the French data collection was "$Obj_1$ est ___ $Obj_2$", with the determiners included in the labels (see end of Section 2.2); e.g. "La plante est ___ l'écran". Table 1 shows occurrence counts for the 20 object class labels, while the two columns on the right of Table 2 show how many times each preposition was selected by the annotators in the French version of the experiment. The average number of prepositions per object pair chosen by the French annotators was 1.73.

Predicting Prepositions
When looking at a 2-D image, people infer all kinds of information not present in the pixel grid on the basis of their practice mapping 2-D information to 3-D spaces, and their real-world knowledge about the properties of different types of objects. In our research we are interested in the extent to which prepositions can be predicted without any real-world knowledge, using just features that can be computed from the image and the objects' bounding boxes and class labels.
Footnote 2: Showing objects in both orders is necessary for nonreflexive prepositions such as under, in, on, but also allows for other (unknown) factors that may influence preposition choice such as respective size of first and second object.
In this section we look at two methods for mapping language and visual image features to prepositions. Each takes as input an image in which two objects in the above object classes have been annotated with rectangular bounding boxes and object class labels, and returns as output preposition(s) that describe the spatial relationship between the two objects in the image.

Rule-based method
The rule-based method we examine is a direct implementation of the eight geometric relations defined in Visual Dependency Grammar (Elliott and Keller, 2013; Elliott, 2014). An overview is shown in Figure 3; for details see Elliott (2014, p. 13ff).
In order to implement these rules as a classifier, we pair each rule with the preposition referenced in it. In the case of surrounds, we use around instead. Two of the relations are problematic for us to implement, namely behind and in front of, because they make use of manual annotations that in fact encode whether one object is behind or in front of the other. We do not have this information available to us in our annotations.
What we do have is the 'occluded' flag (see list of VOC'08 annotations in Section 2.1 and Figure 2), which encodes whether the object tagged as occluded is partially hidden by another object. The problem is that the occluding object is not necessarily one of the two objects in the pair under consideration, i.e. the occluded object might be behind something else entirely. Nevertheless, the 'occluded' flag, in conjunction with bounding box overlap, gives us an angle on the definition of in front of ('the Z-plane relationship is dominant'); we define the two problematic relations as follows:
X in front of Y: Y is tagged 'occluded' and the overlap between X and Y is more than 50% of the bounding box area of Y.
X behind Y: X is tagged 'occluded' and the overlap between X and Y is more than 50% of the bounding box area of X.
In our pseudocode implementation for English, a is the centroid angle, P is the output list of prepositions, and 'overlap' is the area of the overlap between the bounding boxes of Object 1 and Object 2.
For evaluating the rule-based classifier against the French human-selected prepositions we translated the eight English prepositions as follows (listed in the same order as in Figure 3): sur, autour de, à côté de, en face de, au dessus de, en dessous de, devant, derrière.
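The two occlusion-based definitions above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the experiments: the `Box` layout, the function names and the argument order are assumptions, the reading that the first definition yields in front of and the second yields behind is also an assumption, and the six remaining rules of the rule-based method are omitted.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates (assumed layout)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def area(self) -> float:
        return max(0.0, self.x_max - self.x_min) * max(0.0, self.y_max - self.y_min)


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes (0 if they do not intersect)."""
    width = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    height = min(a.y_max, b.y_max) - max(a.y_min, b.y_min)
    return max(0.0, width) * max(0.0, height)


def occlusion_prepositions(x: Box, y: Box,
                           x_occluded: bool, y_occluded: bool) -> List[str]:
    """Apply only the two occlusion-based rules to the ordered pair (X, Y)."""
    overlap = overlap_area(x, y)
    prepositions = []
    # X in front of Y: Y is occluded and the overlap covers more than 50% of Y.
    if y_occluded and overlap > 0.5 * y.area:
        prepositions.append("in front of")
    # X behind Y: X is occluded and the overlap covers more than 50% of X.
    if x_occluded and overlap > 0.5 * x.area:
        prepositions.append("behind")
    return prepositions
```

Because the 'occluded' flag does not say which object does the occluding, both rules can fire on cases where the occluder is a third object; the overlap threshold only reduces that risk.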

Naive Bayes Classifier
Our second preposition selection method is a Naive Bayes Classifier. Below we describe how we model the prior and likelihood terms, before describing the whole model. The terms come together as follows under Naive Bayes:

$$P(v_j \mid F, L_s, L_o) \propto P(v_j \mid L_s, L_o) \, P(F \mid v_j)$$

where $v_j \in V$ are the possible prepositions, and $F$ is the feature vector.
[Table 3: English and French $Acc_A(1..n)$ and $Acc^{Syn}_A(1..n)$ results per model, for n = 1, 2, 3, 4.]

Prior Model
The prior model captures the probabilities of prepositions given ordered pairs of object labels $L_s$, $L_o$, where the normalised probabilities are obtained through a frequency count on the training set, using add-one smoothing.
In order to test this model separately, we simply construe it as a classifier to give us the most likely preposition $v_{OL}$:

$$v_{OL} = \arg\max_{v_j \in V} P(v_j \mid L_s, L_o)$$

where $v_j$ is a preposition in the set of prepositions $V$, and $L_s$ and $L_o$ are the object class labels of the first and second objects.
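As an illustration of the prior model, the following sketch estimates add-one smoothed probabilities from frequency counts and returns the most likely preposition for an ordered label pair. The function names, the tuple layout of the training instances and the tie-breaking by vocabulary order are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Instance = Tuple[str, str, str]  # (L_s, L_o, preposition), assumed layout


def prior_distribution(instances: Iterable[Instance], l_s: str, l_o: str,
                       vocabulary: List[str]) -> Dict[str, float]:
    """Add-one smoothed P(v | L_s, L_o) from frequency counts."""
    counts = Counter(v for s, o, v in instances if (s, o) == (l_s, l_o))
    total = sum(counts.values()) + len(vocabulary)
    # Unseen label pairs fall back to a uniform distribution.
    return {v: (counts[v] + 1) / total for v in vocabulary}


def most_likely_preposition(instances: Iterable[Instance], l_s: str, l_o: str,
                            vocabulary: List[str]) -> str:
    """Prior-only classifier: arg max over the smoothed distribution."""
    distribution = prior_distribution(instances, l_s, l_o, vocabulary)
    # max() keeps the first maximal element, so ties follow vocabulary order.
    return max(vocabulary, key=lambda v: distribution[v])
```

The label pair is ordered, so ("person", "horse") and ("horse", "person") are counted separately.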

Likelihood Model
The likelihood model is based on a set of six geometric features computed from the image size and bounding boxes:
$F_1$: Area of $Obj_1$ (Bounding Box 1) normalized by image size.
$F_2$: Area of $Obj_2$ (Bounding Box 2) normalized by image size.
$F_3$: Ratio of area of $Obj_1$ to area of $Obj_2$.
$F_4$: Distance between bounding box centroids normalized by object sizes.
For each preposition, the probability distribution of each feature is estimated from the training set. The distributions for $F_1$ to $F_4$ are modelled with a Gaussian function, $F_5$ with a clipped polynomial function, and $F_6$ with a discrete distribution.
For separate evaluation, a maximum likelihood model, which can also be derived from the Naive Bayes model described in the next section by choosing a uniform $P(v)$ function, is given by:

$$v_{ML} = \arg\max_{v_j \in V} \prod_{i=1}^{6} P(F_i \mid v_j)$$
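The features $F_1$ to $F_4$ can be computed directly from two bounding boxes and the image size, as in the sketch below. The box layout and function names are assumptions. In particular, the text does not specify how the centroid distance is normalized by object sizes; dividing by the square root of the summed box areas is one plausible choice made here purely for illustration.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed


def area(box: Box) -> float:
    x_min, y_min, x_max, y_max = box
    return (x_max - x_min) * (y_max - y_min)


def centroid(box: Box) -> Tuple[float, float]:
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def geometric_features(box1: Box, box2: Box,
                       image_width: float, image_height: float
                       ) -> Tuple[float, float, float, float]:
    """Return (F1, F2, F3, F4) for an ordered pair of boxes with non-zero area."""
    image_area = image_width * image_height
    area1, area2 = area(box1), area(box2)
    (cx1, cy1), (cx2, cy2) = centroid(box1), centroid(box2)
    distance = math.hypot(cx2 - cx1, cy2 - cy1)
    f1 = area1 / image_area          # area of Obj 1 relative to the image
    f2 = area2 / image_area          # area of Obj 2 relative to the image
    f3 = area1 / area2               # ratio of the two areas
    # Assumed normalization: square root of the summed box areas.
    f4 = distance / math.sqrt(area1 + area2)
    return (f1, f2, f3, f4)
```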

Complete Naive Bayes Model
The Naive Bayes classifier is derived from the maximum-a-posteriori Bayesian model, with the assumption that the features are conditionally independent. A direct application of Bayes' rule gives the classifier based on the posterior probability distribution as follows:

$$v_{NB} = \arg\max_{v_j \in V} P(v_j \mid L_s, L_o) \prod_{i=1}^{6} P(F_i \mid v_j)$$

Intuitively, $P(v_j \mid L_s, L_o)$ weights the likelihood with the prior or state of nature probabilities.
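The following sketch shows how a label-based prior and per-feature likelihoods can be combined in log space to rank prepositions. It is a simplified illustration under stated assumptions: all features are treated as Gaussian (the clipped polynomial and discrete distributions are not modelled), the variance floor of 1e-6 is an arbitrary choice, and all names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple

# (feature vector, L_s, L_o, preposition); assumed layout
Instance = Tuple[Sequence[float], str, str, str]


def fit_gaussians(instances: List[Instance]) -> Dict[str, List[Tuple[float, float]]]:
    """Per preposition, estimate (mean, variance) of every feature."""
    rows = defaultdict(list)
    for features, _, _, v in instances:
        rows[v].append(features)
    params = {}
    for v, vectors in rows.items():
        params[v] = []
        for values in zip(*vectors):
            mean = sum(values) / len(values)
            var = sum((x - mean) ** 2 for x in values) / len(values)
            params[v].append((mean, max(var, 1e-6)))  # assumed variance floor
    return params


def log_gaussian(x: float, mean: float, var: float) -> float:
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def rank_prepositions(features: Sequence[float], l_s: str, l_o: str,
                      instances: List[Instance],
                      vocabulary: List[str]) -> List[str]:
    """Rank prepositions by log prior plus summed log likelihoods."""
    params = fit_gaussians(instances)
    counts = Counter(v for _, s, o, v in instances if (s, o) == (l_s, l_o))
    total = sum(counts.values()) + len(vocabulary)
    scores = {}
    for v in vocabulary:
        if v not in params:
            continue  # no likelihood estimate without training examples
        log_prior = math.log((counts[v] + 1) / total)  # add-one smoothing
        log_likelihood = sum(log_gaussian(x, m, s2)
                             for x, (m, s2) in zip(features, params[v]))
        scores[v] = log_prior + log_likelihood
    return sorted(scores, key=scores.get, reverse=True)
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow; replacing the log prior with a constant gives the likelihood-only ranking.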

Evaluation Measures
We use two methods ($Acc_A$ and $Acc_B$) of calculating accuracy (the percentage of instances for which a correct output is returned). The notation $Acc_A(1..n)$ or $Acc_B(1..n)$ is used to indicate that in this version of the evaluation method at least one of the top n most likely outputs (prepositions) returned by the model needs to match one of the human-selected reference prepositions for the model output to count as correct.
Furthermore, we use the notation $Acc^{Syn}_A(1..n)$ or $Acc^{Syn}_B(1..n)$ to indicate that in this version, at least one of the top n most likely outputs (prepositions) returned by the model, or one of its near synonyms, needs to match one of the human-selected reference prepositions for the model output to count as correct.
For the rule-based selection method we do not have the ranked outputs needed to compute $Acc_A$ and $Acc_B$. Interpreting the output set $P$ directly as ranked would mean preserving the order in which prepositions are selected by rules, which is likely to be unfair to this method. Instead we randomly shuffle $P$ and then interpret it as ranked, with the first in this shuffled list giving the highest ranked output $v_{RB}$. To be on the safe side we average all results over 10 different random shuffles. Note that from n = 4 upwards, it makes no difference whether the outputs are truly ranked or not.
[Tables 4 and 5, captions: results (1..4) for the $v_{NB}$ and $v_{RB}$ models. Shown: all prepositions of frequency 20 and above (first table) or 10 and above (second table), in order of frequency. Also included are less frequent words if they are in the set of eight prepositions produced by the $v_{RB}$ method.]
Accuracy measure A: $Acc_A(1..n)$ returns the proportion of times that at least one of the top n prepositions returned by a model for an ordered object pair is in the set of all human-selected prepositions for the same object pair. $Acc_A$ can be seen as a system-level Precision measure.
Accuracy measure B: $Acc_B(1..n)$ computes the mean of preposition-level accuracies. Accuracy for each preposition $v$ is the proportion of times that $v$ is returned as one of the top n prepositions out of all cases where $v$ is in the human-selected set of reference prepositions. $Acc_B$ can be seen as a preposition-level Recall measure.
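The two measures, and the shuffle-and-average procedure for unranked rule-based outputs, can be sketched as follows. The function names and data layout (one ranked list and one reference set per ordered object pair) are assumptions, as is the choice to average $Acc_B$ over the prepositions that occur in at least one reference set; the synonym-aware variants are not shown.

```python
import random
from typing import List, Sequence, Set


def acc_a(ranked_outputs: Sequence[List[str]],
          references: Sequence[Set[str]], n: int) -> float:
    """Proportion of items with at least one top-n output in the reference set."""
    hits = sum(1 for ranked, ref in zip(ranked_outputs, references)
               if set(ranked[:n]) & ref)
    return hits / len(references)


def acc_b(ranked_outputs: Sequence[List[str]],
          references: Sequence[Set[str]], n: int) -> float:
    """Mean over prepositions of how often each is in the top n when it is a reference."""
    accuracies = []
    for v in set().union(*references):
        relevant = [ranked for ranked, ref in zip(ranked_outputs, references)
                    if v in ref]
        accuracies.append(sum(v in ranked[:n] for ranked in relevant) / len(relevant))
    return sum(accuracies) / len(accuracies)


def acc_a_unranked(output_sets: Sequence[Set[str]],
                   references: Sequence[Set[str]], n: int,
                   repeats: int = 10, seed: int = 0) -> float:
    """Average Acc_A over random shufflings of unranked output sets."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(repeats):
        ranked = []
        for outputs in output_sets:
            shuffled = sorted(outputs)  # fixed starting order for reproducibility
            rng.shuffle(shuffled)
            ranked.append(shuffled)
        total += acc_a(ranked, references, n)
    return total / repeats
```

A rare preposition counts as much as a frequent one in `acc_b`, which is why the two measures can diverge considerably on skewed data.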

Results
The current French and English data sets each comprise 1,000 images/object-pair items, each of which is labelled with one or more prepositions. For training purposes, we create a separate training instance ($Obj_s$, $Obj_o$, $v$) for each preposition $v$ selected by our human annotators for the context 'The $Obj_s$ is $v$ the $Obj_o$' (or the French equivalent). The models are trained and tested with leave-one-out cross-validation. Table 3 shows English and French $Acc_A$ and $Acc^{Syn}_A$ results for the rule-based method ($v_{RB}$), the prior model ($v_{OL}$), the likelihood model ($v_{ML}$), and the Naive Bayes model ($v_{NB}$). The main results are the $Acc_A(1)$ results, because after all a method needs to select a single preposition in order to be usable, e.g. in image description.
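The expansion of annotated items into training instances and the leave-one-out loop can be sketched as below. The names and the item layout are assumptions; in particular, holding out all instances of an item together (rather than single instances) is an assumption, since the text does not state the granularity of the held-out unit.

```python
from typing import Iterator, List, Set, Tuple

Item = Tuple[str, str, Set[str]]   # (Obj_s label, Obj_o label, selected prepositions)
Instance = Tuple[str, str, str]    # (Obj_s label, Obj_o label, one preposition)


def expand_instances(items: List[Item]) -> List[Instance]:
    """Create one training instance per human-selected preposition."""
    return [(s, o, v) for s, o, prepositions in items for v in sorted(prepositions)]


def leave_one_out(items: List[Item]) -> Iterator[Tuple[List[Instance], Item]]:
    """Yield (training instances, held-out item) for every item in turn."""
    for i, held_out in enumerate(items):
        training_items = items[:i] + items[i + 1:]
        yield expand_instances(training_items), held_out
```

Each fold trains on all other items and is scored against the full reference set of the held-out item.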
$Acc^{Syn}_A(1)$ gives an idea of how much greater a proportion of a method's outputs would be considered correct by human evaluators.
The remaining measures give various perspectives on the proportion of times a method came close to getting it right, for four degrees of 'close'. E.g. $Acc^{Syn}_A(1..4)$ shows what proportion of times one of the top 4 prepositions generated by a method, or one of their near synonyms, was in the reference set.
It is clear that the English results are more affected by synonym effects. E.g. $Acc_A(1..n)$ for English is nearly 10 percentage points lower than for French for all n, whereas this difference all but disappears for $Acc^{Syn}_A(1..n)$. Overall, the $v_{NB}$ method always achieves the best result, as expected. The $v_{ML}$ model seems to be better at English than French, whereas for $v_{OL}$ it is the other way around.
Generally, once synonyms are taken into account, the results are strikingly similar for English and French, with the exception of the $v_{ML}$ model, which does worse for French.
Tables 4 and 5 list the $Acc_B(1..n)$, n ≤ 4 and $Acc^{Syn}_A(1..n)$, n ∈ {1, 4} results for the $v_{NB}$ and $v_{RB}$ models; values are shown for the most frequent prepositions (in order of frequency) and for the mean of all preposition-level accuracies. We are not showing all prepositions partly for reasons of space, but also because for the low frequency prepositions, the models tend to underfit or overfit noticeably.
Note that here too we consider the $Acc_A(1)$ and $Acc^{Syn}_A(1)$ figures to be the main results. Among the English prepositions that $v_{NB}$ does well with (considered under the main $Acc_B(1)$ measure) are beside, near, underneath, far from, and results for on are particularly good; $v_{RB}$ does well for beside.
As for French, $v_{NB}$ does well with à côté de, contre, sur, loin de, while results for sous are particularly good. $v_{RB}$ does well for à côté de. Apart from near, underneath and contre, these are the same prepositions, semantically, as the English ones the methods do well with.

Conclusion
We have described (i) English and French datasets in which object pairs are annotated with prepositions that describe their spatial relationship, and (ii) methods for automatically predicting such prepositions on the basis of features computed from image and object geometry (visual information) and from object class labels (language information).
The main method we tested, a Naive Bayes classifier which takes both language and vision information into account, does best in terms of all evaluation methods we used, and it does better on English than on French. When evaluated separately, the prior model, which is based on language information only, outperforms the likelihood model, which is based on visual information only, in terms of the main evaluation measures $Acc_A(1)$ and $Acc^{Syn}_A(1)$. Main results in the region of 50% leave room for improvement; the fact that these go up to around 70% when the top 4 results are taken into account indicates that the method gets it nearly right a lot of the time, and suggests that better results can be expected for a smaller set of prepositions and with more sophisticated machine learning methods.
It seems clear from the results, and intuitively obvious, that a greater presence of near synonyms in the data makes for a harder modelling task. We had a principled reason for using this particular set of English prepositions: it is the set observed in the human-authored descriptions we used (see Section 2.2). In our future work we will also work with the single best prepositions chosen by annotators to describe spatial relationships. This seems likely to result in a smaller list of prepositions overall and an easier modelling task. In order to get a truer impression of the quality of results we will also carry out human evaluation.