SpatialVOC2K: A Multilingual Dataset of Images with Annotations and Features for Spatial Relations between Objects

We present SpatialVOC2K, the first multilingual image dataset with spatial relation annotations and object features for image-to-text generation, built using 2,026 images from the PASCAL VOC2008 dataset. The dataset incorporates (i) the labelled object bounding boxes from VOC2008, (ii) geometrical, language and depth features for each object, and (iii) for each pair of objects in both orders, (a) the single best preposition and (b) the set of possible prepositions in the given language that describe the spatial relationship between the two objects. Compared to previous versions of the dataset, we have roughly doubled the size for French, and completely reannotated as well as increased the size of the English portion, providing single best prepositions for English for the first time. Furthermore, we have added explicit 3D depth features for objects. We are releasing our dataset for free reuse, along with evaluation tools to enable comparative evaluation.


Introduction
Research in image labelling, description and understanding has a long tradition, but has recently seen explosive growth. Work in this area is most commonly motivated in terms of accessibility and data management, and has a range of different specific application tasks. One current research focus is the detection of relations between objects, in particular for image description generation, and the research presented here contributes to this line of work with a new dataset, SpatialVOC2K, in which object pairs in images have been annotated with spatial relations encoded as sets of prepositions, specifically for image-to-text generation. We start below with the source datasets from which we obtained the images, bounding boxes, and candidate prepositions (Section 2), followed by an overview of directory structure and file schemas (Section 3), and a summary of the annotation process (Section 4) and spatially relevant features (Section 5). We describe the two evaluation tools supplied with the dataset (Section 6), and finish with a survey of other datasets with object relation annotations (Section 7).

Source Data
Our main data source for SpatialVOC2K was the PASCAL VOC2008 image dataset (Everingham et al., 2010) in which every object belonging to one of 20 object classes is annotated with class label, bounding box (BB), viewpoint, truncation, occlusion, and identification difficulty (Everingham et al., 2010). Of these annotations we retain just the BB geometries and the class labels (aeroplane, bird, bicycle, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, tv/monitor).
We also used the VOC'08 1K corpus (Rashtchian et al., 2010), which has 5 descriptions per image obtained via Mechanical Turk for 50 images from each VOC2008 class, in order to determine an initial set of candidate prepositions for our annotations (for details see Section 4). Due to quality control measures, the VOC'08 1K descriptions are of relatively high quality with few errors.
For SpatialVOC2K, we selected all images from the VOC2008 data with exactly two or three object bounding boxes (BBs), i.e. images containing exactly two or three objects from the VOC2008 object classes. We also selected all images with four or five BBs where three BBs were of normal size and the remainder very small (bearing the VOC2008 label 'difficult'). This selection process resulted in a set of 2,026 images with 9,804 unique object pairs, each image containing between two and five BBs. For each image, we then (i) collected additional annotations (Section 4) which list, for each ordered object pair, (a) the single best, and (b) all possible prepositions that correctly describe the spatial relationship between the objects; and (ii) computed a range of spatially relevant features from the image and BB geometries, BB labels, and image depth maps (Section 5). All annotations and features are included in this dataset release.
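The image selection rule above can be sketched against VOC-style annotation XML. This is a minimal illustration, not the authors' actual selection script; it assumes the standard VOC2008 annotation schema, in which each object element carries a 'difficult' flag.

```python
# Sketch: selecting images by bounding-box count from VOC-style XML
# annotations. The thresholds mirror the selection rule described above:
# keep images with 2-3 BBs, or with 4-5 BBs of which exactly three are
# normal-sized and the rest are marked 'difficult'.
import xml.etree.ElementTree as ET

def count_objects(xml_text):
    """Return (normal, difficult) object counts in one VOC annotation."""
    root = ET.fromstring(xml_text)
    normal = difficult = 0
    for obj in root.iter("object"):
        flag = obj.findtext("difficult", default="0")
        if flag.strip() == "1":
            difficult += 1
        else:
            normal += 1
    return normal, difficult

def keep_image(xml_text):
    """Apply the selection rule described in the text."""
    normal, difficult = count_objects(xml_text)
    total = normal + difficult
    if total in (2, 3):
        return True
    if total in (4, 5) and normal == 3:
        return True
    return False
```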

SpatialVOC2K Structure and Schemas
The overall structure and file conventions of the SpatialVOC2K dataset mirror those of the VOC2008 dataset where possible. All files in the Annotations directory start with a line containing the original annotations from VOC2008.

In the Best subdirectory, the remaining lines have the pattern Object1 Object2 Preposition, where Object1 and Object2 are the exact word strings, including any subscripts, of the object labels in the first line of the file, and Preposition is the single best preposition chosen by annotators for the two given objects presented in the given order (more about object order in Section 4 below). Each pair of annotated objects is thus associated with exactly two prepositions in the Best files: the best human-selected preposition for each order.

In the All directory, files have the same structure, except that the preposition lines contain not a single preposition but as many prepositions as the human annotators selected as possible for the given ordered object pair.

The Spatial Features files also have the same basic structure, except that instead of prepositions, each ordered object pair is associated with 19 feature-value pairs (explained in Section 5; some feature values differ depending on object order).

In the following three sections, we explain how we obtained the preposition annotations and spatial features, and how the metrics encoded by the evaluation tools are defined.
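A reader of a Best file can follow the line pattern described above. The sketch below is illustrative only: the header and object labels in the test data are invented, and it assumes that a multi-word preposition occupies the remainder of the line after the two object labels.

```python
# Sketch: reading a Best annotation file as described above. The first
# line carries the original VOC2008 annotation; each following line has
# the pattern "Object1 Object2 Preposition", where the preposition may
# itself contain spaces (e.g. "next to", "in front of").
def parse_best_file(text):
    lines = text.strip().splitlines()
    header, triples = lines[0], []
    for line in lines[1:]:
        # Split off the two object labels; the rest is the preposition.
        obj1, obj2, prep = line.split(maxsplit=2)
        triples.append((obj1, obj2, prep))
    return header, triples
```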
Annotation Process

To obtain candidate prepositions for English, we collected the prepositions occurring in the VOC'08 1K descriptions (Section 2), retaining the spatial ones. This gave us 38 English prepositions: V0_E = { about, above, across, against, along, alongside, around, at, atop, behind, below, beneath, beside, beyond, by, close to, far from, in, in front of, inside, inside of, near, next to, on, on top of, opposite, outside, outside of, over, past, through, toward, towards, under, underneath, up, upon, within }.

To obtain prepositions for French, we first asked two French native speakers to compile a list of possible translations of the English prepositions, and to check these against 200 sample images randomly selected from the complete set to be annotated. This produced 21 prepositions, which were reduced to 19, based on evidence from previous work (Muscat and Belz, 2015), by eliminating prepositions that were used fewer than three times by annotators (en haut de, parmi). After the first batch of 1,020 images had been annotated, we furthermore merged prepositions which co-occur with another preposition in more than 60% of the cases in which they occur (à l'intérieur de, en dessous de), in accordance with the general sense of synonymity defined in previous work (Muscat and Belz, 2017). We found this kind of co-occurrence to be highly imbalanced: e.g. the likelihood of seeing à l'intérieur de given dans is 0.43, whereas the likelihood of seeing dans given à l'intérieur de is 0.91. We take this as justification for merging à l'intérieur de into dans, rather than the other way around, and proceed in this way for all prepositions. The process leaves a final set of 17 French prepositions: V_F = { à côté de, à l'extérieur de, au dessus de, au niveau de, autour de, contre, dans, derrière, devant, en face de, en travers de, le long de, loin de, par delà, près de, sous, sur }.

We also reduced the set of 38 English prepositions, using the same elimination process, starting with prepositions that occurred fewer than three times (toward, towards, about, across, along, outside, outside of, through, up). A further 12 prepositions were merged into others (within, inside, inside of, beside, alongside, by, against, upon, atop, on top of, beneath, under), yielding a final set of 17 English prepositions: V_E = { above, around, at, behind, below, beyond, close to, far from, in, in front of, near, next to, on, opposite, over, past, underneath }.

As discussed in more detail in previous work (Muscat and Belz, 2017), we make the domain-specific assumption that there is a one-to-one mapping from each preposition to the spatial relation (SR) it denotes (whereas an SR can map to multiple prepositions). While our machine learning task is SR detection, we ask annotators to annotate our data with the corresponding prepositions (a more human-friendly task).
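The co-occurrence-based merging step can be sketched as below. This is an interpretation of the rule described in the text, not the authors' code: preposition a is merged into b when the two co-occur in more than 60% of a's total occurrences, and the merge direction follows the asymmetry of the conditional probabilities (as in the dans / à l'intérieur de example). All counts in the test are invented.

```python
# Sketch of the merging rule: merge a into b when P(b|a) > 0.6 and the
# co-occurrence is imbalanced in b's favour, i.e. P(b|a) > P(a|b).
def merge_candidates(counts, cooc, threshold=0.6):
    """counts: {prep: total occurrences};
    cooc: {(a, b): times a and b were selected together}.
    Returns a list of (a, b) pairs meaning 'merge a into b'."""
    merges = []
    for (a, b), n in cooc.items():
        p_b_given_a = n / counts[a]
        p_a_given_b = n / counts[b]
        if p_b_given_a > threshold and p_b_given_a > p_a_given_b:
            merges.append((a, b))
    return merges
```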
We used the above preposition sets in collecting annotations as follows. For each object pair Obj_i and Obj_j in each image, and for both orderings of the object labels, (L_i, L_j) and (L_j, L_i), the task for annotators was to select (i) the single best preposition for the given pair (free text entry), and (ii) all possible prepositions for the given pair (selected from a given list) that accurately described the relationship between the two objects, given the template L1 is L2 (is becomes est in French). Even though in annotation task 1 annotators were not limited in their choice of preposition, they did not use any that were not in the list of prepositions offered in annotation task 2 (we corrected a few typos manually). As it would have been virtually impossible to remember the exact list of prepositions and use only those, we interpret this as meaning that annotators did not feel other prepositions were needed.
We used average pairwise kappa to assess inter-annotator and intra-annotator agreement, as described in previous work (Muscat and Belz, 2017). We first report figures for the first batch of French annotations (1,020 images with 2 or 3 BBs). For single best prepositions (annotation task 1), average inter-annotator agreement was 0.67, and average intra-annotator agreement was 0.81. For all possible prepositions (annotation task 2), average inter-annotator agreement was 0.63, and average intra-annotator agreement was 0.77.
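A simplified reading of the agreement measure used here is average pairwise Cohen's kappa over annotators' single-best choices; the sketch below assumes that reading (the cited work may define the measure differently in detail).

```python
# Sketch: Cohen's kappa between two annotators' label sequences,
# averaged over all annotator pairs.
from collections import Counter
from itertools import combinations

def cohen_kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the two annotators' label distributions.
    expected = sum(freq_a[l] * freq_b[l] for l in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

def average_pairwise_kappa(annotations):
    """annotations: list of per-annotator label sequences."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)
```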
For the second batch of French annotations (1,006 images with 3, 4 or 5 BBs), average inter-annotator agreement for single best prepositions (annotation task 1) was 0.33, and average intra-annotator agreement was 0.66. For all possible prepositions (annotation task 2), average inter-annotator agreement was 0.3, and average intra-annotator agreement was 0.62. A possible reason for the lower annotator agreement on batch 2 is that as the number of dominant objects in an image increases, the annotation task becomes more difficult; we also used different annotators for the second batch, which may be a contributing factor.


Spatially Relevant Features

F0 and F1 are language features: distributional word vectors (Pennington et al., 2014) for the object labels. F2-F14 are visual features measuring various aspects of the geometries of the image and the two bounding boxes (BBs). Most features express a property of just one of the objects, but F4-F9 express a property of both objects jointly. For example, F6 is the area of overlap of the two BBs normalized by the area of the smaller BB (range [0, 1]); F7 is the distance between the BB centroids divided by half the sum of the square roots of the BB areas, i.e. the approximated average BB width (range [0, ~20]); and F8 is the position of Obj_s relative to Obj_o, expressed as one of 4 categories depending on the angle with the vertical axis. F17 and F18 are the average pixel-level depth value within the BB of Obj_s and Obj_o, respectively. Pixel-level depth values were computed via the method described by Birmingham et al. (2018), which uses depth maps computed with monoDepth (Godard et al., 2017).
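Two of the geometric features lend themselves to a short sketch. Bounding boxes are assumed to be (xmin, ymin, xmax, ymax) tuples; the exact definitions in the released feature files may differ in detail from this illustration.

```python
# Sketch of two geometric features: F6, the BB overlap area normalized
# by the smaller BB's area, and F7, the centroid distance divided by
# half the sum of the square roots of the two BB areas (an approximated
# average BB width).
import math

def area(b):
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])

def f6_norm_overlap(b1, b2):
    # Overlap rectangle width and height (zero if the BBs are disjoint).
    ox = max(0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    oy = max(0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    return (ox * oy) / min(area(b1), area(b2))

def f7_norm_distance(b1, b2):
    c1 = ((b1[0] + b1[2]) / 2, (b1[1] + b1[3]) / 2)
    c2 = ((b2[0] + b2[2]) / 2, (b2[1] + b2[3]) / 2)
    d = math.hypot(c1[0] - c2[0], c1[1] - c2[1])
    return d / ((math.sqrt(area(b1)) + math.sqrt(area(b2))) / 2)
```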

Evaluation Tools
SpatialVOC2K includes two evaluation tools which we have used in all previous work involving similar data. The two tools, systemAccuracy and relationPrecision, implement the following two methods, respectively.
System-level Accuracy: There are four different variants of system-level Accuracy, denoted Acc(n), n ∈ {1, 2, 3, 4}. Each variant returns Accuracy rates for the top n outputs returned by systems, in the sense that a system output is considered correct if at least one of the reference prepositions (the human-selected prepositions from the dataset annotations) can be found in the top n prepositions returned by the system (for n = 1 this yields standard Accuracy).
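The Acc(n) measure can be sketched as below. The data structures (ranked lists and reference sets) are assumptions for illustration, not the released tool's actual interface.

```python
# Sketch of Acc(n): a system output for an object pair counts as
# correct if any of the reference (human-selected) prepositions appears
# among the system's top-n prepositions; n = 1 gives standard Accuracy.
def acc_n(system_rankings, references, n):
    """system_rankings: list of ranked preposition lists, one per pair;
    references: list of sets of human-selected prepositions."""
    correct = sum(
        bool(set(ranked[:n]) & refs)
        for ranked, refs in zip(system_rankings, references)
    )
    return correct / len(references)
```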
Weighted Average Per-preposition Precision: This measure, denoted Acc_P, computes the weighted mean of individual per-preposition precision scores. The individual per-preposition precision for a given system and a given preposition p is the proportion of times that p is among the corresponding human-selected prepositions, out of all the times that p is returned as the top-ranked preposition by the system.
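A sketch of the weighted average per-preposition precision follows. The text does not specify the weighting scheme, so weighting each preposition's precision by its frequency in the human-selected reference annotations is an assumption here, as are the interfaces.

```python
# Sketch of Acc_P: per-preposition precision is the fraction of pairs
# where the system's top preposition p is also among the human-selected
# prepositions; the mean is weighted by p's frequency in the gold
# annotations (an assumed weighting).
from collections import defaultdict

def acc_p(top_predictions, references):
    """top_predictions: top-ranked preposition per object pair;
    references: set of human-selected prepositions per pair."""
    returned = defaultdict(int)
    correct = defaultdict(int)
    gold = defaultdict(int)
    for p, refs in zip(top_predictions, references):
        returned[p] += 1
        if p in refs:
            correct[p] += 1
        for r in refs:
            gold[r] += 1
    total_gold = sum(gold.values())
    return sum(
        (gold[p] / total_gold) * (correct[p] / n)
        for p, n in returned.items()
    )
```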

Related Datasets
A number of datasets are available that incorporate annotations representing relations between objects in images. Types of relationships that have been annotated include actions (e.g. person kicks ball), other verbal relations (person wears shirt), spatial relations (person on horse), and comparative relations (one car bigger than another). In this section, we provide a brief overview of available datasets with relation annotations, in terms of their stated purpose (application task), the types of relations included, the range of spatial prepositions included, as well as size and other properties of the dataset. Table 2 has a summary of the datasets.
Visual Phrases (Sadeghi and Farhadi, 2011) was the first image dataset with object relation annotations, and used the concept of a visual phrase (VP), defined as a bounding box that surrounds two objects in an image. Out of 17 different types of VPs annotated in the dataset, 13 comprise two objects and 4 comprise one object. However, there are 120 predicates per object category.
Visual and Linguistic Treebank (Elliott and Keller, 2013) contains 341 images that are annotated with regions (362 in total) and visual dependency representations, which unfold to a total of 5,748 spatial relations (from a set of 8) and are aligned to the dependency parse of the image description. This setup allows for the prediction of actions as well as spatial relations (using a set of 8 manually created rules).
Scene Graphs (Johnson et al., 2015) is a dataset of 5,000 human-generated scene graphs grounded to images; scene graphs describe objects and their relationships.
ViSen (Ramisa et al., 2015) associates sets of (object 1, preposition, object 2) triples with images, where the triples have been extracted from parses of the image descriptions in MSCOCO (Lin et al., 2014) and Flickr30k (Young et al., 2014). Prepositions covered include all those extracted from the image descriptions, including non-spatial ones. Since by no means all descriptions contain prepositions, not all images have spatial relation annotations; moreover, the task addressed is preposition prediction, not spatial relation prediction.
Visual Relationships Dataset (VRD) (Lu et al., 2016) contains 5,000 images, 100 object categories, 6,672 unique relationships, and 24.25 relations per object category. Scant information is available about how the dataset was created other than that relations broadly fit into the categories action, verbal, spatial, preposition and comparative.
Visual Genome (Krishna et al., 2017) contains 108K images, split into 4M regions, corresponding to 108K scene graphs and about 4K region graphs, 1.5M object-object relations, 40K unique relations, and an average of 17 relations per image and 0.63 relations per region.