What goes into a word: generating image descriptions with top-down spatial knowledge

Generating grounded image descriptions requires associating linguistic units with their corresponding visual clues. A common method is to train a decoder language model with an attention mechanism over convolutional visual features. Attention weights align the stratified visual features arranged by their location with tokens, most commonly words, in the target description. However, words such as spatial relations (e.g. next to and under) do not refer directly to geometric arrangements of pixels but to complex geometric and conceptual representations. The aim of this paper is to evaluate what representations facilitate generating image descriptions with spatial relations and lead to better grounded language generation. In particular, we investigate the contribution of three different representational modalities in generating relational referring expressions: (i) pre-trained convolutional visual features, (ii) different top-down geometric relational knowledge between objects, and (iii) world knowledge captured by contextual embeddings in language models.


Introduction
Spatial recognition and reasoning are essential bases for visual understanding. Automatically generating descriptions of scenes involves both recognising objects and their spatial configuration. This project follows up on recent attempts to improve language generation and understanding by using spatial modules in the fusion of vision and language (Xu et al., 2015; Johnson et al., 2016; Lu et al., 2017; Hu et al., 2017; Anderson et al., 2018) (see also Section 6).
Generating spatial descriptions is an important part of the image description task which requires several types of knowledge obtained from different modalities: (i) invariant visual clues for object identification; (ii) the geometric configuration of the scene, representing relations between objects relative to the size of the environment; (iii) object-specific functional relations that capture interaction between objects and are formed by our knowledge of the world, for example "an umbrella is over a man" is true if the umbrella in question serves its function, protecting the man from the rain (Coventry et al., 2001); and (iv) for projective relations (e.g. "to the left of" and "above") but not topological relations (e.g. "close" and "at"), the frame of reference, which can be influenced by other modalities such as scene attention and dialogue interaction (Dobnik et al., 2015). Work in cognitive psychology (Logan, 1994, 1995) argues that while object identification may be pre-attentive, identification of spatial relations is not and is accomplished by top-down mechanisms of attention after the objects have been identified. It is also the case that we do not identify all possible relations between objects but only those that are attended by such top-down mechanisms considering different kinds of high-level knowledge. Experiments on training neural recurrent language models in a bottom-up fashion from data demonstrated that spatial relations are frequently not learned to be grounded in visual inputs (Lu et al., 2017; Tanti et al., 2018a), which has been attributed to the design choices of these models that primarily focus on identification of objects (Kelleher and Dobnik, 2017). Therefore, targeted integration of different modalities is required to capture the properties from (i) to (iv). We can do this top-down (Anderson et al., 2018; Hu et al., 2017; Liu et al., 2017).
However, it is not immediately obvious what kind of top-down spatial knowledge will benefit the bottom-up models most. Therefore, in this paper we investigate the integration of different kinds of top-down spatial knowledge beyond object localisation, represented as features, with the bottom-up neural language model. The paper is organised as follows. In Section 2, we discuss how spatial descriptions are constructed and what components are required to generate descriptions. In Section 3, the neural networks' design is explained. In Section 4, we explain what dataset is used for this study, what preprocessing was applied to it and how the models are trained. Then the experiments and evaluation results are presented in Section 5. The related work in relation to our methods and findings is discussed in Section 6. The conclusion is given in Section 7.
Figure 1: ⟨"teddy bear", "partially under", "go cart"⟩

Generating Spatial Descriptions
When describing a scene, there are several ways to construct spatial descriptions referring to objects and places and their relation with each other. A spatial description has three parts: a TARGET and a LANDMARK referring to objects or places and a RELATION denoting the location of the target in relation to the landmark (Logan and Sadler, 1996). 2 In the example in Figure 1 these are as follows: there is a teddy bear (TARGET) partially under (RELATION) a go cart (LANDMARK). Therefore, generating such a description requires: (a) Identification of objects and their locations: the target is what we want to describe and the landmark is what we will relate the target to; the salience of the landmark is important for the hearer. (b) Grounding of the relation in geometric space: the spatial relation is expressed relative to the landmark which grounds a 3-dimensional coordinate system; furthermore, for projective relations, the coordinate system is aligned with the orientation of the external viewpoint which determines the frame of reference (Maillat, 2003). (The viewpoint may also be the landmark object itself, in which case the coordinate system is oriented in the same way as the landmark.) (c) Grounding in function: a spatial relation may also be selected based on the functional properties between target and landmark objects, e.g. the difference between "the teapot is over the cup" and "the teapot is above the cup" (Coventry et al., 2001).
2 Sometimes these are also known as referent and relatum (Miller and Johnson-Laird, 1976), figure and ground (Talmy, 1983) or the located object and the reference object (Herskovits, 1986; Gapp, 1994; Dobnik, 2009).
Generating spatial descriptions requires knowing the intended target object and how we want to convey its location to the listener. The bottom-up approach in image captioning is focused on learning the salience of objects and events to generate captions expressed in the dataset (e.g. Xu et al. (2015)). The combination of bottom-up and top-down approaches for generating descriptions uses modularisation in order to improve the generation of descriptions of different kinds (e.g. You et al. (2016)). However, as we have seen in the preceding discussion, the generation of spatial descriptions requires highly specific geometric knowledge. How is this knowledge approximated by the bottom-up models? To what degree can we integrate this knowledge with the top-down models? In this paper, we investigate these questions in a language generation task by comparing different variations of included top-down spatial knowledge. More specifically, for each image, we generate a description for every pair of objects that are localised in the image. We consider a variety of top-down spatial knowledge representations about objects as inputs to the model: (a) explicit object localisation and extraction of visual features; (b) explicit identification of the target-landmark by specifying their order in the feature vector; and (c) explicit geometric representation of objects in a 2D image. We investigate the contribution of each of these sets of features to the generation of image descriptions.

Neural Network Design
Our method is to add modules and configurations to the network step by step, each providing a different kind of the top-down knowledge discussed in Section 2, and to investigate the performance of such configurations. There are several design choices with small effects on the performance but costly in terms of parameter size (Tanti et al., 2018b). Therefore, if there is no research question related to a choice, we take the simplest choice as reported in previous work such as (Lu et al., 2017; Anderson et al., 2018). We use the following configurations:
1. Simple bottom-up encoder-decoder;
2. Bottom-up object localisation with attention;
3. Top-down object annotated localisation;
4. Top-down target and landmark assignment;
5. Two methods of top-down representation of geometric features (s-features).
These five configurations give us 10 variations of the model design as shown in Table 1. A detailed definition of each module is given in Appendix A in the supplementary material.
Generative language model We use a simple forward recurrent neural model with cross-entropy loss in all model configurations.
Simple encoder-decoder An encoder-decoder architecture without spatial attention, shown in Figure 3a and similar to (Vinyals et al., 2015), is the simplest baseline for fusing vision and language. The input to the model is an image and the start symbol <s> of a description, and the output is produced by the language model decoder. The embeddings are randomly initialised and learned as a parameter set of the model. The visual vectors are produced by a pre-trained ResNet50 (He et al., 2016). A multi-layer perceptron module ($F_v$ in Figure 2) is used to fine-tune the visual features.
Bottom-up localisation With visual features representing all regions of the image as in Figure 2, the attention mechanism is used as a localisation module. We generalised the adaptive attention introduced in (Lu et al., 2017) to be able to fuse the modalities. As shown in Figure 3b, the interaction between the attention mechanism and the language model is more similar to (Anderson et al., 2018): two layers of stacked LSTM, the first stack ($LSTM_a$) to produce the features for the attention model and the second stack ($LSTM_l$) to produce contextualised linguistic features which are fused with the attended visual features. This design is easier to extend with additional top-down vectors.
Top-down localisation Unlike the bottom-up unsupervised localisation, the top-down method relies on a list of regions of interest (ROI) provided by external procedures. For example, the region proposals can come from another bottom-up task as in (Anderson et al., 2018; Johnson et al., 2016), which use a Faster R-CNN (Ren et al., 2015) to extract possible regions of interest from the ConvNets regions in Figure 2. Here, as shown in Figure 4, we use the bounding box annotations of objects in images as the top-down localisation knowledge and then extract ResNet50 visual features from these regions. In the first stage, the top-down visual representation only proposes visual vectors of the two objects in a random order without their spatial role as targets and landmarks in the descriptions. The model is shown in Figure 3d.
Top-down target-landmark assignment In the second iteration of the top-down localisation module we assign semantic roles to regions as targets and landmarks. This is directly related to localisation as spatial relations are asymmetric. We encode this top-down knowledge by fixing the order of the regions in the feature vector. The first object is the target and the second object is the landmark. Otherwise, the model is the same as in the previous iteration shown in Figure 3d.
Top-down geometric features The localisation procedure of objects discussed previously does not provide any geometric information about the relation between the two regions. However, top-down geometric features are required for grounding spatial relations where the location of the target object is expressed relative to the landmark. For example, a simple (but by no means sufficient) geometric relation between two bounding boxes can be represented by an arrow from the centre of one bounding box to the centre of the other and by ordering the information about bounding boxes in the feature vector as in the previous model to encode target-landmark asymmetry. The network architecture of the model with top-down geometric features expressing relations between the objects is shown in Figure 3e. We consider two different representations of geometric features: Mask features and VisKE features (Figure 5b), where $dx$, $dy$ are changes in the coordinates of the centres, $ov$, $ov_1$, $ov_2$ the overlapping areas (total, relative to the first, and relative to the second bounding box), $h_1$, $h_2$ heights, $w_1$, $w_2$ widths and $a_1$, $a_2$ areas. Note that Mask features provide geometric information about the size and the location of objects relative to the picture frame, while VisKE features provide more detailed geometric information that expresses the relation between the objects. The latter therefore more closely match the features that were identified in spatial cognitive models. A feed-forward network with two layers ($F_s$) is used to project geometric features into a vector with the same dimensionality as the $F_v$ outputs so that different modalities are comparable in the weighted-sum model of attention. The geometric relation between the two bounding boxes is represented with features from (Sadeghi et al., 2015).
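The relational features named above (centre displacement, overlaps, heights, widths and areas) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `pair_features`, the assumption that (x, y) is the top-left corner of a box, and the choice of measuring the displacement from the target centre to the landmark centre are all hypothetical.

```python
from typing import Dict, Tuple

# (x, y, w, h); (x, y) is ASSUMED to be the top-left corner, values normalised to the image size
Box = Tuple[float, float, float, float]


def pair_features(target: Box, landmark: Box) -> Dict[str, float]:
    """Hypothetical geometric features for an ordered (target, landmark) pair of boxes."""
    x1, y1, w1, h1 = target
    x2, y2, w2, h2 = landmark
    # Displacement between box centres (ASSUMED direction: target -> landmark).
    dx = (x2 + w2 / 2) - (x1 + w1 / 2)
    dy = (y2 + h2 / 2) - (y1 + h1 / 2)
    # Areas of the two boxes.
    a1, a2 = w1 * h1, w2 * h2
    # Width and height of the intersection rectangle (zero if the boxes are disjoint).
    iw = max(0.0, min(x1 + w1, x2 + w2) - max(x1, x2))
    ih = max(0.0, min(y1 + h1, y2 + h2) - max(y1, y2))
    ov = iw * ih
    return {
        "dx": dx, "dy": dy,
        "ov": ov,                           # total overlapping area
        "ov1": ov / a1 if a1 > 0 else 0.0,  # overlap relative to the first box
        "ov2": ov / a2 if a2 > 0 else 0.0,  # overlap relative to the second box
        "h1": h1, "h2": h2, "w1": w1, "w2": w2, "a1": a1, "a2": a2,
    }


features = pair_features((0.0, 0.0, 0.5, 0.5), (0.25, 0.25, 0.5, 0.5))
```

Because the boxes are passed in a fixed order, swapping target and landmark flips the sign of dx and dy, which is one simple way the target-landmark asymmetry shows up in such a vector.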

Dataset and Training
We use the relationship dataset in Visual Genome (Krishna et al., 2017), which is a collection of referring expressions represented as triplets ⟨subject, predicate, object⟩ on 108K images. Unlike image captioning datasets such as MSCOCO (Chen et al., 2015) and Flickr30K (Plummer et al., 2015) where only 5 captions are given for each image, each image in this dataset is annotated with 50 phrases. The annotators were asked to annotate relations between two given bounding boxes of subject and object by freely writing the text for each of the three parts of the annotation. The bounding boxes were produced by another annotation procedure which detected objects in the images. In total, there are 2,316,104 annotations of 664,805 unique triplets, 35,744 unique labels of subjects and 21,299 unique labels of objects, most of which consist of multiple tokens. We omit all repetitions of triplets on each image, which leaves a total of 1,614,055 annotations. 3
Spatial relations Based on the lists of spatial prepositions in (Landau, 1996) and (Herskovits, 1986), we have created a dictionary of spatial relations and their possible multi-word variants including their composite forms. This dictionary contains 7,122 entries of 235 relations (e.g. right to represent both on the right hand side of and to the right of). Of these only 202 are found in the Visual Genome dataset, covering 79 spatial relations. 328,966 unique triplets in Visual Genome are based on exactly one of these terms, which covers 49.4% of all possible relationships. 4
Bounding boxes Each bounding box is a tuple of 4 numbers (x, y, w, h). We normalise the numbers to the range of (0, 1) relative to the image size to create geometric feature vectors (Section 3). The image is split into a grid with 7x7 cells to which bounding boxes are mapped, one bounding box potentially covering more than one cell. With this bounding box granularity, there are exactly 308,330 possible bounding boxes. However, only 151,974 are observed in the relationships dataset.
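The normalisation of a bounding box and its mapping onto the 7x7 grid can be sketched as follows. This is an illustrative sketch rather than the authors' code: the helper names, the top-left convention for (x, y), integer pixel coordinates, and the flattened binary mask representation are assumptions.

```python
from typing import List, Tuple

# (x, y, w, h) in pixels; (x, y) is ASSUMED to be the top-left corner
PixelBox = Tuple[int, int, int, int]


def normalise_box(box: PixelBox, image_w: int, image_h: int) -> Tuple[float, float, float, float]:
    """Scale pixel coordinates to the unit square relative to the image size."""
    x, y, w, h = box
    return (x / image_w, y / image_h, w / image_w, h / image_h)


def covered_cells(box: PixelBox, image_w: int, image_h: int, grid: int = 7) -> List[Tuple[int, int]]:
    """Return the (row, col) grid cells that the box overlaps, using integer arithmetic."""
    x, y, w, h = box

    def span(start: int, length: int, size: int) -> Tuple[int, int]:
        # First and last cell index touched by the half-open interval [start, start + length).
        first = min(start * grid // size, grid - 1)
        last = min(((start + length) * grid - 1) // size, grid - 1)
        return first, max(first, last)

    c0, c1 = span(x, w, image_w)
    r0, r1 = span(y, h, image_h)
    return [(r, c) for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]


def grid_mask(box: PixelBox, image_w: int, image_h: int, grid: int = 7) -> List[int]:
    """Flattened binary mask (hypothetical representation) with 1 for every covered cell."""
    mask = [0] * (grid * grid)
    for r, c in covered_cells(box, image_w, image_h, grid):
        mask[r * grid + c] = 1
    return mask


cells = covered_cells((100, 50, 200, 100), 700, 700)
```

In this example a single box covers four cells, which illustrates the point that one bounding box may span more than one cell of the grid.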
3 The repetitions include reflexive expressions (e.g. horse next to horse), annotations of several objects of the same type (e.g. cup on table), and repetitions due to several bounding box annotations of the same objects with different sizes.
4 Other triplets in Visual Genome also have spatial content. Some of them include modifiers such as partially under as in Figure 1 and some of them are descriptions of an event or an action such as sitting on and jumping over. Some annotated relationships are verbs such as flying with less obvious spatial denotation.
The spatial bias in the dataset was studied in . The most frequent spatial relation in the dataset is "on" (over 450K instances), the second place is "in" (150K instances), then "with", variations of "behind", "near", "top", "next", "under", "front", and "by" (less than 10K instances each).
The spatial distribution of paired objects reflects how natural pictures are framed and how related objects are understood by annotators.
Pre-processing We first removed duplicate triplets describing the same image. Then we converted each triplet into a word sequence by concatenating the strings and de-tokenising them with the white space separator. This produced a corpus with a vocabulary of 26,530 types with a maximum sequence length of 16 tokens and on average 15 referring expressions per image. We use 95% of the descriptions for training and 5% for validation and testing (5,230 images with 80,231 triplets).
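The two pre-processing steps, removing duplicate triplets per image and turning each triplet into a whitespace-separated word sequence, can be sketched as below. The function name, the lower-casing, and the end symbol `</s>` are assumptions made for illustration; only the start symbol `<s>` is mentioned in Section 3.

```python
from typing import Iterable, List, Tuple

Annotation = Tuple[int, str, str, str]  # (image_id, subject, predicate, object)


def triplets_to_sequences(annotations: Iterable[Annotation]) -> List[Tuple[int, List[str]]]:
    """Drop repeated triplets within an image and convert the rest to token sequences."""
    seen = set()
    sequences = []
    for image_id, subject, predicate, obj in annotations:
        # Normalise case and internal whitespace (ASSUMPTION: the corpus is lower-cased).
        parts = [" ".join(p.lower().split()) for p in (subject, predicate, obj)]
        key = (image_id, *parts)
        if key in seen:
            continue  # the same triplet was already annotated on this image
        seen.add(key)
        # Concatenate the three strings and split on white space.
        words = " ".join(parts).split()
        sequences.append((image_id, ["<s>"] + words + ["</s>"]))
    return sequences


corpus = triplets_to_sequences([
    (1, "teddy bear", "partially under", "go cart"),
    (1, "Teddy bear", "partially under", "go cart"),  # duplicate on the same image
    (2, "teddy bear", "partially under", "go cart"),  # same triplet, different image
])
```

Duplicates are keyed on the image identifier together with the normalised triplet, so the same description may still occur on different images.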
Training We use Keras (Chollet et al., 2015) with the TensorFlow backend (Abadi et al., 2015) to implement and train all of the neural network architectures in Section 3. The models are trained with the Adam optimiser (Kingma and Ba, 2014) ($\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$) with a batch size of 128 for 15 epochs.

Evaluation
All implementations are available online 5 . Figure 6 shows generated descriptions for two examples of unseen pictures from the test dataset by five models. The generated word sequence is that with the lowest loss using beam search with k = 5. The first example shows exactly how top-down localisation of objects is important especially if the goal is to refer to specific objects in the scene. In the second example, the visual features inside the bounding box are confusing for all 5 models. More examples are in Figure 13 in the Appendix.
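Decoding with beam search of width k = 5 can be illustrated with a toy sketch. The `next_probs` callback standing in for the trained language model, the end symbol `</s>`, and the use of the total (not length-normalised) negative log-likelihood as the loss are assumptions for illustration only.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[List[str], float]  # (tokens, accumulated negative log-likelihood)


def beam_search(next_probs: Callable[[List[str]], Dict[str, float]],
                k: int = 5, max_len: int = 16,
                start: str = "<s>", end: str = "</s>") -> Hypothesis:
    """Return the hypothesis with the lowest total loss found with beam width k."""
    beams: List[Hypothesis] = [([start], 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Expand every open hypothesis by every word with non-zero probability.
        candidates = [(tokens + [word], loss - math.log(p))
                      for tokens, loss in beams
                      for word, p in next_probs(tokens).items() if p > 0]
        candidates.sort(key=lambda c: c[1])
        beams = []
        for tokens, loss in candidates[:k]:  # keep only the k best expansions
            (finished if tokens[-1] == end else beams).append((tokens, loss))
        if not beams:
            break
    return min(finished or beams, key=lambda c: c[1])


# Toy next-word distributions conditioned on the previous token only (hypothetical values).
TABLE = {
    "<s>": {"bat": 0.6, "man": 0.4},
    "bat": {"in": 0.9, "</s>": 0.1},
    "in": {"hand": 1.0},
    "hand": {"</s>": 1.0},
    "man": {"</s>": 1.0},
}
best_tokens, best_loss = beam_search(lambda tokens: TABLE[tokens[-1]], k=5)
```

A real decoder would query the trained network at each step instead of a lookup table, and whether the loss is length-normalised is a design choice that this sketch does not settle.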

Overall Model Performance
Hypothesis Top-down spatial knowledge improves the model performance. We consider three categories of top-down spatial knowledge: (i) top-down localisation of regions of interest; (ii) top-down assignment of semantic roles to regions; and (iii) two kinds of geometric feature vectors.
Method After training the models we evaluate them by calculating the average word-level cross-entropy loss on held-out instances in the test set 6. We also calculate the loss on descriptions containing specific spatial relations for qualitative understanding of the effects of each type of top-down knowledge.
5 https://gu-clasp.github.io/generate_spatial_descriptions/
6 Equivalent to log-perplexity of the language model.
Figure 6: Descriptions generated by five models for two examples. ⟨"bat", "over", "shoulder"⟩: simple: player; bu49: man wearing shirt; td: bat in hand; td order: bat in hand; td order +VisKE: bat in hand. ⟨"hood", "above", "oven"⟩: simple: window; bu49: pot on stove; td: oven has door; td order: vent above sink; td order +VisKE: cabinet has door.
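The evaluation measure, the average word-level cross-entropy loss and its equivalence to log-perplexity, can be sketched as follows. The function names are hypothetical, and averaging over all words in the test set (rather than per description) is an assumption.

```python
import math
from typing import List


def average_word_loss(correct_word_probs: List[List[float]]) -> float:
    """Mean negative log-probability that the model assigns to each reference word.

    correct_word_probs[i][t] is the probability given to the t-th reference word
    of the i-th held-out description (ASSUMED to be averaged over all words).
    """
    total, count = 0.0, 0
    for description in correct_word_probs:
        for p in description:
            total -= math.log(p)
            count += 1
    return total / count


def perplexity(loss: float) -> float:
    """Perplexity is the exponential of the average cross-entropy loss."""
    return math.exp(loss)


loss = average_word_loss([[0.5, 0.5], [0.25]])
ppl = perplexity(loss)
```

Restricting the input to descriptions that contain a given spatial relation gives the per-relation loss used for the qualitative comparison.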

Results
The overall loss of each model on the unseen descriptions of images is shown in Figure 7. The fully bottom-up model with no spatial attention (simple) has the highest loss. The loss of the model variations with bottom-up localisation (bu49) is higher than that of the models with top-down localisation. The models with the top-down assignment of TARGET-LANDMARK achieve the best results. The effect of top-down geometric features is not significant.
Discussion The top-down localisation (td) clearly improves the performance of the language models compared to purely bottom-up representations. However, additional top-down assignment of TARGET-LANDMARK (td order) and the additional geometric arrangement of bounding box features (Mask and VisKE) have only a small positive effect on overall performance. The overall performance is not representative of how these configurations affect the grounding of spatial relations. More specifically, the imbalance of certain groups of relations (especially a generally lower proportion of geometrically biased relations such as "left" and "right" in this dataset and the presence of relations with minimal spatial content such as has, wearing) makes it harder to draw conclusions about the overall performance of the models. We further examine two groups of some frequent spatial relations. Relations such as inside and near represent one group and above and below represent the other. Some top-down knowledge (as represented by our features) is less informative for the first group but is informative for the second group. For example, near does not require the assignment of TARGET-LANDMARK roles: we observe that td order does not perform better than td. On the other hand, inside is sensitive to TARGET-LANDMARK assignment. However, since the relation is also restricted by the choice of objects (only certain objects can be inside others), their TARGET-LANDMARK assignment may already be inferred by a language model without such top-down knowledge. For the second group, the top-down knowledge about the semantic role of objects is important. However, left and right are among the least frequent relations in the dataset, which is demonstrated by the fact that their descriptions have a higher loss than above and below. For these relations the loss of the simple model is much higher than that of other configurations. It can be seen that td performs better than bu and that td order improves over td, but geometric features have a lesser effect than the identification of semantic roles (td order).

Grounding in features
Hypotheses With the aim of evaluating what top-down information contributes to the grounding of words, we examine the following hypotheses:
H1 s-features contribute to predicting spatial relation words.
H2 Without top-down TARGET-LANDMARK role assignments to each region, attention is uniformly distributed over region choices at the beginning of a sequence generation.
Method In order to check the contribution of each feature from different modalities to the prediction of each word, we look at the adaptive attention on each feature at the point of predicting the word 7. Since feature vectors are not normalised against the number of features of each modality, we first multiply each attention measure by the magnitude of the feature vector, and then normalise it to sum to 1 again:
$$\beta_{t, f_i} = \frac{\alpha_{t, f_i} \, \lVert f_i \rVert}{\sum_{j} \alpha_{t, f_j} \, \lVert f_j \rVert} \qquad (1)$$
where $t$ refers to the time step in the word sequence, and $f_i$ is the feature to which the attention $\alpha_{t, f_i}$ is applied. We report the average $\beta_{t, f_i}$ over the instances in the validation dataset. Figure 9 shows $\beta$ on two examples in three models. For each word, the bar chart is divided between four features (in Figure 3e): (1) target $v_{obj_1}$, (2) landmark $v_{obj_2}$, (3) s-features for bounding boxes, and (4) contextualised embeddings $h_l$.
7 In this experiment, we do not check if the estimated likelihood for the correct word is the highest predicted score. The generated descriptions may still be acceptable with an alternative spatial relation. Furthermore, in the following analysis we report the attention over semantic roles and not individual words.
Figure 9: $\beta$ is plotted in bar charts for each word. (a) td order +VisKE (b) td +VisKE (c) td. The values of $\beta$ for each word that constitutes the description referring to each bounding box region are given in the images.
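The rescaling described in the Method paragraph (multiply each attention weight by the magnitude of its feature vector, then renormalise so the values sum to 1) can be sketched for a single time step. The function name is hypothetical and the magnitude is assumed to be the Euclidean norm.

```python
import math
from typing import List, Sequence


def normalised_attention(alphas: Sequence[float],
                         features: Sequence[Sequence[float]]) -> List[float]:
    """Rescale attention weights by feature magnitude and renormalise to sum to 1.

    alphas[i] is the attention on feature vector features[i] at one time step.
    The magnitude is ASSUMED to be the Euclidean (L2) norm of the vector.
    """
    norms = [math.sqrt(sum(v * v for v in f)) for f in features]
    weighted = [a * n for a, n in zip(alphas, norms)]
    total = sum(weighted)
    if total == 0:
        return [0.0 for _ in weighted]  # degenerate case: all features are zero vectors
    return [w / total for w in weighted]


# Four features per word: target, landmark, s-features, contextualised embedding (toy values).
scores = normalised_attention(
    [0.25, 0.25, 0.25, 0.25],
    [[3.0, 4.0], [0.0, 5.0], [6.0, 8.0], [0.0, 0.0]],
)
```

With equal raw attention, the feature with the largest magnitude receives the largest share, which is why the raw weights alone would not be comparable across modalities.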
After measuring the normalised attention on each feature according to Equation 1, we report the average of attentions on each token at that time step of the word sequence. We also group the tokens based on their semantic role in the triplets and report the average on these tokens for a given role.

Results
The average of attentions over triplets of tokens is plotted in Figure 10. The behaviour of attentions on word sequences in the four models is given in Figure 11. In the models without top-down semantic role assignment, only the model with +VisKE features has the expected attention on target and landmark, but there is no attention on the s-features. In the models with top-down semantic role assignment, the model with VisKE s-features has higher attention on s-features when predicting a relation word (H1). A similar situation is observable over word sequences in Figure 11. Without prior semantic role assignment the model is more confused about how to attend to the target or the landmark (H2). Finally, note that geometric VisKE s-features help in predicting the TARGET-LANDMARK roles when these are not assigned top-down.

Related Work
Generating referring expressions Generating locative expressions is part of the general field of generating referring expressions (Dale and Reiter, 1995; Krahmer and van Deemter, 2011) with applications such as describing scenes (Viethen and Dale, 2008) and images (Mitchell et al., 2012). The research on describing visible objects (Mitchell et al., 2013) and human-robot dialogue (Kelleher and Kruijff, 2006) raised questions about grounding relations in a hierarchical representation of context. The application of neural language models and the use of convolutional neural networks for encoding visual features is an open question in interactive GRE tasks.
Encoder-decoder models with attention Recently, several methods have been introduced that focus on finding better neural architectures for generating image descriptions based on pre-trained convolutional neural networks. Karpathy and Fei-Fei (2015) align descriptions with images. Vinyals et al. (2015) introduce an encoder-decoder framework. Xu et al. (2015) improve this approach with spatial attention. Lu et al. (2017) introduce adaptive attention that balances language and visual embeddings. The attention measure provides an explanation of how each modality contributes to language generation in encoder-decoder architectures. Based on the attended features the performance of these models can be examined (Liu et al., 2017). In our paper, we develop a model similar to the adaptive attention which exploits its expressive aspects as a degree of grounding in different features.
Outputs of external models as top-down features In another line of work, the output of the bottom-up visual understanding is used as top-down features for language generation. For example, an object detection pipeline is combined explicitly with language generation. This procedure was previously used in template-based language generation (Elliott and Keller, 2013; Elliott and de Vries, 2015). There have been attempts to combine this process with neural language models with attention. For example, You et al. (2016) extract candidate semantic attributes from images (e.g. a list of objects in the scene), then the attention mechanism is used to learn to attend to them when generating tokens of image descriptions. Instead of semantic attributes, Anderson et al. (2018) use a region proposal network from a pre-trained object detection model to extract the generated bounding box regions as possible locations of visual clues. Then, the attention model learns to attend to the visual features associated with these regions. The idea of using an object detection module is also used in Johnson et al. (2016) where Faster R-CNN (Ren et al., 2015) is used to find regions of interest. Instead of assigning one object class to each region, a full description is generated for each proposed region. In all of these models, an image understanding module extracts some proposed representations and then this knowledge is used as a top-down representation of the scene to generate an image description. In this paper, we investigate the extent to which different spatial information, provided as top-down knowledge, facilitates generating descriptions of scenes with neural language models.
Modular design Our paper examines strategies that can demonstrate language grounding within a neural architecture. Studies of neural architectures such as (Tanti et al., 2018b) provide analytical insight into the differences between multimodal architectures for language generation. The modular design is mostly used in language parsing tasks such as (Hu et al., 2017) where object recognition, localisation and relation recognition are separate modules for grounding different parts of image descriptions in images in order to solve tasks such as visual question answering. In our paper, the modularity of the neural architecture is not focused on parsing text but used to incrementally demonstrate the contribution of each introduced modality to language generation.

Multimodal embeddings
There are related studies on learning multimodal embeddings (Kiros et al., 2014; Lazaridou et al., 2015) to represent vision and language in the same semantic space. The focus of our paper is to investigate how these different modalities complement each other in neural language generation. In our models, the semantic representations of spatial relations are considered as a separate modality extending both the language and visual embeddings. There are related studies on encoding spatial knowledge in feature space in order to predict spatial prepositions (Ramisa et al., 2015) or on prepositional embeddings which can predict regions in space. In our paper, we investigate the degree to which each embedding contributes to language generation within the neural language model.

Conclusions
We explored the effects of encoding top-down spatial knowledge in a bottom-up trained generative neural language model for the image description task. The findings of the experiments in this paper are as follows: (1) Overall, integration of top-down knowledge has a positive effect on grounded neural language models for this task.
(2) When combining bottom-up language grounding with top-down knowledge represented as different features, different types of top-down knowledge make different contributions to grounded language models. The general picture is further complicated by the fact that different spatial relations are biased towards different knowledge. (3) The performance gain from the geometric features extracted from bounding boxes (s-features) is smaller than initially expected, with two possible explanations related to the nature of the corpora of image descriptions: (i) The corpus contains images of typical scenes where the relation of objects with each other is predictable from the description and therefore is captured in the language model; (ii) As annotators are focused on describing "what is in the image" rather than "where things are spatially in relation to each other", descriptions of geometric spatial relations which refer to the locational information are rare in the corpus. (4) The majority of attention is placed on the language model, which demonstrates that it provides significant information when generating spatial descriptions. While this may be a confounding factor if the visual features are ignored, the language model also encodes useful information about spatial relations as discussed in (Kulkarni et al., 2011).
The results open several questions about grounded language models. Firstly, the degree to which the system is using each modality can be affected by dataset biases and this should be taken into account in the forthcoming work. Given this bias, learning a single common language model for descriptions of spatial scenes is insufficient as different kinds of knowledge may come to focus in different interactional scenarios. This further supports the idea that top-down integration of knowledge is required where we hope that the models will learn to attend to the appropriate features. Secondly, our investigation leaves open the question whether the representations both visual and geometric that we use are good representations for learning spatial relations. Further work will include a focused investigation of what kind of geometric relations they encode.