The Use of Object Labels and Spatial Prepositions as Keywords in a Web-Retrieval-Based Image Caption Generation System

In this paper, a retrieval-based caption generation system that searches the web for suitable image descriptions is studied. Google’s reverse image search is used to find potentially relevant web multimedia content for query images. Sentences are extracted from the retrieved web pages and scored by likelihood to select the single best description. The search mechanism is modified by replacing the caption automatically generated by Google with a caption composed of object labels and spatial prepositions, submitted as the textual part of the query alongside the image. The object labels are obtained using an off-the-shelf R-CNN, and a machine learning model is developed to predict the prepositions. The effect of this generated text on the caption generation system’s performance is investigated. Both human evaluations and automatic metrics are used to assess the retrieved descriptions. Results show that the web-retrieval-based approach performed best when describing single-object images with sentences extracted from stock photography websites, whereas images containing two objects were better described with template-generated sentences composed of object labels and prepositions.


Introduction
The automatic generation of concise natural language descriptions for images is currently gaining immense popularity in both the Computer Vision and Natural Language Processing communities (Bernardi et al., 2016). The general process of automatically describing an image fundamentally involves the visual analysis of the image content such that a succinct natural language statement, verbalising the most salient image features, can be generated. In addition, natural language generation methods are needed to construct linguistically and grammatically correct sentences. Describing image content is useful in applications such as image retrieval based on detailed and specific image descriptions, caption generation to enhance the accessibility of existing image collections and, most importantly, as an assistive technology for visually impaired people.

Research on automatic image description generation can be organised into three categories (Bernardi et al., 2016). The first group generates textual descriptions from scratch by analysing the composition of an image in terms of image objects, attributes, scene types and event actions, extracted from the image's visual features. The other two groups describe images by retrieving sentences either from a visual space composed of image-description pairs or from a multimodal space that combines images and sentences in one single space. As opposed to direct-generation-based methods, the latter two approaches generate less verbose and more human-like descriptions. In this paper, a web-retrieval-based system that exploits the ever-growing vision-text content of the Web is studied, while exploring how object labels and prepositions affect the retrieval of image descriptions.

This paper is organised as follows: section 2 gives an overview of existing image caption algorithms. Section 3 outlines the problem definition and section 4 presents a web-retrieval-based framework, followed by its implementation details in section 5. The dataset and evaluation are discussed in sections 6 and 7 respectively. The results are presented in section 8, followed by a discussion in section 9. Finally, section 10 concludes with the main observations and future directions.

Related Work
Direct-generation models (Fang et al., 2015; Yang et al., 2011) exploit the image's visual information to derive a description by driving a natural language generation model such as n-grams, templates or grammar rules. Despite producing correct and relevant image descriptions, this approach tends to generate verbose and non-human-like captions.

The second and most relevant group of models to this paper tackles the problem of textually describing an image as a retrieval problem. Some attempts make use of pre-associated text or metadata to describe images. For instance, Feng and Lapata (2010) generated captions for news images using extractive and abstractive generation methods that require relevant text documents as input to the model. Similarly, Aker and Gaizauskas (2010) relied on GPS metadata to access relevant text documents for generating captions for geo-tagged images. Other models formulate descriptions by finding images visually similar to the query image in a collection of already-annotated images. Query images are then described either by (a) reusing the whole description of the most visually similar retrieved image, or by (b) associating relevant phrases from a large collection of image-description pairs (Ordonez et al., 2016).

Retrieval models can be further subdivided based on the technique used for representing and computing image similarity. The first subgroup uses a visual space for finding related images, while the second uses a multimodal space combining textual and visual image information. The first subgroup (Ordonez et al., 2016; Gupta et al., 2012; Mason and Charniak, 2014; Yagcioglu et al., 2015) first extracts visual features from the query images. Based on a visual similarity measure over the extracted features, a candidate set of related images is retrieved from a large collection of pre-annotated images. Retrieved descriptions are then re-ranked by further exploiting the visual and textual information extracted from the candidate set of similar images. Conversely, retrieving descriptions from a multimodal space relies on a joint space between visual and textual data constructed from a collection of image-description pairs. For example, in Farhadi et al. (2010), image descriptions were retrieved from a multimodal space consisting of <object, action, scene> tuples. More recently, deep neural networks were introduced to map images and corresponding descriptions into one joint multimodal space (Socher et al., 2014; Kiros et al., 2014; Donahue et al., 2015; Karpathy and Li, 2015; Chen and Zitnick, 2015).

Problem Definition
Image caption generators are designed to associate images with corresponding sentences; hence they can be viewed in terms of an affinity function f(i, s) that measures the degree of correlation between an image i and a sentence s. Given a set of candidate images I_cand annotated with corresponding candidate sentences S_cand, typical retrieval-based caption generation methods describe an image by reusing a sentence s ∈ S_cand. The selected sentence is the one that maximises the affinity function f(i_q, s) for a given query image i_q. In contrast, generation-based image descriptors attempt to construct a novel sentence s_n composed of image entities and attributes.
The system described in this paper extracts sentences from a collection of web pages W, rather than from a limited set of candidate human-authored image descriptions S_cand, as done in most existing retrieval-based studies. Websites containing images visually similar to the query image are found using search-by-image technology. The intuition behind this method is that the ever-growing Internet-based multimedia data is a readily-available data source, as opposed to the purposely constructed and limited image-description datasets used in many studies. The web search for a query image can thus be thought of as dynamically constructing a small, specialised dataset for that image.
The suggested framework starts by generating a simple image description based on the image's visual entities and their spatial relationships. This simple description is then used as keywords to drive and optimise a web-data-driven retrieval process. The latter is primarily intended to retrieve the most relevant sentence from the set of candidate web pages W by utilising the functionality offered by a search-by-image algorithm. This strategy is adopted under the assumption that web pages featuring images visually similar to a query image i_q can contain sentences which can be effectively re-used to describe i_q.

Figure 1: The proposed web-retrieval-based system designed in two stages. The query image i_q is first described by the keywords generated by the first stage. These are then used to retrieve image descriptions from a collection of web pages W. The best sentence s_b is extracted from the best text document T_wb, with respect to the global word probability distribution P(T) and the query image i_q.

Image Description Framework
The proposed generation-retrieval-based approach is decomposed into two phases. The first, generation stage of the framework is mainly intended to generate simple image descriptions that serve as keywords for the second, retrieval phase. By exploiting the vast amount of image-text data found on the Web, the latter then extracts the most likely sentence for a given query image. A high-level overview of the proposed image description framework is presented in Figure 1.

Generation-based Image Description
The first stage of the image description generation framework analyses the image's visual content and detects the most important image objects. The aim of this step is to detect and annotate image objects with corresponding high-level labels and bounding boxes. In order to describe the spatial relationship between the predominant image objects, various predictive models based on different textual and geometric feature sets were investigated, as described in section 4.2. From this simple generated image description, in the form of an object-preposition-article-object keyword structure, the framework then drives a web-retrieval-based process. This process exploits both the visual aspect of the query image and the linguistic keywords generated by the first stage of the pipeline.
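The object-preposition-article-object keyword structure can be sketched as follows. This is a minimal illustration only: the function name and the article-selection heuristic are assumptions, not the authors' code.

```python
def build_keywords(subject, preposition, obj):
    """Compose an object-preposition-article-object keyword string,
    e.g. ('dog', 'on', 'sofa') -> 'dog on a sofa'.

    The naive vowel-based article heuristic is an illustrative assumption."""
    article = "an" if obj[0].lower() in "aeiou" else "a"
    return f"{subject} {preposition} {article} {obj}"
```

Such a string would then replace Google's automatically generated caption as the textual part of the search-by-image query.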

Preposition Predictive Model
The generation of prepositions was cast as a prediction problem over geometric and encoded textual features. Four different predictive models based on separate feature sets were analysed. This experiment confirmed that the Random Forest model obtained the best preposition prediction accuracy. This was achieved when predicting prepositions from word2vec (Mikolov et al., 2013) embeddings of the object labels combined with the geometric feature sets used by Muscat and Belz (2015) and Ramisa et al. (2015). This setup marginally outperformed the best preposition prediction accuracy achieved by Ramisa et al. (2015) when trained and evaluated on the same ViSen MSCOCO Prepositions testing set with original object labels. Results can be found in Table 1.
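As an illustration of this setup, the sketch below trains a scikit-learn Random Forest on synthetic stand-ins for the real features: low-dimensional random vectors replace the 300-d word2vec label embeddings, and a three-value geometric vector loosely imitates the Muscat and Belz (2015) / Ramisa et al. (2015) feature sets. The feature dimensions, geometric cues and all data here are assumptions for demonstration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
prepositions = ["on", "under", "next to"]

# Synthetic training data: each example concatenates two 8-d "word vectors"
# (stand-ins for the word2vec label embeddings) with three geometric cues
# between the two object bounding boxes: horizontal centroid offset,
# vertical centroid offset and area ratio.
X, y = [], []
for _ in range(300):
    label_vecs = rng.normal(size=16)
    p = rng.integers(len(prepositions))
    # make the vertical offset weakly predictive of the preposition
    dy = [-1.0, 1.0, 0.0][p] + rng.normal(scale=0.3)
    geom = [rng.normal(scale=0.2), dy, 1.0 + rng.normal(scale=0.1)]
    X.append(np.concatenate([label_vecs, geom]))
    y.append(prepositions[p])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```

In the paper, the same classifier family is instead trained on real word2vec embeddings of the detected object labels together with the published geometric feature sets.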

Retrieval-based Image Description
The aim of the second phase of the proposed framework is to retrieve descriptions based on the visual aspect of a query image and its corresponding simple generated image description, as discussed in section 4.1. This phase is designed to find a set of web pages containing images that are visually related to the query image. This search functionality is freely available from the two currently dominant search engines, Google and Bing. These two proprietary image-search algorithms are able to retrieve visually similar images, and may therefore be used for collecting web pages featuring such images.

From the retrieved collection of web pages featuring images visually similar to the query image, this phase is designed to extract the best sentence that can be used to describe the query image. Based on the idea that websites usually describe or discuss the images they embed, it is assumed that this stage is capable of finding human-like sentences describing the embedded images which can be re-used to describe the query images.

Given a collection of candidate web pages W with embedded visually similar images, this phase extracts the main text T_wi from each corresponding web page w_i ∈ W. This is carried out by analysing the Document Object Model (DOM) of each web page, as well as by statistically distinguishing between HTML and textual data. Moreover, this stage discards any boilerplate text normally found in web pages, including navigational text and advertisements, by exploiting shallow text features (Kohlschütter et al., 2010). After transforming the set of web pages W into the corresponding text documents T, this stage computes the word probability distribution P(T_wi) for each T_wi, disregarding any stop words.
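A minimal sketch of shallow-feature boilerplate filtering is given below. The two features (block word count and link density) follow the spirit of Kohlschütter et al. (2010), but the function, its thresholds and the decision rule are illustrative simplifications, not the published classifier.

```python
def is_boilerplate(block_text, link_words, min_words=15, max_link_density=0.33):
    """Flag a DOM text block as boilerplate using two shallow text features:
    very short blocks and blocks dominated by anchor text (navigation bars,
    advertisements) are discarded. Thresholds are illustrative assumptions.

    block_text -- the plain text of one DOM block
    link_words -- how many of its words sit inside anchor (<a>) tags
    """
    words = block_text.split()
    if len(words) < min_words:
        return True  # too short to be main content
    return (link_words / len(words)) > max_link_density
```

In practice the paper delegates this step to the boilerpipe toolkit, which combines several such features in a trained classifier.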
The text found in each text document T_wi is combined into one text collection T, and the probability distribution P(T), representing the probabilities of all words contained in collection T, is calculated. The top k most probable words from each distribution P(T_wi) are considered to find the most relevant text document T_wb, from which the best sentence s_b describing the query image i_q is extracted. Specifically, the best text document is selected by the following maximising function over each text document probability distribution P(T_wi), with respect to the global word probability distribution P(T):

T_wb = argmax_{T_wi} Σ_{n=1}^{k} P(T)(w_i,n)

where w_i,n represents the nth most probable word of the probability distribution P(T_wi).
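The document-selection step can be sketched as follows. The stop-word list, tokenisation and function names are simplifying assumptions; the scoring logic (rank each document by the global probability mass of its own top-k words) follows the description above.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "to"}  # illustrative

def word_distribution(text):
    """Word probability distribution of a text, stop words removed."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    total = max(len(words), 1)
    return {w: c / total for w, c in Counter(words).items()}

def best_document(texts, k=5):
    """Score each document by summing the global probability P(T) of its
    top-k most probable words; return the highest-scoring document."""
    global_dist = word_distribution(" ".join(texts))  # P(T) over the collection

    def score(text):
        dist = word_distribution(text)                # P(T_wi)
        top_k = sorted(dist, key=dist.get, reverse=True)[:k]
        return sum(global_dist.get(w, 0.0) for w in top_k)

    return max(texts, key=score)
```

Documents whose dominant vocabulary is rare in the overall collection (e.g. off-topic pages) receive low scores and are effectively discarded.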
This strategy is used to eliminate documents that are unlikely to provide correct descriptions for query images. A similar approach is used to retrieve the best sentence s_b that could potentially describe the query image. The technique used to select the most appropriate sentence starts by extracting the set of candidate sentences S_cand from the selected best document T_wb. The second step weights each sentence s_i ∈ S_cand by the summation of how probable each of its words is with respect to the global word probability distribution P(T). Therefore, s_b is retrieved by maximising the following formula:

s_b = argmax_{s_i ∈ S_cand} (1/|s_i|) Σ_{n=1}^{|s_i|} P(T)(s_i,n)

where s_i,n represents the nth word found in sentence s_i ∈ S_cand extracted from the best document T_wb, and |s_i| represents the number of words found in sentence s_i.
To further enhance the contextual reliability of the selected sentence, the approach used to retrieve image descriptions is combined with the image's visual aspect. This is accomplished by weighting the visible object class labels according to their image predominance level. The area of each visible image entity, relative to the entire query image i_q, is used to prioritise visible image objects. Therefore, the best sentence s_b is retrieved by combining the knowledge extracted from the most probable words in P(T) with the visual aspect of the query image i_q, through the following formula:

s_b = argmax_{s_i ∈ S_cand} (1/|s_i|) Σ_{n=1}^{|s_i|} [ P(T)(s_i,n) + R(s_i,n, i_q) ]

where R is a function which computes the relative area of the object whose class label matches the nth word s_i,n of sentence s_i, in the context of image i_q.
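The combined sentence-scoring step can be sketched as follows. The additive combination of the two signals and all names here are assumptions; the paper's formula combines the same two quantities (global word probability and relative object area) but its exact form is not reproduced.

```python
def best_sentence(sentences, global_dist, object_areas):
    """Score each candidate sentence by its mean global word probability,
    boosted by the relative image area of any detected object labels it
    mentions, and return the highest-scoring sentence.

    global_dist  -- P(T), word -> probability over the whole collection
    object_areas -- detected object label -> area relative to the image
                    (plays the role of R; zero for non-object words)
    """
    def score(sentence):
        words = sentence.lower().split()
        total = sum(global_dist.get(w, 0.0) + object_areas.get(w, 0.0)
                    for w in words)
        return total / max(len(words), 1)

    return max(sentences, key=score)
```

With this weighting, a sentence mentioning a large detected object (say, a dominant "dog") outranks an otherwise equally probable sentence about an absent object.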

Implementation
The image description generation framework was modularised and implemented in two stages. To detect the main image objects, the first stage employs the two-module Faster R-CNN proposed by Ren et al. (2015). The first module is a deep fully convolutional network designed to propose regions, while the second module is a detector that uses the proposed regions to detect image objects enclosed in bounding boxes. The architecture is trained end-to-end as a single network by sharing convolutional features. The deep VGG-16 model (Simonyan and Zisserman, 2014), pre-trained on the MSCOCO dataset (Lin et al., 2014), was utilised to detect image objects with corresponding class labels and bounding boxes. These were then used to infer the spatial relationships between the detected image objects, as discussed in section 4.2.
Using the linguistic keywords generated by the first stage, the second part of the framework retrieves the most probable sentence from a set of relevant web pages featuring visually similar images. The set of web pages is collected using the free functionality offered by Google's proprietary Search By Image technology. For an uploaded query image, this functionality returns visually similar images, retrieved on the basis of extracted image visual features and textual keywords automatically generated by the same functionality. The websites of the visually similar images are then retrieved from the URLs bound to each returned image. Query images were automatically uploaded, and websites featuring visually similar images retrieved, by using Selenium to automate the headless PhantomJS browser.

In this study, it was shown how object labels connected with spatial prepositions affect the retrieval search performed by Google's search-by-image algorithm. This was accomplished by replacing Google's keywords with the object labels and prepositions generated by the first stage of the proposed framework. Furthermore, this study also investigated whether stock photography websites could improve the retrieval search of the designed framework. The retrieval of websites featuring stock photos was achieved by concatenating the phrase "stock photos" with the keywords extracted from the visual aspect of the query image. To detect and extract the main textual content of each web page, the boilerpipe toolkit was employed. From the set of extracted text documents, the most probable sentence that best describes the query image is then retrieved, as discussed in section 4.3.

Dataset
To evaluate the proposed image description framework, a specific subset of human-annotated images featured in the MSCOCO testing set was used. Since the preposition prediction task targets prepositions between two image objects, describing images having exactly two objects was of particular interest to this study. Therefore, the following steps were carried out to select such images. From the ViSen MSCOCO testing set, 1975 instances having strictly one preposition between two image objects were found and extracted. From this subset, 1000 images were randomly selected. Since images may contain background objects, the same object detector employed in the proposed framework was used for detecting objects. The Faster R-CNN found 128 images containing one image object and 438 images containing exactly two image objects, while the remaining 434 images contained more than two. For the evaluation of this framework, only images composed of one or two image objects were considered. Therefore, the framework was evaluated on a dataset of 566 images, of which 128 contain a single object and the remaining 438 contain exactly two objects.

Evaluation
Both human and automatic evaluation were used to assess the web-retrieval-based framework. The automatic evaluation was performed using existing metrics intended to measure the similarity between generated descriptions and corresponding human ground-truth descriptions: BLEU (Papineni et al., 2002), ROUGE-L (Lin and Hovy, 2003), METEOR (Denkowski and Lavie, 2014) and CIDEr (Vedantam et al., 2015). To complement the automatic evaluation, human judgments of the image descriptions were obtained from a qualified English teacher. Since the human evaluation process is considerably time-consuming, human judgments were collected for a sample of 200 images, split equally between single- and double-object images. The human evaluation criteria proposed by Mitchell et al. (2012) were used to evaluate the generated descriptions: the grammar, main aspects, correctness, order and human-likeness of the descriptions were each rated on a five-point Likert scale.
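To make the n-gram-overlap idea behind these metrics concrete, the sketch below implements unigram BLEU (BLEU-1): modified unigram precision with a brevity penalty, following Papineni et al. (2002). It is a minimal single-metric sketch; real evaluations use the official multi-n-gram implementations of BLEU, ROUGE-L, METEOR and CIDEr.

```python
import math
from collections import Counter

def bleu1(candidate, references):
    """BLEU-1: clipped unigram precision times a brevity penalty."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]

    # clip each candidate unigram count by its maximum count in any reference
    max_ref = Counter()
    for ref in refs:
        for w, c in Counter(ref).items():
            max_ref[w] = max(max_ref[w], c)
    clipped = sum(min(c, max_ref[w]) for w, c in Counter(cand).items())
    precision = clipped / max(len(cand), 1)

    # brevity penalty against the closest reference length
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / max(len(cand), 1))
    return bp * precision
```

A perfect match scores 1.0, while a candidate much shorter than its closest reference is penalised even if every word it contains is correct.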

Results
The framework was evaluated at each phase of its pipeline, as described in Table 2. The results are given in Tables 3 and 4 for single- and double-object images respectively. The generation phase of the framework, which describes images with just object labels, is represented by G, while the standalone retrieval-based approach, which uses Google's automatically generated keywords, is represented by R. When describing single-object images, the joint generation-retrieval stage that uses the prototype's keywords is represented by GR. When describing double-object images, the generation-retrieval process is denoted by GPR, given that it uses both object labels and prepositions as keywords. Moreover, the results obtained when the retrieval phase considers stock photography websites are denoted by the letter S. The retrieval-based stages are specified by two parameters, W and F. The latter represents the number of text files analysed from the corresponding websites, whereas W represents the number of most probable words used for the selection of the best sentence from a set of web pages. A grid search was performed to find these parameters for each configuration. The same notation is used for the human evaluation results. Typical image descriptions generated by the proposed web-retrieval-based image caption generation system can be found in Figure 2.

Discussion
The automatic evaluation showed that single-object images were best described by generation-retrieval from stock photography websites (GRS). This outperformed the one-word descriptions of the generation-based configuration (G), as well as the retrieval-based (R) setup. The latter result confirms that replacing Google's Search By Image captions improved the retrieved descriptions, suggesting that more relevant images were returned by Google when its automatic caption was replaced with object labels. Conversely, double-object images were best described by the generation-preposition (GP) configuration. Although replacing Google's Search By Image keywords improved the results, the simple descriptions based on object labels connected with spatial prepositions were more accurate. Automatic evaluation also showed that the web-retrieval approach (GRS) performs better on double-object images than on single-object images. This study also showed that the retrieval process performs better without using prepositions as keywords. This is likely because prepositions constrain the search results returned by Google, since most descriptive text available on the Web uses verbs rather than prepositions.
The human evaluation results for the single-object images are presented in Table 5. The generation-based (G) descriptions obtained a grammatical median score of 1, confirming that one-word descriptions do not produce grammatically correct sentences. The results also confirm that the object detector used accurately identifies the dominant objects in an image. Given that humans are unlikely to describe an image with a single word, this stage obtained a low human-likeness score of 2. The retrieval method applied to stock photography websites (RS) led to grammatical improvement in the generated descriptions, which were grammatically rated with a median score of 3. However, the results show that the retrieval method decreases the relevance of the retrieved descriptions: despite generating grammatically sound sentences with better human-likeness, the human evaluation revealed a degree of inconsistency between the descriptions and their corresponding images. When combining the generation (G) and retrieval (RS) approaches, the grammar, order and human-likeness improved for single-object images.
Table 5 also demonstrates that the generation-preposition (GP) configuration generated the best descriptions for double-object images. Furthermore, these results confirmed that the retrieval (RS) approach improves when Google's caption is replaced with object labels. The human evaluation also established the ineffectiveness of the retrieval stage when combined with the generation-prepositions stage (GPRS). The same table also confirmed that the web-retrieval approach described double-object images better than single-object images.
Figure 2: (a) Correct and (b) incorrect descriptions generated by the web-retrieval-based framework.

(a) Correct: "Vase and clock in a window sill." / "Person launching a kite." / "High angle view of a person surfing in the sea." / "A shot of a kitchen microwave oven." / "Person on skateboard skateboarding in action sport."

(b) Incorrect: "Cat chasing a mouse." / "Young person jogging outdoor in nature." / "Person sleeping." / "Dog in hat." / "italy, gressoney, person jumping ski, low angle view."

Conclusion and Future Work
This paper investigated the use of object labels and prepositions as keywords in a web-retrieval-based image caption generator. By employing object detection technology combined with a preposition prediction module, keywords were extracted in the form of object class labels and prepositions. The proposed retrieval approach is independent of any purposely human-annotated image datasets: images were described by extracting sentences from websites featuring images visually similar to the query image, with the search aided by the generated keywords. This approach was particularly effective when describing single-object images, especially when extracting sentences from stock photography websites. Despite retrieving relevant descriptions for both single- and double-object images, object labels connected with spatial prepositions yielded more accurate descriptions of double-object images. Although Google's Search By Image was enhanced by replacing its predicted image annotations with object labels, further work on using a wider variety of keywords, such as verbs, could improve the results. It is also worth studying whether linguistic parsing can be used to assess the quality of sentences during the caption extraction phase, to increase the likelihood of choosing better sentences.