Don’t Mention the Shoe! A Learning to Rank Approach to Content Selection for Image Description Generation

We tackle the sub-task of content selection as part of the broader challenge of automatically generating image descriptions. More speciﬁcally, we explore how decisions can be made to select what object instances should be mentioned in an image description, given an image and labelled bounding boxes. We propose casting the content selection problem as a learning to rank problem, where object instances that are most likely to be mentioned by humans when describing an image are ranked higher than those that are less likely to be mentioned. Several features are explored: those derived from bounding box localisations, from concept labels, and from image regions. Object instances are then selected based on the ranked list, where we investigate several methods for choosing a stopping criterion as the ‘cut-off’ point for objects in the ranked list. Our best-performing method achieves state-of-the-art performance on the ImageCLEF2015 sentence generation challenge.


Introduction
In recent years, there has been significant interest in developing systems capable of generating literal, sentential descriptions of images (a boy playing with a frisbee in the park). The task poses an interesting and difficult challenge for natural language generation, and is important for improved text and image retrieval. The image description task could potentially advance research and provide insights into multimodal natural language generation, e.g. building language models of how humans naturally describe the visual world.
A standard paradigm for approaching this task is to first detect instances of pre-defined concepts in the image to be described, and then to reason about the detected concepts to generate image descriptions. Thus, such approaches may involve various components of a standard Natural Language Generation pipeline (Reiter and Dale, 2000), such as document planning (including content determination), microplanning (lexicalisation/referring expression generation) and realisation.
In this paper, we concentrate on a specific subproblem in such an image description generation pipeline. More specifically, we explore the content selection problem proposed by . In this setting, object instances are assumed to have already been localised in an image. Thus, given gold standard labelled bounding boxes of object instances in an image, the task is to select the appropriate bounding box instances to be mentioned in the eventual image description that is to be generated (see Figure 1 for an example). To our knowledge, there has been minimal work specifically tackling the content selection problem. However, the task is important to image description generation as not all entities depicted in an image will be mentioned by humans. For example, a fork lying on a table probably will not be mentioned in a picture of a family having dinner in the kitchen. Determining which entity will be described thus poses an interesting research question, and may provide insights into how humans decide what is important enough to be described in an image description.
Thus, the main objective of this paper is to propose methods for learning to predict which of the object entities depicted in an image will be mentioned in a human-authored description of the image. Our main contribution is to develop a ranking-based content selection system that exploits stronger textual and image features from data for the content selection problem than those used in the baselines proposed in . We propose casting the content selection problem as a learning to rank problem. More specifically, given a set of labelled bounding boxes in an image, bounding box instances are ranked by how likely they are to be mentioned in a corresponding human description. However, as we are interested in both precision and recall, we do not require all labelled bounding boxes to be ranked; for example, object instances that are unlikely to be mentioned in the description need not be ranked. Thus, we also propose various 'stopping criteria' to automatically select only relevant instances based on the rankings. Our hypothesis is that humans inherently prioritise important entities based on background knowledge and other cues, and we will thus be able to exploit this to tackle the content selection problem.

Figure 1: Given labelled bounding boxes as input, we tackle the content selection task, i.e. deciding which bounding box instances should be selected to be mentioned in the corresponding image description. This is an important task as humans do not mention everything that is depicted in an image. We propose casting the content selection problem as a ranking task, that is, to order the bounding box instances by how likely they are to be mentioned in a human-authored image description.

Overview
We discuss related work on the content selection problem in Section 2. In Section 3, we present our proposed approach to treat content selection as a learning to rank problem, discussing the formulation of the task (Section 3.1), features derived from bounding box localisations, concept labels and visual appearances (Section 3.2), and the various ranking algorithms explored (Section 3.3). In Section 3.4, we also propose some automatic stopping criteria to select important objects to be described from the ranked list. Experimental results are presented in Section 4, both when concatenating all features (Section 4.2) and when treating individual features independently (Section 4.3). We also provide a summary of our feature ablation study in Section 4.4, and present conclusions in Section 5.

Related work
Image description generation. Various approaches have been proposed in the literature for the task of generating image descriptions, for example (Yao et al., 2010; Kulkarni et al., 2011; Yang et al., 2011; Karpathy and Fei-Fei, 2015; Donahue et al., 2015; Vinyals et al., 2015), among others. Most previous work concentrates on solving the problem 'end-to-end', that is, generating a description given an image as input. Such systems are also evaluated in an extrinsic manner, by comparing output image descriptions to multiply-annotated gold standard descriptions of the same image using global measures such as BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), Meteor (Denkowski and Lavie, 2014) or CIDEr (Vedantam et al., 2015). Whilst such evaluation methodologies are useful for evaluating image description generation systems as a whole (how similar is the generated description to human-authored descriptions?), they make it hard to identify which components of the generation process contribute to any performance gains or losses.  propose evaluating image description generation systems in a fine-grained manner, i.e. evaluating each component of the image description generation pipeline independently. To demonstrate this, they proposed the task of content selection as a precursor to generating image descriptions and performed fine-grained evaluation on this specific task.

Content selection. There has been some work on selecting objects that are important or interesting in an image. Elazary and Itti (2008) propose learning to predict object interestingness by the order in which objects are labelled by annotators in LabelMe. Spain and Perona (2010) propose learning to predict object importance, by asking multiple annotators (25 per image) to name 10 objects they see in each image. The annotations are then aggregated: important objects are those that are mentioned by many annotators.
Most related to our work is , who explore factors (compositional, semantic, and contextual) that can be used to predict what is being described in an image. For prediction, they focus on a binary prediction problem (is this object described: yes or no?) and treat bounding boxes as independent of each other. In our case, we treat other bounding boxes as context, as a frequently occurring object may not be mentioned when co-occurring with some other object.  tackle an inverse problem: learning to predict segments of Flickr captions (noun phrases) that are 'visual', i.e. predicting whether a noun phrase in the caption is depicted in the image.
There has also been some work on measuring image memorability (what makes an image memorable to humans?), for example, Isola et al. (2011), among others. However, most work deals with memorability at image-level, rather than object level. Dubey et al. (2015) tackle image memorability at object level, that is, what objects are memorable (worth remembering) to a person in an image. This acts as a precursor to the content selection problem of choosing what to describe in an image description. Ortiz et al. (2015) treat image description generation as a Statistical Machine Translation (SMT) task, and concentrate on describing abstract, clipart scenes. Part of their pipeline involves a content selection module where rankings of object pairs are optimised as an integer linear programming (ILP) problem, allowing object pairs that frequently co-occur and are close to each other to be ranked higher than those that are not. Our approach is not constrained to pairwise features, and automatically learns to optimise rankings across all instances directly from a training set, using arbitrary feature vectors.
Directly related to our work is , who propose some baselines for content selection assuming 'clean' visual input is provided in the form of bounding boxes labelled with concepts. The baselines are based on various textual and visual cues. We aim to move beyond these baselines and attempt to improve the performance of content selection on the same dataset used in their paper.
Learning to rank. Learning to rank is a problem common in the field of Information Retrieval. Many approaches have been proposed to learn to rank instances in a document in order of their relevance to a query. The approaches can generally be divided into three main groups: • Pointwise ranking: Each instance in a document is treated independently of the others.
• Pairwise ranking: The relative rank of pairs of instances is optimised in the objective function.
• Listwise ranking: The ranking of the full list of instances is optimised directly in the objective function.
We refer readers to Li (2011) for a summary of different techniques for learning to rank.

Learning to rank object instances
In this paper, we use the dataset from the ImageCLEF 2015 Scalable Image Annotation, Localization and Sentence Generation challenge . More specifically, we tackle the 'clean track' of the sentence generation task. In this track, participants are provided with images with bounding box instances labelled with a WordNet synset (from 251 possible synset categories). Each image also has 5-51 corresponding descriptions. Each description has been annotated with the correspondence between a bounding box instance and a textual term in the description (e.g. "man" in the description refers to bounding box instance 1 in the image). There are 500 development images and 450 test images. At test time, participants are provided labelled bounding boxes as input, and are asked to produce systems capable of selecting the bounding boxes that are mentioned in the human-authored descriptions.

Problem definition
Let B_i = {b_i1, ..., b_iM_i} denote the set of labelled bounding boxes for image i, where each instance b_ij = (l_ij, c_ij): l_ij is the bounding box localisation (position and size), and c_ij ∈ C is the concept label for bounding box j, with |C| = 251 pre-defined categories. Given the set of input bounding boxes B_i for each image i, the eventual task is to predict the set of bounding box instances that are most likely to be mentioned in the gold standard descriptions. Casting this as a ranking task, we aim to predict the relevance of each bounding box, i.e. how likely it is to be mentioned in the gold standard, and then rank the bounding box instances by their relevance.
As a learning to rank problem, our objective is to learn, from some training data, to predict the relevance of an unseen bounding box instance for a test image, given the other bounding box instances of the same image as well as features x_ij derived from each bounding box instance b_ij.
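As a concrete sketch of this formulation, each bounding box instance can be represented as a record carrying its localisation, concept label and relevance target; ranking an image's instances then amounts to sorting by (predicted) relevance. The field names and example values below are illustrative, not the dataset's actual schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BoundingBox:
    # l_ij: localisation, here as (x, y, width, height) normalised to image size
    x: float
    y: float
    w: float
    h: float
    concept: str            # c_ij: one of the 251 synset categories
    relevance: float = 0.0  # fraction of gold descriptions mentioning the concept

def rank_boxes(boxes: List[BoundingBox]) -> List[BoundingBox]:
    """Order an image's boxes by relevance, most likely to be mentioned first."""
    return sorted(boxes, key=lambda b: b.relevance, reverse=True)

boxes = [
    BoundingBox(0.7, 0.8, 0.1, 0.1, "fork.n.01", relevance=0.1),
    BoundingBox(0.1, 0.2, 0.5, 0.6, "man.n.01", relevance=0.9),
]
assert [b.concept for b in rank_boxes(boxes)] == ["man.n.01", "fork.n.01"]
```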

Features
We explore different features, derived from (i) the bounding box localisation, l_ij; (ii) the concept label, c_ij; or (iii) the visual appearance of the region in image i bounded by l_ij. The features we explore are: • bboxsize: the area of the object bounding box relative to the image.
• bboxdist: distance of the centre of the object bounding box from the image centre. For this paper, we negate the distance to accommodate classifiers that assume positive linear relations.
• textiv: a 251 dimensional one-hot vector with 1 for the matching concept label and 0 for the others.
• textemb: a 300 dimensional synset embedding derived from word2vec pretrained on the Google News Dataset (Mikolov et al., 2013). As each concept label is a WordNet synset, we further fine-tuned the embeddings to obtain synset embeddings in the original word2vec embedding space with AutoExtend (Rothe and Schütze, 2015), where an autoencoder is learnt based on WordNet terms, lexemes and hypernym relations.
• imgemb: a 4,096 dimensional image embedding for the object region enclosed by the bounding box. For this paper we used the penultimate layer (FC7) of the 16-layer variant of VGGNet (VGG-16) (Simonyan and Zisserman, 2014). Intuitively, this feature represents the visual appearance of the region enclosed by the bounding box.
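The localisation and one-hot features above are simple to compute; a minimal sketch (assuming pixel coordinates with the origin at the top-left, which the paper does not specify, and with function names that are ours) might look like:

```python
import math

def bboxsize(box_w, box_h, img_w, img_h):
    """Area of the bounding box relative to the image area."""
    return (box_w * box_h) / (img_w * img_h)

def bboxdist(box_x, box_y, box_w, box_h, img_w, img_h):
    """Negated distance from the box centre to the image centre.

    The negation (as in the paper) means more central boxes score higher,
    matching rankers that assume a positive linear relation.
    """
    cx, cy = box_x + box_w / 2.0, box_y + box_h / 2.0
    return -math.hypot(cx - img_w / 2.0, cy - img_h / 2.0)

def textiv(concept, vocab):
    """One-hot concept vector (251-dimensional for this dataset)."""
    return [1.0 if c == concept else 0.0 for c in vocab]

# A box whose centre coincides with the image centre has bboxdist 0.
assert bboxdist(200, 140, 240, 200, 640, 480) == 0.0
assert abs(bboxsize(240, 200, 640, 480) - 0.15625) < 1e-9
```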
In early experiments, we experimented with using the absolute bounding box positions (x and y coordinates) as features. However, these features yielded poor performance, and were thus discarded in subsequent experiments.
We also explore combining the features to examine the contribution of each feature, to determine which features play a role in the content selection task.

Ranking algorithms
For ranking, we consider several algorithms commonly used in the Learning to Rank literature. We select one example from each group of approaches (pointwise, pairwise, listwise): • rforest: Random forests (Breiman, 2001), a pointwise ranking algorithm. We use the implementation of random forests in RankLib in this paper.
• svmrank: Ranking SVM (Joachims, 2002), a pairwise ranking algorithm. We use the SVMrank implementation (Joachims, 2006) of Ranking SVM with a linear kernel. • cascent: Coordinate ascent (Metzler and Croft, 2007), a listwise ranking algorithm. We optimise the rankings using NDCG@10 as the metric. Again, we use the implementation of coordinate ascent in RankLib.
For these algorithms, we compute the relevance score for each bounding box instance as the proportion of human-authored, gold standard descriptions that mention the concept. The task is to learn to predict the relevance score given the features in Section 3.2, and subsequently rank the bounding box instances for each image by this score. As such, the task is treated as a continuous regression problem. Our intuition is that pairwise and listwise ranking algorithms would suit our task better than pointwise algorithms, as pairwise/listwise ranking implicitly considers all other object instances as context rather than treating each instance independently, as in pointwise ranking. For example, a table might be important and frequently mentioned, but might not be mentioned when co-occurring with a kitchen.
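The relevance target can be sketched as below. Note this is a simplification for illustration: the dataset provides explicit box-to-term alignments, so no string matching against raw descriptions is actually needed, and the tokenisation here is ours.

```python
def relevance_score(concept_term, descriptions):
    """Fraction of an image's gold descriptions mentioning the concept's term.

    `concept_term` is assumed to be a lowercase surface word for the concept;
    real alignments come from the dataset's box-to-term annotations.
    """
    mentions = sum(1 for d in descriptions if concept_term in d.lower().split())
    return mentions / len(descriptions)

descs = ["a man throws a frisbee", "a man in a park", "someone playing outside"]
assert abs(relevance_score("man", descs) - 2 / 3) < 1e-9
```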

Stopping criteria
While the ranking process will result in a ranked list of all input object instances per image, there is a need to provide a cut-off point in the rankings for the eventual task of content selection.
From our initial experiments, we found that the number of selected object instances greatly affects the F-scores (see Section 4.1 for the evaluation measure). Selecting fewer good object instances per image raises precision at the expense of recall, while selecting more objects increases recall at the expense of precision.  propose a fixed threshold for the maximum number of object instances to be selected, and found that selecting 3 to 4 object instances yields an optimal balance between precision and recall (the mean number of unique bounding box instances per description is 2.89 in the development dataset). However, it may be more beneficial to have a variable threshold across images depending on the number of input object instances. For example, the bigram-based feature proposed in  has an internal stopping criterion, resulting in higher overall precision when compared to other fixed-length features.
Motivated by the high precision scores of the aforementioned system, in this paper we propose two variable stopping criteria: • absolute: Retaining only object instances with a predicted relevance score above a certain threshold.
• relative: Setting the cut-off point at the largest difference in relevance scores.
(Footnote: We also experimented with ordinal regression, where regression scores are partitioned into the set of integers {0, 1, 2, 3, 4} based on the relevance score, with 4 being the most relevant. We found performance to be lower in general, and thus only report results for continuous regression.)
In the former case (absolute), we first normalise the predicted score across bounding boxes per image, where the highest-ranked bounding box is assigned a score of 1 and the lowest-ranked a score of 0. We retain only bounding box instances where the normalised predicted score is above a threshold (0.5 in our experiments).
The motivation for the latter case (relative) stems from our observation that the relevance scores in the development set drop dramatically once the most important object instances are selected. For example, the most relevant object instances may have relevance scores of 0.9 and 0.8, followed by 0.2. Thus, a suitable cut-off point would be between 0.8 and 0.2. We refer to cutting off at the point that immediately precedes the biggest difference in scores (after 0.8 in the example above) as relative1 in our experiments. We also found that cutting off the ranked list after the point that follows the largest difference in scores (after 0.2 in the example above) produces a marginally higher F-score (increased recall at the expense of precision). We therefore also report results for this variant, which we refer to as relative2.
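Both stopping criteria can be sketched over a descending list of predicted scores; the function names below are ours, and the 0.5 threshold matches the one used in our experiments:

```python
def cutoff_absolute(scores, threshold=0.5):
    """Number of boxes to keep: those whose min-max normalised score
    exceeds the threshold. `scores` must be sorted in descending order."""
    hi, lo = scores[0], scores[-1]
    if hi == lo:
        return len(scores)
    normed = [(s - lo) / (hi - lo) for s in scores]
    return sum(1 for s in normed if s > threshold)

def cutoff_relative(scores, variant=1):
    """Cut at the largest drop between consecutive scores.

    variant=1 cuts immediately before the drop (relative1);
    variant=2 cuts one instance after it (relative2)."""
    gaps = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    i = max(range(len(gaps)), key=lambda k: gaps[k])
    return i + 1 if variant == 1 else i + 2

scores = [0.9, 0.8, 0.2, 0.1]
assert cutoff_relative(scores, variant=1) == 2  # keep 0.9, 0.8
assert cutoff_relative(scores, variant=2) == 3  # also keep 0.2
assert cutoff_absolute(scores) == 2             # normalised 0.2, 0.1 fall below 0.5
```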

Evaluation measure
Following the convention of the ImageCLEF2015 Sentence Generation challenge, we evaluate content selection using the fine-grained evaluation metric proposed in  and . More specifically, we measure the F-score (including Precision and Recall) when comparing the object instances selected by our system to the object instances mentioned in the gold standard human-authored image descriptions. The human upper bound is estimated by evaluating each description against the other descriptions of the same image and repeating the process for all descriptions.
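Per image, this evaluation reduces to set-overlap precision, recall and F-score between the selected instances and the gold-mentioned ones; a minimal sketch (treating instances as hashable identifiers) is:

```python
def prf(selected, gold):
    """Precision, recall and F-score of selected instances against gold."""
    selected, gold = set(selected), set(gold)
    tp = len(selected & gold)
    p = tp / len(selected) if selected else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

p, r, f = prf({"man", "frisbee", "tree"}, {"man", "frisbee"})
assert (p, r) == (2 / 3, 1.0)
assert abs(f - 0.8) < 1e-9
```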
We compare our results to the winning participants of past ImageCLEF challenges. RUC 2015  achieved the best performance in the 2015 edition  with high precision, but used an external image description dataset to train their joint CNN-LSTM image captioning system, and performed content selection in a retrospective manner. DUTh 2016 (Barlas et al., 2016) approached the task using a binary SVM classifier with bounding box localisation and visual features. We also compare our performance to the best reported results in  (W&G 2015), namely combining bigram and bounding box size priors with a stopping criterion of k = 3.

Combining features
We first report the results of concatenating all features (Section 3.2) into a single vector, and compare the performance of the various ranking algorithms (Section 3.3) and stopping criteria (Section 3.4). The intuition is that the ranking algorithm will perform automatic feature selection, identifying the most discriminative features for predicting the relevance score. Table 1 shows the results of using a combination of all features. The pointwise Random Forests ranker performs best overall, achieving an F-score of 0.70, close to the human upper bound of 0.74. This significantly exceeds the previous state-of-the-art result on the same training and test data of F = 0.56, as reported in . The coordinate ascent ranker and Ranking SVM achieved comparable scores, the latter perhaps having a slight edge. The performance of the various stopping criteria seems to depend on the ranking algorithm; the absolute stopping criterion in particular is sensitive to the choice of ranker. As expected, relative1 achieved higher precision than relative2, whereas relative2 achieved better recall with the additional object instance being selected.
In an earlier experiment, we explored combining only the features derived from bounding box localisation and concept labels, excluding image region features (imgemb). Interestingly, we found better performance when excluding image region features for cascent and svmrank, but not much difference for rforest (compare Table 1 and Table 2). This is very likely because the high-dimensional image features (4,096D) dominated the ranking decisions for these rankers, whereas rforest seemed less affected by the imbalance. The performance of cascent and svmrank in Table 1 is similar to that of using only image region features (cf. Table 5, discussed later), further supporting this explanation.

Individual features
We now explore each feature individually to investigate the contribution of each. Table 3 shows the results for the features derived from bounding box localisation (bboxsize and bboxdist). The same scores are obtained from both cascent and svmrank, possibly because both these features are single-dimensional. rforest requires higher dimensionality to operate, and as such is unable to handle these one-dimensional features.

Table 3: Mean Precision, Recall and F-score for features derived from bounding box localisation. Both cascent and svmrank return the same scores (shown). rforest is unable to handle single-dimensional vectors. The results for k=3 and k=4 are comparable to .

The results are consistent with what was reported by : whilst both bboxdist and bboxsize show that content selection is dependent on these features, bboxsize is a better predictor for an object being selected than bboxdist (this was demonstrated in the errata provided by  after the paper was published).

Table 4 shows the results for features derived from concept labels (textiv and textemb). For all three rankers, textemb seems to outperform textiv. The only exception is cascent with the absolute stopping criterion, where textiv seemed to give better precision than textemb. Comparing Table 3 and Table 4, we can see that features derived from concept labels are stronger predictors for content selection.

Table 5 shows the results of using only image region features (imgemb). Here, cascent does not perform as well as svmrank and rforest, due to the high dimensionality of the CNN embeddings. The performance of image region features seems to be on par with that of features derived from concept labels (Table 4), and better than that of bounding box features (Table 3). Noteworthy is how image region features generally yield higher recall than other features, at the expense of lower precision.

Feature ablation
We also performed a feature ablation study to gain insights into which features are important for content selection and into the interactions between features. This is done by testing different combinations of features to investigate which contribute most to the overall performance and thus play a bigger role in content selection.

Because of space constraints, we only provide a summary of interesting observations. Table 6 shows the F-scores for the rforest ranker with the absolute stopping criterion. We found that the features based on concept labels are dominant and influential in our experiments compared to those based on bounding box localisation or visual appearance. Combining textiv and textemb alone already yielded an F-score of 0.67. This demonstrates that semantic concept labels are the best predictors for content selection. Adding bboxsize to imgemb improves the F-scores marginally, suggesting that object size does play some role on top of visual appearance in selecting important objects. We also found that for the rforest ranker, textemb plays a larger role in predicting content selection than textiv, as evidenced by a greater drop in F-scores when omitting textemb compared to omitting textiv.

Discussion
We observed that the pointwise random forests ranker performs better than the pairwise- and listwise-based rankers. This is surprising, as we expected either pairwise- or listwise-based rankers to perform better than pointwise-based rankers, which treat each instance in a document as independent without considering other instances within the same document. It remains unclear whether this is due to the random forests classifier itself being strong, or to context playing a lesser role in content selection for this particular dataset. Further work is required to ascertain this.

Conclusion
We explored the content selection problem of deciding what needs to be mentioned in the description of an image, given labelled bounding boxes as input. We proposed casting the problem as a learning to rank task, where object instances that are more likely to be mentioned in human-authored descriptions are ranked higher than those less likely to be mentioned. Several features were explored: those derived from bounding box localisations, concept labels and visual appearances for each object instance. We also proposed methods to automatically estimate a cut-off point in each ranked list, to select only object instances that are likely to be mentioned in the image description. Our method showed excellent results, achieving a state-of-the-art F-score of 0.70 on the ImageCLEF2015 content selection dataset, substantially outperforming the highest figures previously reported on this test set. We also found that, among the proposed features, those derived from concept labels are better predictors for the content selection task than those derived from bounding box localisations or the visual appearance of regions. The proposed learning to rank approach is general and may also be relevant to content selection tasks in other areas of natural language generation. Future work could include exploring even stronger features. There is also scope to automatically gather a larger noisy dataset to enable more robust learning and reduce reliance on annotated training data. We hope that these additions will further improve the content selection capabilities of the proposed system.