VIFIDEL: Evaluating the Visual Fidelity of Image Descriptions

We address the task of evaluating image description generation systems. We propose a novel image-aware metric for this task: VIFIDEL. It estimates the faithfulness of a generated caption with respect to the content of the actual image, based on the semantic similarity between labels of objects depicted in images and words in the description. The metric can also take into account the relative importance of objects mentioned in human reference descriptions during evaluation. Even if these human reference descriptions are not available, VIFIDEL can still reliably evaluate system descriptions. The metric achieves high correlation with human judgments on two well-known datasets and is competitive with metrics that rely exclusively on human references.


Introduction
A popular task at the intersection of computer vision and natural language is image description generation (IDG), i.e. the task of generating as output a sentence describing the visual content of a given input image. While a variety of methods have been proposed for this task (Vinyals et al., 2015), its evaluation is still an understudied problem. Evaluation of IDG is currently performed in two ways: (i) human judgment; (ii) automatic metrics. Human judgments evaluate either the overall quality of descriptions or specific criteria in isolation (relevance, fluency, etc.). Such methods, however, can be subjective and expensive to scale.
Automatic metrics address the scalability issue by comparing candidate descriptions against human-authored reference descriptions. These metrics conflate various criteria implicitly into a single evaluation assumption, i.e. a good description is one that is similar to one or more human-authored descriptions, presuming that these gold descriptions are fluent, correct and relevant to the image. Existing automatic metrics are thus useful for measuring the quality of descriptions as a whole, but this makes it difficult to inspect the specific capabilities of IDG systems.
Footnote: Pranava Madhyastha and Josiah Wang contributed equally as joint first authors to this work.
We argue that a fine-grained metric measuring specific criteria would be more useful in understanding how one IDG system is better than another. We focus on one such criterion, visual fidelity. This criterion aims to measure how faithful a description is with respect to what is depicted in the image (i.e. systems should be rewarded for describing elements depicted in the image and penalised for describing things that are not depicted). For that, we propose to take image content into account when evaluating descriptions, in contrast to previous work (§2) that relies solely on words in the reference descriptions. Given that most datasets are crowd-sourced, reference descriptions may not always accurately describe the image (describing non-depicted objects, not mentioning relevant objects, e.g. Figure 1). For reliable evaluation, multiple references are needed. Image annotations (with objects, attributes, relations, etc.), on the other hand, are arguably more general and less ambiguous for evaluating visual fidelity: they require a single annotation per image, and are not affected by language or style preferences. To our knowledge, no existing metric for IDG has images factored explicitly into the evaluation process.
Our main contribution is therefore an automatic evaluation metric for IDG that measures the fidelity of image descriptions with respect to the image, using information derived from images directly as 'reference' (Figure 2). We name it VIFIDEL (VIsual Fidelity for Image Description EvaLuation). VIFIDEL can be used (i) based only on images as reference (no textual references) (§3.2), or (ii) in conjunction with textual references to take into account the relevant image content people describe (§3.3). In addition, VIFIDEL performs matching of images and text in an embedding space, thus drawing both modalities together semantically while avoiding the pitfalls of mainstream metrics that rely on exact or approximate string matching. This is done by building on the Word Mover's Distance (WMD) metric, which measures the distance between two texts in a word embedding space. Another contribution is the extension of WMD to allow for multiple references to be used to model object importance, i.e. an approach for consensus within WMD. We evaluate the performance of VIFIDEL against human judgments on two popular IDG datasets (§4).

Background
Various IDG metrics have either been adapted from other fields or proposed specifically for IDG. Examples of the former include BLEU (Papineni et al., 2002) and Meteor (Denkowski and Lavie, 2014) from machine translation evaluation, and ROUGE (Lin, 2004) (more specifically ROUGE L (Lin and Och, 2004)) from text summarisation evaluation. Metrics designed specifically for evaluating image descriptions include CIDEr (Vedantam et al., 2015), SPICE (Anderson et al., 2016) and BAST (Ellebracht et al., 2015).
One main weakness of most metrics (BLEU, ROUGE, Meteor, CIDEr) is that they rely on exact string matching to measure the surface-level, n-gram overlap between candidate texts and human references. This can result in data sparsity problems, especially with limited references. Meteor partially addresses this by matching synonyms from dictionaries and paraphrase tables, but it is constrained by the availability of such dictionaries, making it hard to scale to different languages. Metrics like SPICE and BAST address the issue of exact string matching by measuring similarity on a semantic level. However, these methods rely heavily on linguistic resources such as parsers, semantic role labellers, tailored rules, etc., making evaluation difficult to scale or adapt to different languages and domains. Word Mover's Distance (WMD) (Kusner et al., 2015) has also been proposed as an IDG metric (Kilickaya et al., 2017). To address data sparsity, WMD finds optimal alignments between word embeddings in candidate and reference descriptions instead of performing n-gram matching. VIFIDEL is inspired by WMD, but goes beyond using reference texts by comparing candidate texts against image content (§3.2).
Using images for IDG evaluation. As far as we are aware, no previous work uses images for evaluating IDG. The closest related work is that by Wang and Gaizauskas (2015), who propose an F-measure-based metric to evaluate the task of selecting relevant object instances to be mentioned in a description. The metric computes the overlap between selected object instances and objects mentioned in references, averaged over multiple references. The averaging process implicitly captures consensus over which objects should be mentioned (§3.3), i.e. objects mentioned in more references should be more important than those mentioned in fewer. Their work, however, requires manual correspondence annotations between bounding box instances and object mentions in descriptions. Our proposed method leverages word embeddings to circumvent the need for exact correspondence annotations.
Complementarity of metrics. Kilickaya et al. (2017) report that the n-gram-based metrics, the semantic-graph-based SPICE, and the embedding-based WMD capture complementary information, and that linearly combining Meteor, SPICE and WMD gives a better correlation score against human judgments. Similarly, Liu et al. (2017) optimise image description generation to capture both semantic faithfulness and syntactic fluency by combining metrics with complementary properties. However, the combination weights need to be engineered towards the task.

VIFIDEL
In this section we describe the VIFIDEL metric. It is inspired by Word Mover's Distance (WMD) (§3.1), which measures the distance between two documents in a word embedding space. VIFIDEL, however, explicitly incorporates semantic information derived from images (§3.2), which can be used with or without reference descriptions (§3.3), using word embeddings as a bridge for matching content in images to content in textual descriptions. In contrast to WMD, which compares pairs of documents, VIFIDEL also allows for multiple references to be taken into account with a consensus-based approach.

Word Mover's Distance (WMD)
WMD (Kusner et al., 2015) uses the relationships between word embeddings to compute the distance between two text documents. It captures the minimal distance required to move words from the first document to words in the second document. Let $X \in \mathbb{R}^{N \times K}$ be a matrix of $K$-dimensional word embeddings for a vocabulary of $N$ words, and let $x_i \in \mathbb{R}^{K}$ be the $K$-dimensional word embedding vector for word $i$. A document $\Delta$ is represented as an $N$-dimensional normalised bag-of-words (BOW) vector $d^{\Delta}$, where $d^{\Delta}_i$ is the normalised frequency of word $i$ occurring in document $\Delta$. Stop words are removed from documents; only content words are retained. Kusner et al. (2015) state that stop words are generally less relevant for capturing semantic similarity between documents, especially for bag-of-words representations.
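As a rough illustration of the normalised bag-of-words representation described above, the following sketch builds a sparse version of the vector (a mapping from word to relative frequency instead of a dense N-dimensional array). The stop-word list and the function name `nbow` are hypothetical; the source does not say which stop-word list is used.

```python
from collections import Counter

# Hypothetical stop-word list for illustration; the source does not specify one.
STOP_WORDS = {"a", "an", "the", "is", "on", "in", "of", "and", "with"}

def nbow(tokens, stop_words=STOP_WORDS):
    """Return a sparse normalised bag-of-words: word -> relative frequency."""
    content = [t.lower() for t in tokens if t.lower() not in stop_words]
    counts = Counter(content)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}

# Four content words survive stop-word removal, each with frequency 0.25.
d = nbow("a dog is playing with the toys on the beach".split())
```

Words absent from the mapping implicitly have frequency zero, which corresponds to the zero entries of the dense vector.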
As a measure of word-level dissimilarity, Kusner et al. (2015) propose the word travel cost, that is, the cost of moving from word $i$ to word $j$, based on the distance between the embeddings of the two words. More precisely, the cost is defined as:

$$c(i, j) = \lVert x_i - x_j \rVert_p \qquad (1)$$

where $p$ is usually 1 or 2 (we set $p = 2$, i.e. the Euclidean distance). This allows documents with many closely related words to have smaller distances than documents with very dissimilar words.
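The word travel cost can be sketched in a few lines: it is the p-norm of the difference between two embedding vectors, with p = 2 giving the Euclidean distance used in this paper. The toy two-dimensional embeddings below are invented for illustration and are not taken from any trained model.

```python
def travel_cost(x_i, x_j, p=2):
    """p-norm distance between two embedding vectors (p=2 is Euclidean)."""
    if len(x_i) != len(x_j):
        raise ValueError("embeddings must have the same dimensionality")
    return sum(abs(a - b) ** p for a, b in zip(x_i, x_j)) ** (1.0 / p)

# Toy 2-dimensional embeddings, invented for illustration only.
emb = {"dog": [1.0, 0.0], "puppy": [0.9, 0.1], "beach": [0.0, 1.0]}
near = travel_cost(emb["dog"], emb["puppy"])  # closely related words: small cost
far = travel_cost(emb["dog"], emb["beach"])   # unrelated words: larger cost
```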
To measure the distance between two documents $\alpha$ and $\beta$, WMD defines a transport matrix $T \in \mathbb{R}^{N \times N}$, where $T_{ij}$ is the proportion of word $i$ in $d^{\alpha}$ that needs to be transported to word $j$ in $d^{\beta}$. Formally, WMD computes the $T$ that optimises:

$$\mathrm{WMD}(d^{\alpha}, d^{\beta}) = \min_{T \geq 0} \sum_{i,j=1}^{N} T_{ij}\, c(i, j) \quad \text{s.t.} \quad \sum_{j=1}^{N} T_{ij} = d^{\alpha}_i \;\; \forall i, \qquad \sum_{i=1}^{N} T_{ij} = d^{\beta}_j \;\; \forall j \qquad (2)$$

Here, the normalised bag-of-words distributions of the documents $d^{\alpha}$ and $d^{\beta}$ are defined over the combined vocabulary of $d^{\alpha}$ and $d^{\beta}$, resulting in a square transport matrix $T$ of dimensionality $N \times N$. Kusner et al. (2015) note that WMD is a special case of Earth Mover's Distance (Rubner et al., 2000), popular in the computer vision community, or Wasserstein's Distance (Datta et al., 2008), popular in the optimal transport community.
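Solving the transport problem exactly requires a linear-programming or optimal-transport solver. As a dependency-free sketch, the block below instead computes the relaxed lower bound that Kusner et al. (2015) describe alongside WMD: one of the two marginal constraints is dropped, so every word sends all of its mass to its nearest counterpart in the other document. This is a swapped-in approximation for illustration, not the exact WMD, and the toy embeddings and document vectors are invented.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def one_sided(d_src, d_dst, emb):
    """Drop the incoming-mass constraint: each source word ships all of its
    mass to its nearest destination word."""
    return sum(
        mass * min(euclidean(emb[w], emb[v]) for v in d_dst)
        for w, mass in d_src.items()
    )

def relaxed_wmd(d_a, d_b, emb):
    """Lower bound on WMD: the tighter of the two one-sided relaxations."""
    return max(one_sided(d_a, d_b, emb), one_sided(d_b, d_a, emb))

# Toy embeddings and sparse normalised bag-of-words vectors (illustrative only).
emb = {"dog": [1.0, 0.0], "puppy": [0.9, 0.1],
       "toys": [0.0, 1.0], "beach": [-1.0, 0.0]}
d_alpha = {"dog": 0.5, "toys": 0.5}
d_beta = {"puppy": 0.5, "toys": 0.5}
d_gamma = {"beach": 1.0}

close = relaxed_wmd(d_alpha, d_beta, emb)     # related documents
distant = relaxed_wmd(d_alpha, d_gamma, emb)  # unrelated documents
```

For an exact value, the same cost matrix and marginals could be passed to a general LP or optimal-transport solver; the relaxation is used here only to keep the example short and self-contained.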

Using objects as image information
In this paper, we explore image information in the form of explicit object detections, using either gold or predicted object instances for a given image. Previous works (e.g. Yin and Ordonez, 2017) have found explicit object detections to be informative for image description generation. Thus, we base our intuition on the hypothesis that a thorough and true description of the image should consist of information about objects and their interactions (frequencies, etc.) in the environment. VIFIDEL has the capacity to capture these.
While objects represent one important type of semantic image information, VIFIDEL can potentially incorporate other semantic image information including attributes, actions, positions, scenes, relations between objects, and more fine-grained information such as colour. For this paper, we consider only the frequency of depicted object category instances as semantic information, and regard further enrichment as future work.
As mentioned, a motivation for using image information for evaluating image descriptions is that reference descriptions can be subjective, ambiguous and may or may not specify all (and only) the important elements of the image. In fact, they often focus on a subset of the image content. Using object labels can minimise these issues. In addition, as shown in Figure 1, references are not always actual descriptions and can be incorrect.
Another advantage of a metric based on object-level information is the cost of collecting data: either objects are predicted (no labelling involved) or, if gold labels are to be used for more reliable results, object annotations can be gathered in more trustworthy ways using a single annotation per image. With descriptions, it has been shown that multiple descriptions per image are needed for reliable evaluation.
Figure 2 gives an intuitive illustration of our metric. The top row shows a set of detected object instances in the image. The bottom row shows content words in a system description that are semantically very similar to the detected objects: dog, toys, etc. The description does not mention all detected objects (e.g. it misses table) and contains the word beach, which is not in the image. VIFIDEL aims to capture these discrepancies.
More specifically, VIFIDEL is defined as the similarity between the semantic content in image $I$ and a description $S$. It is specified by the inverse of the minimum cumulative cost required to move semantic labels (e.g. object categories) from image $I$ to words in the description $S$. This converts WMD from a distance measure to a similarity measure. Formally, the score (Eq. 3) is computed from $\mathrm{WMD}(d^{I}, d^{S})$, where $d^{S}$ is the normalised bag-of-words representation for description $S$ (§3.1), and $d^{I}$ is a semantic vector representation for image $I$, specifically a normalised bag of object category labels (§4.1). WMD is defined in Eq. 2.
In its basic form, VIFIDEL can provide information about the compatibility between an image and a description without using reference descriptions. We show the performance of VIFIDEL in the absence of reference descriptions in §4.

Modelling object importance with reference descriptions
We now expand the basic version of VIFIDEL to use (one or more) human references when available. Human references allow for capturing the human-likeness aspect of descriptions, that is, they capture what humans consider important to be described for a picture (Berg et al., 2012).
VIFIDEL can use human references as additional guidance to determine the importance of object content in a given image. We exploit the fact that each reference may only describe a particular subset of the image content, and we assume that important objects are mentioned more frequently across references than less important ones. Our proposal is similar to CIDEr in that we capture consensus information given a set of human references. However, we significantly differ in that (i) we use the references to explicitly model object importance, instead of directly comparing the candidates against the references; (ii) we perform word matching in a semantic space using word embeddings rather than surface forms.
VIFIDEL also differs from previous approaches using WMD for image description evaluation (Kilickaya et al., 2017), where the metric is only computed using the description and the single closest reference. One of the problems in such an approach is the biased choice of reference: in this case, the reference with the smallest WMD distance from the system description. A better reference may be available, e.g. one mentioning more image content, which would lead to lower system scores in an unbiased evaluation. This has been a common problem in metrics based on a single reference (Fomicheva and Specia, 2016).
Another contribution in this paper is therefore the incorporation of an object importance model into the WMD framework using human references. Our approach rewards candidate descriptions that mention objects depicted in the image (i.e. are faithful to image content), provided that these objects are also mentioned frequently across all references (i.e. are important objects). Intuitively, the WMD cost function (Eq. 1) is replaced with a weighted Euclidean distance. These weights are derived from human descriptions. While the original cost function captures the faithfulness of candidate descriptions to depicted objects, the weights extend the function such that the cost is lower for words that are mentioned frequently across references (either as an exact match or a semantically similar match), and higher for those mentioned less frequently. The weights are applied to both the object labels from images and the words of candidate descriptions.
Formally, let $R^{I} = (R^{I}_1, R^{I}_2, \ldots, R^{I}_M)$ be a set of $M$ human references for image $I$. The per-image penalty weight $\rho^{I}_k$ for a word $k$ (an object label in image $I$ or a content word in a candidate description $S^{I}$) is computed as:

$$\rho^{I}_k = \frac{1}{M} \sum_{r=1}^{M} \min_{t \in \{R^{I}_r\}} \frac{1 - \cos(x_k, x_t)}{2} \qquad (4)$$

where $\{R^{I}_r\}$ is the set of content words in the $r$-th reference for image $I$, and $x_t$ is the word embedding for word $t$. The denominator 2 ensures that $\rho^{I}_k$ is always in the range [0, 1].
For each image $I$, we compute the penalty $\rho^{I}_k$ for each $k \in \{t \mid d^{I}_t > 0\} \cup \{t \mid d^{S^{I}}_t > 0\}$, i.e. the union of all labels for objects depicted in $I$ and all content words in the candidate description $S^{I}$ to be evaluated. Thus, $\rho^{I}_k$ is the rescaled cosine distance ($\in [0, 1]$) between each word/object label $k$ and its most similar content word in each human reference, averaged over all references for the image. The averaging process implicitly captures the consensus over which objects should be mentioned. $\rho^{I}_k$ will be small for words/object labels that are mentioned across most references (using the exact word or a semantically similar word), and large for those that are mentioned only by a few.
The proposed approach of integrating object importance replaces the cost $c(i, j)$ in Eq. 2 with a weighted cost $c(i, j \mid R^{I})$ to move from word $i$ to word $j$ given references $R^{I}$, in which the Euclidean distance is weighted by the penalties of the two words. The updated Eq. 2 is then used in Eq. 3 to compute a VIFIDEL score weighted by object importance. Figure 3 illustrates a concrete example of VIFIDEL's object importance model using human references, showing how the cost $c(i, j \mid R^{I})$ is calculated.
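The penalty described above can be sketched as follows: for each reference, take the halved cosine distance between the word and its most similar content word in that reference, then average over references. The function names, toy embeddings and toy references are assumptions made for this illustration only.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; lies in [0, 2] for non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def penalty(word, references, emb):
    """Average, over references, of the halved cosine distance between `word`
    and its closest content word in that reference. Result lies in [0, 1]."""
    per_ref = [
        min(cosine_distance(emb[word], emb[t]) / 2.0 for t in ref)
        for ref in references
    ]
    return sum(per_ref) / len(per_ref)

# Toy embeddings and references (content words only), invented for illustration.
emb = {
    "cat": [1.0, 0.0], "kitten": [0.9, 0.1],
    "book": [0.0, 1.0], "table": [-1.0, 0.2],
}
references = [["cat", "book"], ["kitten", "book"], ["cat"]]

rho_cat = penalty("cat", references, emb)      # matched in every reference: near 0
rho_table = penalty("table", references, emb)  # never mentioned: much larger
```

In this toy setting, cat has an exact or near-exact match in every reference and so receives a penalty close to zero, while table has no close match and is penalised more heavily.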

Experiments
We experiment with VIFIDEL on two datasets: the PASCAL-50S Consensus dataset (§4.2) and human ratings on MSCOCO (§4.3). We compare VIFIDEL against commonly used IDG metrics.

Visual annotation and detectors
We test the performance of the following metric variants, where VIFIDEL gold, VIFIDEL D500 and their union are reported in the main experiments (§4.2 and §4.3), while the remaining variants are used for an ablation study (§4.5):
• VIFIDEL gold : This variant uses gold standard, object-level annotations provided by the respective datasets. For the PASCAL-50S dataset, we use the annotations for 20 pre-defined object categories (person, car, cow, etc.) provided by the PASCAL VOC challenge (Everingham et al., 2015). For MSCOCO, we use annotations for 80 object categories provided by MSCOCO (Lin et al., 2014). In both datasets, the reference descriptions are sourced independently of the image annotations; thus there is no direct correspondence between the visual annotations and the descriptions.
• VIFIDEL D80 : This variant uses the output of an object detector pre-trained on the MSCOCO dataset, for 80 MSCOCO categories. We use the TensorFlow Object Detection API (Huang et al., 2017) for this purpose. We set 0.6 as the confidence threshold for detected objects.
• VIFIDEL D500 : This variant uses the output of an object detector pre-trained on the Open Images dataset (Krasin et al., 2017) with bounding box annotations for 545 object categories. Again, we use the TensorFlow Object Detection API (footnote 3), and set the confidence threshold to 0.4.
• VIFIDEL gold ∪ D500 : We combine the outputs of the gold annotation and D500 detector and use unique object labels from the combination.
In this paper, we use only the output labels of the detectors, and represent the content of an image $I$ as a vector of normalised frequencies over object labels, $d^{I}$. A discussion on how the performance of the metric can vary according to the quality of the objects available is given in §4.5. The ideal setting would count on a comprehensive list of objects given by humans.
Footnote 3: faster_rcnn_inception_resnet_v2_atrous_oid.

Figure 3: In this example, we compute the weights $\rho^{I}_k$ for encyclopedias, cat and books. The most similar words to encyclopedias across the reference descriptions are books, notebook and binder, according to the cosine distance between their word embeddings. These distances are averaged to obtain a consensus penalty score. The word cat has a low penalty score because it has either an exact match or a very close semantic match (kitten) in all references. The penalty scores are then used as weights to compute the cost c(book, encyclopedias | I) between the object label book from the image and the word encyclopedias in the candidate description.
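A minimal sketch of how an image vector might be built from detector output is given below. It assumes detections arrive as (label, confidence) pairs and that a detection is kept when its confidence is at least the threshold; both the input format and the inclusive comparison are assumptions, since the source only states the threshold values (0.6 for D80, 0.4 for D500).

```python
from collections import Counter

def image_vector(detections, threshold=0.6):
    """Normalised frequencies of object labels whose confidence passes the
    threshold. `detections` is a list of (label, confidence) pairs, which is
    a hypothetical format chosen for this sketch."""
    labels = [label for label, score in detections if score >= threshold]
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}

def union_labels(gold_labels, detected_labels):
    """Unique object labels from gold annotations and detector output combined."""
    return sorted(set(gold_labels) | set(detected_labels))

# Two dogs and one toy pass the 0.6 threshold; the table detection is dropped.
detections = [("dog", 0.92), ("dog", 0.71), ("toy", 0.65), ("table", 0.30)]
d_I = image_vector(detections)
combined = union_labels(["person", "dog"], ["dog", "toy"])
```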

Accuracy on PASCAL-50S
In this section, we focus on the PASCAL-50S dataset and tackle the binary forced-choice task of predicting: "which description is more similar to A: B or C?", as proposed by Vedantam et al. (2015). We focus on the variant comparing two machine generated captions. The dataset contains multiple crowdsourced image descriptions for each of 1,000 images from the UIUC PASCAL dataset (Rashtchian et al., 2010).
Evaluation of system outputs as relative rankings has long been established as the best practice in many fields where language outputs are produced and no single correct output exists. The WMT yearly evaluation campaigns for machine translation (Bojar et al., 2016), for example, have argued that relative ranking leads to more reliable judgments than absolute scores. We therefore consider our findings on this dataset as the most important.
For the binary forced-choice task, Vedantam et al. (2015) collected 48 descriptions A per image, and formed pairs of descriptions B and C from machine generated descriptions and/or the remaining two human descriptions. We use the (MM) setting of the dataset: comparing two machine generated descriptions. Arguably, this is the most interesting subtask from a practical point of view, as it is the setting in which evaluation is generally performed.
The dataset thus consists of 1,000 (B,C) pairs. Crowd-sourced gold standard annotations were provided for each of the 48×1,000 (A,B,C) triplets, and the 'consensus' binary label per (B,C) pair was obtained by majority vote across the 48 references. We also provide results on the other splits in Appendix A.

Table 1: Accuracy of VIFIDEL on the PASCAL-50S binary forced-choice task using 0, 1, 5 and 48 references, comparing two machine generated descriptions.

Table 1 presents the accuracies for the binary forced-choice task for VIFIDEL, compared to the most commonly used IDG metrics: BLEU 1 , BLEU 4 , ROUGE L , METEOR, CIDEr-D, and SPICE. We also report results with the standard WMD as the metric and an RNN language model trained on training captions from MSCOCO (Chen et al., 2015). Note that standard WMD is already a strong metric, but it relies on references, like all other metrics in the table and unlike VIFIDEL. VIFIDEL only uses references to up-weight or down-weight the matches between image objects and words in the captions. Notable is the metric's performance based solely on image information: VIFIDEL can distinguish between a correct and an incorrect description in the difficult task of differentiating between two machine generated descriptions, agreeing with human judgments more often than any other IDG metric that uses a reference. With one reference description, VIFIDEL gave the highest accuracy, suggesting that image-side information is indeed helpful for evaluating the quality of image descriptions when comparing two machine generated descriptions.
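The forced-choice evaluation can be sketched as below: a consensus label is obtained per (B, C) pair by majority vote, and a metric is counted as correct when it scores the consensus candidate higher. The function names, the toy votes and scores, and the tie-breaking rule (ties go to B) are assumptions made for illustration; the source does not describe how ties are handled.

```python
from collections import Counter

def consensus_label(votes):
    """Majority vote over per-reference judgments ('B' or 'C')."""
    return Counter(votes).most_common(1)[0][0]

def forced_choice_accuracy(scores_b, scores_c, human_labels):
    """Fraction of (B, C) pairs where the metric prefers the same candidate as
    the human consensus. Ties are counted as 'B', an arbitrary choice here."""
    correct = 0
    for s_b, s_c, label in zip(scores_b, scores_c, human_labels):
        predicted = "B" if s_b >= s_c else "C"
        correct += predicted == label
    return correct / len(human_labels)

# Toy data: three (B, C) pairs, each judged against three references.
votes = [["B", "B", "C"], ["C", "C", "B"], ["B", "C", "B"]]
labels = [consensus_label(v) for v in votes]
scores_b = [0.8, 0.3, 0.4]  # metric scores for candidate B
scores_c = [0.5, 0.6, 0.7]  # metric scores for candidate C
accuracy = forced_choice_accuracy(scores_b, scores_c, labels)
```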

Results and discussion
Even when we take more references into account, VIFIDEL is as good as or better than reference-based metrics. With 5 references (a feasible number from a practical perspective), VIFIDEL achieves the highest accuracy with a slight improvement in score over zero or one reference. Thus, the object weighting scheme can be seen to help VIFIDEL focus on which objects are more important. With 48 references (the maximum possible, only available in PASCAL-50S), VIFIDEL is still more accurate than existing metrics, showing that visual fidelity is an important factor for rating machine generated descriptions. Figure 4 depicts accuracies over an increasing number of reference descriptions, starting from zero references (only defined for VIFIDEL). We note that VIFIDEL is more stable and consistently outperforms other metrics for all numbers of references. VIFIDEL D500 has a very slight advantage over VIFIDEL gold , most likely because the visual information is richer in the former.
Overall, we conclude that measuring visual fidelity is important especially for ranking two machine generated descriptions, arguably the most important evaluation setting, and that VIFIDEL can measure visual fidelity by explicitly using information derived from images, particularly when few or no references are available.

Correlation with human judges
We measure correlation with human judgments on the MSCOCO portion of the COMPOSITE dataset (Aditya et al., 2018). The dataset contains 2,007 candidate images from the MSCOCO dataset (Lin et al., 2014) with annotations from AMT workers. These annotations consist of judgments on a scale of 1 (low) to 5 (high) regarding (i) relevance of a description to the image and (ii) thoroughness of a description given an image. Candidate descriptions were sampled from human references and image description systems. To make our settings closer to those of real system evaluations, we consider only system generated descriptions as candidates and evaluate the Spearman's correlation between the automated metrics and human judgments. As mentioned in §4.2, absolute human judgments for tasks such as this can be very subjective and therefore less reliable, especially when a single judgment is collected per description.
Results and discussion. We summarise our results in Table 2. For these experiments, we report gold ∪ D500 detectors for object information, as this gave the best overall performance (non-gold and other variants can be seen in §4.5). Our key finding is that VIFIDEL using no reference descriptions obtains comparable (albeit still lower) correlation to metrics like BLEU and ROUGE with only one reference description. The gap between VIFIDEL and such metrics can be attributed to the inability of the former to capture fluency, since it relies essentially on a bag of word embeddings.
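For reference, Spearman's correlation is the Pearson correlation of the rank-transformed scores, with tied values sharing their average rank. The sketch below implements this with the standard library only; the metric scores and 1 to 5 human ratings are invented toy values, not results from the paper.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy metric scores and 1-5 human ratings for five descriptions.
metric_scores = [0.12, 0.48, 0.33, 0.90, 0.51]
human_ratings = [1, 3, 3, 5, 4]
rho = spearman(metric_scores, human_ratings)
```

Because human ratings on a five-point scale contain many ties, the average-rank treatment matters in practice.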

Combining VIFIDEL with other metrics
As mentioned, VIFIDEL is a fine-grained metric that exclusively evaluates the visual fidelity of an image description. As such, by design, an IDG system may achieve high VIFIDEL scores simply by listing all objects depicted in the image. We do not consider this an issue, since we are in effect evaluating what the image description describes (visual fidelity), rather than how the image is being described (fluency). The latter can instead be evaluated separately with a metric designed specifically for the purpose. Thus, a system that simply lists all objects will receive a high VIFIDEL score but a low fluency score. This is in line with our vision of evaluating the specific capabilities of IDG systems to clearly understand how one system is better than another. While this is not the aim of the paper, we also demonstrate how VIFIDEL can be combined with fluency-based metrics to evaluate image descriptions as a single conflated metric. More specifically, we explore combining VIFIDEL with two different fluency-based strategies: an RNN language model and CIDEr (other metrics are possible; CIDEr provides a good compromise between performance and efficiency). In both cases, we simply averaged the scores of the two metrics. The results are shown at the bottom of Tables 1 and 2. On average, the addition of the fluency-based metric is complementary. VIFIDEL +CIDEr performs better than VIFIDEL +LM. This is expected, as LM only provides perplexity scores given a description, while CIDEr explicitly measures quality against references. We note that better combinations can potentially be achieved by learning weights for a weighted average and optimising the training towards fluency.
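The combination described here is a plain unweighted mean of two scores for the same description. The sketch below shows that, plus the weighted variant suggested as a possible improvement. It assumes both scores have already been brought to comparable ranges, which the source does not detail; the function names and example values are hypothetical.

```python
def combine(score_a, score_b):
    """Unweighted mean of two metric scores for the same description."""
    return (score_a + score_b) / 2.0

def combine_weighted(score_a, score_b, weight_a=0.5):
    """Weighted variant; weight_a would be learned rather than hand-set."""
    return weight_a * score_a + (1.0 - weight_a) * score_b

# Toy scores for one description (illustrative values only).
fidelity_score = 0.6
fluency_score = 0.2
combined = combine(fidelity_score, fluency_score)
```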

Ablation studies
Effect of object detectors and frequency counts: Here we study the effect of using different object detectors. We also investigate the contribution of frequency counts in $d^{I}$, by binarising the frequency counts to indicate only the presence or absence of the objects. The hypothesis is that the number of object instances may be useful for evaluating visual fidelity. The results are summarised in Table 3. The combination of gold and D500 object detectors performed the best for the dataset. The gold object information is only slightly better than D80 prediction-based object information. Interestingly, D500, which is more fine-grained, performs as well as the 80-category gold object information. Using a binarised $d^{I}$ seemed to give comparable correlation, perhaps even with a marginal edge over its frequency-based counterpart. We postulate that this could be because frequency counts are likely to be mentioned in the descriptions via quantifiers and other morphological and typological variants, which cannot be easily mapped to the frequency of detected objects.
Effect of number of detected objects: We used a pre-trained captioning system (Anderson et al., 2018) to generate captions on a sample of images from the MSCOCO validation set that have different gold object annotations. We present two examples in Table 4. In the first example the image contains only one object. Here VIFIDEL relies both on the object and on the semantic similarity between the caption and references. In the second example there are fifteen objects; however, some are more important than others for describing the image. VIFIDEL gives higher importance to objects that are mentioned in the references. In Figure 5 we further explore the third example from our ablation studies by computing VIFIDEL for different subsets of object annotations, ranging from one object to fifteen object annotations. We see that, as the number of objects increases from one to two, the VIFIDEL score also increases, given that these objects are also mentioned in the system caption. However, with more objects the scores go down, since these are not mentioned in the caption, but the decrease is gradual, even with 15 objects. This happens because such additional objects do not seem too relevant to humans, as these are mostly not mentioned in the references. Therefore, the scores from VIFIDEL change with both the number of annotations and the reference descriptions, while metrics like METEOR (0.354) and SPICE (0.320) would remain constant.
Effect of word representations: We also studied the effect of various pre-trained embeddings and found that the pre-trained model of word2vec 300-dimensional CBOW embeddings (Mikolov et al., 2013) is slightly better than GloVe (Pennington et al., 2014) and FastText embeddings (Joulin et al., 2017). This could be because of the amount of data on which these were trained.
FastText embeddings had similar performance to word2vec embeddings even when trained only on Wikipedia. For consistency, we used the word2vec embeddings pre-trained on Google News.
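Binarising the image vector can be sketched as replacing each non-zero frequency with a presence indicator. The version below also renormalises so that the entries still sum to one, as a transport formulation requires; that renormalisation step is an assumption of this sketch and is not stated in the source.

```python
def binarise(d_I):
    """Replace frequencies with presence indicators, then renormalise so the
    vector still sums to 1 (renormalising is an assumption of this sketch)."""
    present = [label for label, freq in d_I.items() if freq > 0]
    if not present:
        return {}
    return {label: 1.0 / len(present) for label in present}

# Three dogs and one toy become one "dog" and one "toy" after binarisation.
d_I = {"dog": 0.75, "toy": 0.25}
d_I_binary = binarise(d_I)
```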

Conclusions
We have introduced a new metric for image description evaluation that goes beyond comparing descriptions to human references and is explicitly based on object-level image information. Our hypothesis is that the use of image information provides a more reliable pathway for measuring the fidelity of a description for a given image. Further, the metric relies on off-the-shelf object detectors and word embeddings and computes the scores in a semantic space. Our analysis on two of the most widely used datasets for metric comparison shows that our metric correlates well with human judgments, and is particularly well suited to settings where few or no reference descriptions are available. The metric performs comparably for gold and predicted annotations on objects and is lightweight in terms of dependency on linguistic resources. Our implementation of VIFIDEL can be accessed from: https://github.com/ImperialNLP/vifidel

Table 6: Accuracy of VIFIDEL and other IDG metrics on the PASCAL-50S binary forced-choice task using 5 and 48 references.