REO-Relevance, Extraness, Omission: A Fine-grained Evaluation for Image Captioning

Popular metrics used for evaluating image captioning systems, such as BLEU and CIDEr, provide a single score to gauge the system’s overall effectiveness. This score is often not informative enough to indicate what specific errors are made by a given system. In this study, we present a fine-grained evaluation method REO for automatically measuring the performance of image captioning systems. REO assesses the quality of captions from three perspectives: 1) Relevance to the ground truth, 2) Extraness of the content that is irrelevant to the ground truth, and 3) Omission of the elements in the images and human references. Experiments on three benchmark datasets demonstrate that our method achieves a higher consistency with human judgments and provides more intuitive evaluation results than alternative metrics.


Introduction
Image captioning is an interdisciplinary task that aims to automatically generate a text description for a given image. The task is fundamental to a wide range of applications, including image retrieval (Rui et al., 1999) and vision language navigation (Wang et al., 2019). Though remarkable progress has been made (Gan et al., 2017; Karpathy and Li, 2015), the automatic evaluation of image captioning systems remains a challenge, particularly with respect to quantifying the generation errors made by these systems (Bernardi et al., 2016).
Existing metrics for caption evaluation can be grouped into two categories: 1) rule-based metrics (Papineni et al., 2002; Vedantam et al., 2015) that are based on exact string matching, and 2) learning-based metrics (Cui et al., 2018; Sharif et al., 2018) that use a learned model to predict the probability that a test caption is human-generated. In general, prior work has shown that description adequacy with respect to the ground truth data is a main concern for evaluating text generation systems (Gatt and Krahmer, 2018). Though this aspect has been emphasized by prior work for assessing image captions (Papineni et al., 2002; Banerjee and Lavie, 2005; Gao et al., 2019), one common limitation of existing metrics is their lack of interpretability with respect to description errors, because they only provide a composite score for caption quality. Without fine-grained analysis, developers may not be able to understand the specific description errors made by their captioning systems.
To fill this gap, we propose an evaluation method called REO that measures each caption along three specific aspects: 1) Relevance: relevant information of a candidate caption with respect to the ground truth, 2) Extraness: extra information of a candidate caption beyond the ground truth data, and 3) Omission: missing information that a candidate fails to describe from an image and human-generated reference captions. Figure 1 shows a comparison between existing metrics and our proposed metrics, which measure caption quality at a fine-grained level. If we view caption generation as a process of decoding the information embedded in an image, we can evaluate an image captioning system by measuring the effectiveness of the decoding process in terms of the relevance of the decoded information to the image content, and the amount of missing or extra information. Using both the images and the reference captions as ground truth information for evaluation, our approach builds on a shared image-text embedding space defined by a grounding model that has been pre-trained on a large benchmark dataset. Given a pair of vectors representing a candidate caption and its ground truth (i.e., the target image and associated reference captions), respectively, we compute the relevance of the candidate caption to the ground truth based on vector similarities. By applying vector orthogonal projection, we identify the extra and missing information carried by the candidate caption. Each aspect that we consider here (i.e., relevance, extraness, omission) is measured by an independent score.
We test our method on three datasets. The experimental results show that our proposed metrics are more consistent with human evaluations than alternative metrics. Interestingly, our study finds that human annotators pay more attention to extra or missing information in a caption (i.e., false positives and false negatives) than to the caption's relevance for the given image (true positives). We also find that considering both image and references as ground truth information is more helpful for caption evaluation than considering image or references individually.

Methods
Figure 2 provides an overview of the calculation of REO, which happens in two stages. The first stage is feature extraction, where we aim to obtain feature vectors that encode the candidate caption C and the corresponding ground truth G for further comparison. The second stage is to measure the three metric scores. Specifically, we measure relevance using standard cosine similarity. To measure irrelevance (i.e., extraness and omission), we compare the information carried by C and G, respectively. We give a detailed description of our method in the following two subsections.

Feature Extraction
Following Lee et al., we leverage a pretrained Stacked Cross Attention Neural Network (SCAN) to build a multi-modal semantic space. Specifically, we obtain word features $h^{\tau}$ by averaging the forward and backward hidden states per word from a bidirectional GRU (Bahdanau et al., 2014), where $\tau \in \{C, R\}$ denotes either a candidate $C$ or a reference $R$ sentence of $M$ words. Based on Anderson et al., we obtain image features $U \in \mathbb{R}^{N \times D}$ by detecting $N$ salient regions per image ($N = 36$ in this paper). A linear layer is applied to transform the image features into $D$-dimensional features $V$. Based on the SCAN model, we further extract the context information from the caption words for each detected region. To this end, we compute a context feature $a_i^{\tau}$ for the $i$-th region as a weighted sum of caption word features (Eq. (1)). Notice that $A^{\tau}$, the collection of these context features, extracts the context information of the caption $\tau$ with respect to all regions in the image. In Eq. (1), $\lambda$ is a smoothing factor, and $\mathrm{sim}(v, h^{\tau})$ is a normalized similarity function.
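The equations for this step did not survive in the text above. As a hedged sketch only, assuming the method reuses the image-text attention of the SCAN model (Lee et al.) unchanged, the context feature and the normalized similarity could be written as follows. The index names $j$, $k$, $i'$ and the attention weight $\alpha_{ij}$ are notation introduced here for illustration, and the exact form used by the authors may differ.

```latex
\begin{align}
a_i^{\tau} &= \sum_{j=1}^{M} \alpha_{ij}\, h_j^{\tau},
\qquad
\alpha_{ij} = \frac{\exp\!\big(\lambda\, \mathrm{sim}(v_i, h_j^{\tau})\big)}
                   {\sum_{k=1}^{M} \exp\!\big(\lambda\, \mathrm{sim}(v_i, h_k^{\tau})\big)} \\
\mathrm{sim}(v_i, h_j^{\tau}) &= \frac{\max\!\big(0, \cos(v_i, h_j^{\tau})\big)}
                   {\sqrt{\sum_{i'=1}^{N} \max\!\big(0, \cos(v_{i'}, h_j^{\tau})\big)^{2}}}
\end{align}
```

Under this reading, a larger $\lambda$ concentrates the weights $\alpha_{ij}$ on the words most similar to region $i$, while a smaller $\lambda$ spreads them more evenly across the caption.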

Metric Scores
In order to explore the impact of image data on evaluation, we focus on comparing the candidate's context features $A^{C}$ with the ground truth $G$, where $G$ denotes either the image features $V$ or the context features of $R$ (i.e., $A^{R}$).
Relevance: The relevance between a candidate caption and a ground-truth reference at the $i$-th region is computed as the cosine similarity of $a_i^{C}$ and $g_i$. We average the similarity over all regions to get the relevance score of a candidate caption with respect to an image.
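The relevance computation described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names `cosine` and `relevance`, the list-of-lists representation, and the toy vectors are assumptions made here.

```python
import math

def cosine(x, y):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)

def relevance(cand_ctx, gt_feats):
    """Mean per-region cosine similarity between candidate context
    features (rows a_i^C) and ground-truth features (rows g_i)."""
    if not cand_ctx or len(cand_ctx) != len(gt_feats):
        raise ValueError("need the same, non-zero number of regions on both sides")
    sims = [cosine(a, g) for a, g in zip(cand_ctx, gt_feats)]
    return sum(sims) / len(sims)

# Toy example with N = 2 regions and D = 2 dimensions (hypothetical values).
A_C = [[1.0, 0.0], [1.0, 1.0]]
G = [[2.0, 0.0], [0.0, 3.0]]
score = relevance(A_C, G)
```

In the toy example the first region is perfectly aligned and the second is at 45 degrees, so the score falls between the two per-region similarities.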
Extraness: The extraness of $C$ is captured by performing an orthogonal projection of $a_i^{C}$ onto $g_i$, which returns the vertical context vector $a_{i\perp}^{C}$ representing the content of $C$ that is irrelevant to the ground truth at the $i$-th region.
To avoid potential disturbance due to correlated feature vectors, we measure the Mahalanobis distance between the vertical context vector $a_{i\perp}^{C}$ and its original context vector $a_i^{C}$ (see Eq. (7)). Notice that a small distance indicates that the irrelevant context vector $a_{i\perp}^{C}$ is close to the original context vector $a_i^{C}$; in other words, the original context contains a large amount of extra information. Therefore, the higher this metric is, the less extra information the caption contains.
Omission: The measurement of omission is similar to that of extraness: we capture the missing information of $C$ by the vertical context features $g_{i\perp}$ obtained from the orthogonal projection of $g_i$ onto $a_i^{C}$. The omission score is denoted as $O$. Similarly, the higher the omission score is, the less missing information the caption contains.
Considering that each image may have multiple reference captions, we further average the scores of the aforementioned three aspects over all reference captions when $A^{R}$ is used as ground truth.
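The projection-based scores above can be illustrated with a short sketch. Several details are assumptions made here, not taken from the text: the covariance used by the Mahalanobis distance is not specified, so the sketch accepts an optional inverse covariance matrix `inv_cov` and falls back to the identity (plain Euclidean distance); per-region distances are averaged over regions; and all function names are hypothetical.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def orthogonal_component(x, onto):
    """Part of x that is perpendicular to `onto` (x minus its projection)."""
    denom = dot(onto, onto)
    if denom == 0.0:
        return list(x)  # nothing to project onto
    scale = dot(x, onto) / denom
    return [xi - scale * oi for xi, oi in zip(x, onto)]

def mahalanobis(x, y, inv_cov=None):
    """Mahalanobis distance; inv_cov=None means identity (an assumption)."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    if inv_cov is None:
        return math.sqrt(dot(diff, diff))
    return math.sqrt(dot(diff, [dot(row, diff) for row in inv_cov]))

def extraness(cand_ctx, gt_feats, inv_cov=None):
    """Mean distance between a_i_perp^C and a_i^C (higher = less extra content)."""
    dists = [mahalanobis(orthogonal_component(a, g), a, inv_cov)
             for a, g in zip(cand_ctx, gt_feats)]
    return sum(dists) / len(dists)

def omission(cand_ctx, gt_feats, inv_cov=None):
    """Mean distance between g_i_perp and g_i (higher = less missing content)."""
    dists = [mahalanobis(orthogonal_component(g, a), g, inv_cov)
             for a, g in zip(cand_ctx, gt_feats)]
    return sum(dists) / len(dists)

def average_over_references(score_fn, cand_ctx_per_ref, ref_ctx_per_ref):
    """Average a score over several references, one (A^C, A^R) pair each."""
    scores = [score_fn(a, g) for a, g in zip(cand_ctx_per_ref, ref_ctx_per_ref)]
    return sum(scores) / len(scores)

# Toy single-region example (hypothetical values).
E = extraness([[1.0, 1.0]], [[2.0, 0.0]])
O = omission([[1.0, 1.0]], [[2.0, 0.0]])
```

With the identity covariance, the distance between a vector and its perpendicular component equals the length of its projection, which matches the reading above: a longer projection onto the ground truth means less extra (or less missing) content.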

Experimental Setup
We perform experiments on three human-evaluated caption sets.
Composite Dataset (Aditya et al., 2015) contains candidate captions for images from MS-COCO, Flickr8k, and Flickr30k. Captions were generated by humans and two caption models (11,985 instances in total). Human judgments for these candidate captions were provided on a 5-point rating scale that represents description correctness.
PublicSys Dataset has 2,500 captions collected by Rohrbach et al., which were generated by five state-of-the-art captioning systems on 500 MS-COCO images. Human grading was done on a 5-point scale based on annotators' preferences for the descriptions.
Pascal-50S Dataset (Vedantam et al., 2015) includes 4,000 caption pairs that describe images from the UIUC PASCAL Sentence dataset. Each annotator was asked to select the sentence in each pair that is closer to the given reference sentence. Candidate pairs were grouped into four categories: 1) human-human correct (HC, i.e., both captions are written by humans for the same image), 2) human-human incorrect (HI, i.e., two human-written captions, one of which describes another image instead of the target image), 3) human-machine (HM, i.e., one caption is written by a human and the other is generated by a machine), and 4) machine-machine (MM, i.e., both captions are machine-generated).
Following standard practice (Anderson et al., 2016; Elliott and Keller, 2014), we compared with rule-based metrics (see Table 1). All existing metrics were implemented with the MS-COCO evaluation tool (https://github.com/tylin/coco-caption). The performance of the metrics was assessed via Kendall's tau (τ) rank correlation for the scoring-based datasets (i.e., Composite and PublicSys) and via the accuracy of pairwise comparison for the Pascal-50S dataset.
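The two assessment protocols mentioned above can be sketched as follows. The text does not say which variant of Kendall's tau was used; the sketch assumes tau-b, which corrects for ties and is a common choice for 5-point ratings. The function names, the tie-handling in the pairwise comparison, and the toy numbers are all illustrative assumptions.

```python
import math

def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists (O(n^2) pair count)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: ignored by tau-b
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + tied_x_only) * (untied + tied_y_only))
    # Returning 0.0 for a constant list is a convention chosen for this sketch.
    return 0.0 if denom == 0 else (concordant - discordant) / denom

def pairwise_accuracy(metric_a, metric_b, human_prefers_a):
    """Fraction of caption pairs where the metric agrees with the human choice.
    A metric tie (a == b) is counted as preferring caption B (an assumption)."""
    correct = sum((a > b) == pref
                  for a, b, pref in zip(metric_a, metric_b, human_prefers_a))
    return correct / len(human_prefers_a)

# Toy data (hypothetical): metric scores versus 5-point human ratings.
tau = kendall_tau_b([0.2, 0.4, 0.4, 0.9], [1, 2, 3, 3])
acc = pairwise_accuracy([0.9, 0.2, 0.5], [0.1, 0.8, 0.4], [True, False, False])
```

For real experiments a library routine such as `scipy.stats.kendalltau` would normally be preferable to this quadratic loop; the hand-written version is shown only to make the pair counting explicit.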

Experimental Results
Can extra & missing information be captured?
To measure the effectiveness of error identification (i.e., extraness and omission), we randomly sampled a subset of the data and manually identified the actual extraness (i.e., $E_C$) and true omission (i.e., $O_C$) of each candidate caption. We conducted validation based on the average cosine similarity between the machine-identified error (i.e., $a_{i\perp}^{C}$ and $g_{i\perp}$) and the true error description. Figure 3 provides two illustrative examples of the validation process. Phrases highlighted in red (e.g., "on a table") are extra information (text in the candidate caption that goes beyond the ground truth). Meanwhile, phrases in green (e.g., "mashed potatoes") are missing from the candidate description but occur in the image and the reference caption. We observe that machine-identified errors are highly similar to the true error information in both cases (≥ 0.65). This result suggests that our method can capture extraness and omission from an image caption.
Do error-aware evaluation metrics help?
The results of metric performance in Table 1 show that, overall, using the three metrics proposed in REO, especially extraness and omission, led to a noticeable improvement in Kendall's tau correlation compared to the best reported results based on prior metrics. Our results suggest that human evaluation tends to be more sensitive to the irrelevance than to the relevance of a candidate caption with regard to the ground truth. We also find that jointly considering both images and human-written references contributes more to caption evaluation than each of the two data sources alone, except for the case of HC pair comparison. This exception can be explained by the phenomenon that human-written descriptions are flexible in terms of word choice and sentence structure, and such diversity may make it challenging to compare a reference to a candidate in cases where both captions were provided by humans. By further looking into each aspect considered in REO, we find that the extraness metric is more appropriate for evaluating machine-generated captions (e.g., the PublicSys dataset and the MM pairs of the Pascal-50S dataset), while the omission metric can be a better choice to assess caption quality when the test data consists of human-written descriptions.

References:
• Some baseball players are playing baseball on a field.
• A processional baseball game with a player getting ready to swing a bat.
• A batter ready for a pitch at a baseball game.
• A baseball player is ready to hit a ball at a game.
• A professional baseball game shot of the batter waiting for a pitch.

Candidate:
• Sys1: A batter catcher and umpire during a baseball game.
• Sys2: A crowd watches as a crowd watches as the crowd watches.

What can we learn from the metric outputs?
To analyze our metric scores in more depth, we compare the outputs of five captioning systems on a set of images in the PublicSys dataset. Figure 4 shows an illustrative example. To make the scales of human grading and REO metrics comparable, we normalize the scores per metric using max-min normalization.
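The max-min normalization step can be sketched as below. The function name and the handling of a constant score list (mapped to all zeros) are assumptions made for this illustration; the text does not describe that edge case.

```python
def min_max_normalize(scores):
    """Rescale a list of scores linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores identical: no spread to rescale (convention chosen here).
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

# Hypothetical scores of five systems on one metric.
normalized = min_max_normalize([2.0, 4.0, 3.0, 2.5, 3.5])
```

Applying this separately to each metric and to the human grades puts them all on a common [0, 1] range, so the per-system profiles can be compared side by side.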
We find that metrics calculated in cases where the ground truth contains both the target image and human references are more likely to identify expression errors. For example, though the phrase "a crowd watches" in the caption of system 2 is relevant to the image, this phrase is repeated three times in the sentence. As a result, the scores for relevance and extraness decrease when the ground truth involves references. Also, metrics focusing only on image content return higher values when the test captions provide a high-level description of the whole image (e.g., the captions of systems 3 and 4) than when they describe a specific image part in detail (e.g., the captions of systems 1 and 5 focus on the baseball player). By comparing the three aspects considered here for each caption, we observe that a caption that mainly focuses on describing a part of the image in detail boosts relevance but achieves a reduced omission score.

Conclusion
This paper presents a fine-grained, error-aware evaluation method, REO, to measure the quality of machine-generated image captions according to three aspects of descriptions: relevance regarding ground truth, extra description beyond image content, and omitted ground truth information. Comparing these metrics to alternative metrics, we find that our proposed solution produces evaluations that are more consistent with the assessment of human judges. Moreover, we find that human judgment tends to penalize extra and missing information (false positives and false negatives) more than it appreciates relevant content. Finally, and to no surprise, we conclude that using a combination of image content and human-written references as ground truth data allows for a more comprehensive evaluation than using either type of information separately. Our method can be extended to evaluate other text generation tasks.

Figure 1: An example of caption evaluation. Given two caption candidates, even though Caption A covers more image information than Caption B (e.g., Caption B is missing "kids"), Caption A contains extra irrelevant description (highlighted in red). Prior metrics (e.g., BLEU4) only provide an overall quality score, from which it is difficult to infer specific description mistakes in a caption. In contrast, REO provides three indicators (i.e., relevance, extraness, and omission) that enable a fine-grained assessment of each caption.

Figure 3: Examples of validating error identification. Text in red is extra information, while text in green is missing information. sim(x, y) is the average similarity between machine-identified and true error vectors over image regions.

• Sys3: A baseball game in progress with a crowd watching.
• Sys4: A group of people watching a baseball game.
• Sys5: A baseball player swinging a bat at a ball.

Figure 4: Case study of REO metric scores. Candidates are highlighted in: 1) green: a detailed but incomplete caption, 2) red: repetition, 3) yellow: high-level description, and 4) blue: extra information not shown in the image.

Table 1: Caption-level correlation between metrics and human grading scores on the Composite and PublicSys datasets, measured by Kendall's tau (τ). All p-values < 0.01. For PASCAL-50S, we display the accuracy of metrics at matching human judgments with 5 reference captions per image. The highest value per column is in bold font. Column titles are explained in Section 3.1. Ground truth refers to two sources of information: human-written references and images.