ViLBERTScore: Evaluating Image Caption Using Vision-and-Language BERT

In this paper, we propose an evaluation metric for image captioning systems that uses both image and text information. Unlike previous methods that rely only on textual representations when evaluating a caption, our approach uses visiolinguistic representations. The proposed method generates image-conditioned embeddings for each token of both the generated and the reference text using ViLBERT. These contextual embeddings of the two captions are then compared to compute the similarity score. Experimental results on three benchmark datasets show that our method correlates significantly better with human judgments than all existing metrics.


Introduction
Image captioning is a task that aims to generate text describing a given image. While there have been many advances in caption generation algorithms (Vinyals et al., 2015; Anderson et al., 2018) and target datasets (Fang et al., 2015; Sharma et al., 2018), few studies have focused on assessing the quality of the generated captions while taking the image into account.
Most of the previous studies on evaluating image captioning rely on n-gram similarity metrics such as BLEU (Papineni et al., 2002) or CIDEr (Vedantam et al., 2015). These approaches have difficulty dealing with the diverse nature of text, a limitation also observed in other text generation tasks (e.g., abstractive summarization and dialogue) (Kryscinski et al., 2019; Liu et al., 2016). To alleviate the issues in the n-gram based approaches, researchers proposed word embedding-based techniques (Kusner et al., 2015; Zhao et al., 2019; Lo, 2019; Clark et al., 2019). These techniques show robust performance and achieve higher correlation with human judgment than previous metrics in many text generation tasks, including image captioning. In particular, BERTScore shows that using contextualized embeddings is effective for evaluating text. However, BERTScore does not utilize the image content, and how to effectively incorporate the image into caption evaluation remains an open question. To reflect image context while retaining the advantages of BERTScore, we propose ViLBERTScore, which employs ViLBERT (Lu et al., 2019), a task-agnostic pre-trained visiolinguistic representation. Like BERTScore, ViLBERTScore computes cosine similarity between token embeddings of the reference and candidate sentences; unlike BERTScore, however, each token embedding is computed with the image context taken into account.
We evaluate our proposed method on three benchmark datasets (i.e., Composite, Flickr8k, and PASCAL-50S). Extensive experiments show that ViLBERTScore achieves a significantly higher correlation with human judgments than previous metrics. This result demonstrates that the use of contextualized embeddings from vision and language is effective in evaluating image captioning tasks.

Figure 2: Overall computation of ViLBERTScore. Given the image $I$, the reference caption $X$ and the candidate caption $\hat{X}$, we compute contextual embeddings with ViLBERT for $X$ and $\hat{X}$, respectively. Then, we extract the text embeddings $H_X^V$ and $H_{\hat{X}}^V$ from the output embeddings. Finally, we compute the pairwise cosine similarity between $H_X^V$ and $H_{\hat{X}}^V$.
Related Work

Caption Evaluation
We provide a summary of widely used metrics for evaluating image captions, covering n-gram similarity metrics, embedding-based metrics, and other task-specific metrics for captioning.

N-gram Similarity Metrics
The most widely used metrics for evaluating text generation tasks are n-gram similarity metrics, which count exact n-gram matches between the reference and the generated text. One example is BLEU (Papineni et al., 2002), which computes the precision of overlapping n-grams between reference and candidate. ROUGE (Lin, 2004) is a set of metrics commonly used for text summarization.
In particular, ROUGE-L, the longest-common-subsequence based metric, is the most frequently used variant of ROUGE. CIDEr (Vedantam et al., 2015), which was proposed for evaluating image captions, computes the tf-idf weighted n-gram similarity between reference and candidate.
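As a small illustration of this limitation (our example, not from the paper), the following Python snippet uses NLTK's BLEU implementation: a caption that replaces words with synonyms receives a much lower score even though its meaning is preserved.

# Illustrative only: BLEU gives no credit for synonyms ("puppy" vs. "dog").
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference  = [["a", "dog", "runs", "on", "the", "grass"]]
exact_copy =  ["a", "dog", "runs", "on", "the", "grass"]
paraphrase =  ["a", "puppy", "runs", "on", "the", "lawn"]

smooth = SmoothingFunction().method1
print(sentence_bleu(reference, exact_copy, smoothing_function=smooth))  # 1.0
print(sentence_bleu(reference, paraphrase, smoothing_function=smooth))  # far lower, despite similar meaning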

Embedding Based Metrics

The n-gram similarity metrics possess a critical limitation: they cannot count synonym matches, even though synonyms appear widely in generated text. To overcome this weakness, embedding-based metrics such as Word Mover's Distance (WMD) (Kusner et al., 2015) and BERTScore have been proposed.
WMD computes the minimum transportation distance between tokens using pre-trained word embeddings (i.e., GloVe (Pennington et al., 2014)). BERTScore, on the other hand, computes cosine similarity between tokens using contextual embeddings from BERT (Devlin et al., 2019). Jiang et al. (2019) propose a metric that uses the output of a visual grounding task, and BERT-TBR (Yi et al., 2020) focuses on the variance of the captions, combining multiple reference captions to obtain an improved BERTScore.
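To make the matching scheme concrete, the sketch below (our simplification, not the authors' code) computes BERTScore-style precision, recall, and F1 by greedy matching over two matrices of token embeddings; ViLBERTScore later applies the same scheme to image-conditioned embeddings.

# Simplified sketch of greedy matching over contextual token embeddings.
import torch
import torch.nn.functional as F

def greedy_match(ref_emb: torch.Tensor, cand_emb: torch.Tensor):
    # ref_emb: (n, d) reference token embeddings; cand_emb: (m, d) candidate token embeddings.
    ref = F.normalize(ref_emb, dim=-1)
    cand = F.normalize(cand_emb, dim=-1)
    sim = ref @ cand.T                        # (n, m) pairwise cosine similarities
    recall = sim.max(dim=1).values.mean()     # best match for each reference token
    precision = sim.max(dim=0).values.mean()  # best match for each candidate token
    f1 = 2 * precision * recall / (precision + recall)
    return precision.item(), recall.item(), f1.item()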

ViLBERT
To compute contextual representations from visually grounded text, researchers have proposed transformer-based models. One such example is ViLBERT (Lu et al., 2019), a task-agnostic pre-trained representation for vision and language. As shown in Fig. 1, ViLBERT employs a two-stream transformer-based (Vaswani et al., 2017) architecture, in which each stream processes visual and textual inputs, respectively. Specifically, the image and grounded-text inputs are fed into separate embedding layers, followed by co-attentional transformer blocks that allow interaction between the two modalities. ViLBERT is pre-trained with two training objectives: masked multi-modal modeling and multi-modal alignment. Lu et al. (2019) show that fine-tuning the pre-trained ViLBERT on vision-and-language downstream tasks (e.g., visual question answering (Antol et al., 2015)) significantly outperforms previous approaches. Recently, Lu et al. (2020) showed that training ViLBERT with multi-task learning objectives provides further performance improvements on most vision-and-language tasks.
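The following PyTorch sketch (our simplification; the real ViLBERT block also includes intra-modal self-attention, residual connections, layer normalization, and feed-forward sublayers) illustrates the co-attention idea: queries come from one modality while keys and values come from the other.

# Minimal co-attention sketch: each stream attends to the other modality.
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        self.text_attends_image = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.image_attends_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text: torch.Tensor, image: torch.Tensor):
        # text: (B, T, d) token embeddings; image: (B, N, d) region features.
        text_out, _ = self.text_attends_image(query=text, key=image, value=image)
        image_out, _ = self.image_attends_text(query=image, key=text, value=text)
        return text_out, image_out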

ViLBERTScore
We propose ViLBERTScore, a metric that utilizes visually-grounded representations for each token.
The overall flow of our proposed ViLBERTScore is described in Fig. 2. Similar to BERTScore, we first compute contextual embeddings of both the reference caption $X = (x_1, \dots, x_n)$ and the candidate caption $\hat{X} = (\hat{x}_1, \dots, \hat{x}_m)$. Since we use ViLBERT, we compute the embeddings for each caption conditioned on the target image $I$. For the target image, we extract $N$ region-level features $V = (v_1, \dots, v_N)$ using a pre-trained object detection model (see the Implementation Details section). Then, we feed each pair of caption and image features, $(X, V)$ and $(\hat{X}, V)$, to the pre-trained ViLBERT and compute the contextual embeddings $(H_V^X, H_X^V)$ and $(H_V^{\hat{X}}, H_{\hat{X}}^V)$. Note that $H_V$ and $H_X$ are image and text embeddings, respectively. Among these embeddings, we only utilize the text embeddings, $H_X^V = (h_{w_0}, \dots, h_{w_T})$ and $H_{\hat{X}}^V = (\hat{h}_{w_0}, \dots, \hat{h}_{w_T})$, and compute the cosine similarity between pairs of tokens from the candidate and reference captions. Finally, greedy matching is applied to these token pairs to find the most similar token match between the two sentences. ViLBERTScore can be formulated as follows:

\[
R_{\mathrm{ViLBERT}} = \frac{1}{|X|} \sum_{x_i \in X} \max_{\hat{x}_j \in \hat{X}} \cos\!\left(h_{x_i}, \hat{h}_{\hat{x}_j}\right), \qquad
P_{\mathrm{ViLBERT}} = \frac{1}{|\hat{X}|} \sum_{\hat{x}_j \in \hat{X}} \max_{x_i \in X} \cos\!\left(h_{x_i}, \hat{h}_{\hat{x}_j}\right), \qquad
F_{\mathrm{ViLBERT}} = 2\,\frac{P_{\mathrm{ViLBERT}} \cdot R_{\mathrm{ViLBERT}}}{P_{\mathrm{ViLBERT}} + R_{\mathrm{ViLBERT}}}
\]

Table: ViLBERTScore* uses the ViLBERT model from (Lu et al., 2020), which is fine-tuned on 12 downstream tasks. Scores with † are cited from (Yi et al., 2020).


Datasets

Composite The Composite dataset provides human judgment scores for each candidate caption and image pair. The images in this dataset are from Flickr8k (Hodosh et al., 2013), Flickr30k (Plummer et al., 2017), and COCO captions (Lin et al., 2014). The human judgment scores range from 1 to 5, depending on the relevance between the candidate caption and the image.
Flickr8k The Flickr8k dataset is composed of 8,092 images, each with five human-generated captions. This dataset also provides three expert annotations for each image and candidate caption pair on 5,822 images. The score ranges from 1 to 4, depending on how well the caption and the image match.

Figure 3: Kendall correlation with human judgments across different layers for ViLBERTScore$_R$ and ViLBERTScore*$_R$. C and F are the results for the Composite and Flickr8k datasets, respectively. Note that ViLBERTScore* uses the fine-tuned ViLBERT model from (Lu et al., 2020).

PASCAL-50S The PASCAL-50S dataset provides captions generated by humans for each image. Different from the other datasets, it provides 4,000 caption triplets <A, B, C>, each composed of 50 reference captions (A) and two candidate captions (B, C) for a given image. Human annotators judged which of "B" or "C" is more similar to "A". The candidate captions are either human-written or model-generated.

Implementation Details
We use two versions of ViLBERT: the pre-trained ViLBERT model from (Lu et al., 2019), and the version from (Lu et al., 2020) that is further fine-tuned on 12 downstream tasks. We set N = 100 boxes for each image, extracted with an object detection model (He et al., 2017), to compute the contextual embeddings as in (Lu et al., 2019). We use the textual representations from the sixth layer, the last co-attention layer of ViLBERT, for the main results in Table 1 and Table 2. For datasets containing multiple reference captions, we average the score over all pairs of the candidate caption and the reference captions.
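A hedged sketch of the multi-reference aggregation described above; vilbert_score stands for a hypothetical function returning a scalar score for one (image, reference, candidate) triple and is not part of any released API.

# Averaging over reference captions; `vilbert_score` is a hypothetical helper.
def multi_reference_score(image_features, references, candidate, vilbert_score):
    scores = [vilbert_score(image_features, ref, candidate) for ref in references]
    return sum(scores) / len(scores)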

Evaluation Methods
We compute Kendall's correlation coefficient with human judgments for the Composite and Flickr8k datasets. For the PASCAL-50S dataset, we count, for each candidate caption pair, whether the metric's preference matches the human judgment.
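The two evaluation protocols can be summarized with the short sketch below (ours, using SciPy's Kendall tau; the exact tie-handling variant used in the paper is an assumption on our part).

# Evaluation protocol sketch: rank correlation and pairwise match accuracy.
from scipy.stats import kendalltau

def correlation_with_humans(metric_scores, human_scores):
    tau, _ = kendalltau(metric_scores, human_scores)
    return tau

def pascal50s_match_rate(scores_b, scores_c, humans_prefer_b):
    # A "match" means the metric prefers the same caption of the pair as the human annotators.
    matches = [(sb > sc) == pref for sb, sc, pref in zip(scores_b, scores_c, humans_prefer_b)]
    return sum(matches) / len(matches)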

Performance Comparison
We present the correlation scores of the baseline metrics and our proposed ViLBERTScore on the Composite and Flickr8k datasets in Table 1. ViLBERTScore shows a higher correlation than all existing metrics. For the PASCAL-50S dataset, Table 2 shows that ViLBERTScore$_R$ is the best metric for comparing captions among all of the metrics. Interestingly, we observe that the performance of ViLBERTScore$_P$ is lower than that of ViLBERTScore$_R$ on PASCAL-50S, which is consistent with previously reported behavior. We speculate that the words describing the main objects in the image are the most critical for human judgments.
We further explore the performance of ViLBERTScore with a different base model. We choose the ViLBERT model that is fine-tuned on 12 vision-and-language tasks (see ViLBERTScore* in Tables 1 and 2). This model shows better results than ViLBERTScore. We attribute this to the fact that some of the fine-tuning tasks, such as image retrieval and visual entailment (Xie et al., 2019), are closely related to caption evaluation.
Correlation Across Layers The co-attentional block in ViLBERT is composed of six layers. To verify the contribution of each layer to the contextualized embeddings, we compute ViLBERTScore using the outputs of each layer. As shown in Fig. 3, the outputs of higher layers correlate better with human judgments than those of lower layers, except for the last layer. This observation suggests that blending information across the two modalities is essential for computing better contextual representations. We attribute the drop in correlation at the last layer to its more task-specific properties.

Conclusion
In this paper, we propose ViLBERTScore, a metric for the image captioning task that uses pre-trained visiolinguistic representations. Different from BERTScore, ViLBERTScore utilizes image-conditioned embeddings for each token, which is critical when evaluating a task that combines vision and language. Empirical results on the Composite, Flickr8k, and PASCAL-50S datasets show that the proposed ViLBERTScore correlates better with human judgments than all previous metrics.