Towards Understanding Sample Variance in Visually Grounded Language Generation: Evaluations and Observations

A major challenge in visually grounded language generation is to build robust benchmark datasets and models that can generalize well in real-world settings. To do this, it is critical to ensure that our evaluation protocols are correct and our benchmarks are reliable. In this work, we design a set of experiments to understand an important but often ignored problem in visually grounded language generation: given that humans have different utilities and visual attention, how does the sample variance in multi-reference datasets affect models' performance? Empirically, we study several multi-reference datasets and the corresponding vision-and-language tasks. We show that it is of paramount importance to report variance in experiments; that human-generated references can vary drastically across datasets and tasks, revealing the nature of each task; and that, metric-wise, CIDEr shows systematically larger variance than other metrics. Our evaluations of references-per-instance shed light on the design of reliable datasets in the future.


Introduction
Natural Language Generation (NLG) is a challenging problem in Natural Language Processing (NLP): the complex nature of NLG tasks arises particularly in the output space. In contrast to text classification or regression problems with a finite output space, generation can be seen as a combinatorial optimization problem, where we often have exponentially many options |V|^T (here |V| is the size of the vocabulary and T is the sentence length). With the advances of both Computer Vision and NLP techniques in deep learning, there has been growing interest in visually grounded NLG tasks, such as image captioning (Hodosh et al., 2013; Young et al., 2014; Lin et al., 2014; Vedantam et al., 2015), video captioning (Xu et al., 2016; Wang et al., 2019; Chen and Dolan, 2011), and visual storytelling (Huang et al., 2016). For example, Figure 1 shows an example of image captioning from the popular Flickr30k dataset.

1. This group of folks comprising runners and bikers, some wearing identifying numbers, look like they are getting ready for a marathon.
2. A runner in yellow has a convoy of motorcycles following behind him on a highway as bystanders watch.
3. Marathon runners are running down a street with motorcyclists nearby.
4. A runner in the middle of a race running along side the road.
5. A man in a yellow shirt is running in a race.

Figure 1: An image with five parallel captions from the Flickr30k dataset. Words in the same colors refer to the same objects.

In this paper, instead of crunching numbers and modifying model architectures to achieve new "state-of-the-art" results on leaderboards, we focus on re-assessing the current practices in visually grounded language generation research, including problems, datasets, evaluations, and tasks, from the sample variance angle. Given the differences in annotators' utility functions and human visual attention, what can the sample variance in captions teach us about building robust and reliable visually grounded language generation agents?
More specifically, we empirically investigate the variance among the multiple parallel references in different datasets, and its effect on the training performance and evaluation results of the corresponding tasks. We further study the number of references per visual instance, and how it affects training and testing performance. A simple search in the ACL Anthology and the CVF Open Access site shows that 58 out of 60 papers on vision-based text generation do not report variance in experimental results, even though they often claim that their methods outperform the previous state of the art. Our evaluation suggests that the variance cannot be ignored and must be reported, and that CIDEr (Vedantam et al., 2015) shows higher variance than other metrics. Finally, introducing more training visual instances in the image and video captioning tasks on MS COCO and VATEX results in better performance on automatic metrics, while the visual storytelling task on VIST favors more references in the training set. For future dataset collection, we recommend including more references when each reference is distinctive and complicated.

Research Questions and Settings
To understand sample variance, we conduct a series of experiments on multiple visually grounded NLG datasets, aiming to answer the following questions:

1. How different are the text references from their parallel pairs?

2. How greatly do different selections of references during either training or testing affect the final evaluation results?

3. To train a more reliable model, shall we collect more visual instances with limited references, or more parallel references for each instance, given a fixed budget?
We focus on multi-reference visually grounded NLG tasks, where each visual instance is paired with multiple parallel text references. Below we describe the datasets we investigate, the models used for training, and the metrics for evaluation.

Models
We apply an implementation 1 of Xu et al. (2015) for image captioning. We implement the Enc-Dec baseline model proposed by Wang et al. (2019) for video captioning. For visual storytelling, we use the AREL model 2 proposed by Wang et al. (2018).
Metrics

We utilize six automatic metrics for natural language generation to evaluate the quality of the generated text: BLEU (Papineni et al., 2002), ROUGE-L (Lin, 2004), METEOR (Elliott and Keller, 2013), CIDEr (Vedantam et al., 2015), SPICE (Anderson et al., 2016), and the recent BERTScore (Zhang* et al., 2020), which is based on the pretrained BERT model. We use nlg-eval 3 (Sharma et al., 2017) to compute BLEU, METEOR, ROUGE-L, and CIDEr. Note that we applied a patch 4 and use the IDF from the MSCOCO validation dataset when calculating the consensus CIDEr score for each dataset. We use the authors' releases of SPICE 5 and BERTScore 6 . BERTScore has been rescaled with baseline scores.

Reference Variance within Datasets
In this section, we examine the sample variance among text references within seven visually grounded NLG datasets. To quantify the sample variance, we define a consensus score c among the n parallel references R = {r_1, ..., r_n} (where r_i is the i-th text reference) for each visual instance:

    c = (1/n) * sum_{i=1}^{n} metric(r_i, R \ {r_i}),

where metric can be any metric in the above section, with r_i treated as the hypothesis and the remaining n - 1 references as the ground-truth set. The consensus score represents the agreement among the parallel references for the same visual instance. Since the number of parallel references varies across datasets, we randomly sample 5 parallel references per instance (the minimum n among all datasets used) for a fair comparison. For datasets with more than 5 parallel references per instance, we repeat the sampling 10 times and take the average.
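As a concrete sketch, the leave-one-out consensus computation might look like the following. A toy unigram-F1 stands in for the actual metric (BLEU, CIDEr, etc.); the function names and the stand-in metric are illustrative, not the paper's implementation:

```python
def unigram_f1(hypothesis, references):
    """Toy stand-in metric: best unigram F1 of the hypothesis
    against any single reference."""
    hyp = set(hypothesis.lower().split())
    best = 0.0
    for ref in references:
        ref_set = set(ref.lower().split())
        overlap = len(hyp & ref_set)
        if overlap == 0:
            continue
        p, r = overlap / len(hyp), overlap / len(ref_set)
        best = max(best, 2 * p * r / (p + r))
    return best

def consensus(references, metric=unigram_f1):
    """Leave-one-out consensus: score each reference against the
    other n-1 references, then average over all n."""
    n = len(references)
    return sum(metric(references[i], references[:i] + references[i + 1:])
               for i in range(n)) / n

refs = [
    "A man riding an elephant in a river.",
    "A man rides an elephant into a river.",
    "Man riding an elephant into water surrounded by forest.",
]
print(round(consensus(refs), 3))
```

Swapping in a real metric (e.g., sentence-level CIDEr from nlg-eval) recovers the consensus scores reported in Table 3.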

Reference                                                      CIDEr
A man riding an elephant in a river.                             225
A man in a brown shirt rides an elephant into the water.         227
A man rides an elephant into a river.                            266
A man riding an elephant into some water of a creek.             271
Man riding an elephant into water surrounded by forest.          277

There are many taxi cabs on the road                               4
Heavy city traffic all going in one direction                     26
Many cars stuck in traffic on a high way                          28
This shot is of a crowded highway full of traffic                 28
A city street with lots of traffic and lined with buildings       35

Table 3: Two groups of references from the MSCOCO dataset and the CIDEr score for each reference within its group. The consensus CIDEr scores for the two groups are 253.2 and 24.2, respectively.

Table 2 shows the evaluation results. Noticeably, datasets for the same task have similar consensus BERTScore, which is embedding-based (Kilickaya et al., 2017). Image captioning datasets score the highest on consensus BERTScore, video captioning datasets rank second, and VIST for visual storytelling has the lowest consensus BERTScore. This descending order of consensus BERTScore coincides with task difficulty: video captioning is more complicated than image captioning due to its dynamic nature, and visual storytelling is even more challenging, with diverse and sophisticated stories in creative writing. Having the lowest consensus scores on all metrics indicates that VIST is a very challenging dataset. Moreover, we notice that CIDEr has the largest standard deviation (both absolutely and relatively) on consensus scores for all datasets. This suggests that CIDEr might be unstable and sensitive to the selection of references. Table 3 takes a closer look at the high variance of the consensus CIDEr score. By definition, the CIDEr score computes the cosine similarity between Term Frequency-Inverse Document Frequency (TF-IDF) (Robertson, 2004) weighted n-grams.
The reasons the consensus CIDEr score has a high standard deviation are threefold: (1) N-grams with similar meanings might have totally different TF-IDF weights; therefore, the CIDEr score is sensitive to word selection and sentence structure. (2) Token frequency differs across datasets. The consensus CIDEr score in Table 2 is calculated at the sentence level. We follow previous work and use the IDF from the MSCOCO validation set for reliable results. In the MSCOCO validation set, 'man', 'elephant', and 'river' have more exposure, while 'traffic' and 'highway' are mentioned less. As a result, the first group of references in Table 3 has a much higher consensus CIDEr score than the second. (3) Different from other metrics, which scale from 0 to 1, the CIDEr score scales from 0 to 10. The enlarged scale also contributes to its salient variance.
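To make points (1) and (2) concrete, here is a minimal sketch of the TF-IDF weighted n-gram cosine similarity at the core of CIDEr, restricted to unigrams and omitting CIDEr's stemming, length penalty, and 0-10 scaling; the tiny corpus and all names are made up for illustration:

```python
import math
from collections import Counter

def tfidf_vector(sentence, doc_freq, num_docs):
    """TF-IDF weights for the unigrams of one sentence (smoothed IDF)."""
    counts = Counter(sentence.lower().split())
    total = sum(counts.values())
    return {w: (c / total) * math.log((1 + num_docs) / (1 + doc_freq.get(w, 0)))
            for w, c in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse weight dictionaries."""
    dot = sum(wt * v.get(w, 0.0) for w, wt in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Illustrative corpus for document frequencies: frequent words like
# 'man' get low IDF weight, while rare words like 'traffic' get high
# weight -- so reference groups about rare topics swing the score more.
corpus = ["a man rides an elephant", "a man in a river",
          "a man on a horse", "heavy traffic on a highway"]
doc_freq = Counter(w for doc in corpus for w in set(doc.split()))

u = tfidf_vector("a man rides an elephant", doc_freq, len(corpus))
v = tfidf_vector("a man riding an elephant", doc_freq, len(corpus))
print(round(cosine(u, v), 3))
```

Note how 'rides' vs. 'riding' are treated as unrelated n-grams with their own TF-IDF weights, illustrating why near-synonymous references can still score very differently.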

Effect of Testing Sample Variance
Previous studies on automatic metrics (Vedantam et al., 2015; Anderson et al., 2016) show that more testing references lead to better evaluation accuracy.
Here we aim to examine the effect of using different references for testing. Given n references per visual instance, we incrementally set the testing references-per-instance (RPI) to 1, 2, ..., n − 1, and randomly sample the testing references from all n references. For each RPI, the random sampling and evaluation process is repeated 20 times. The model is trained on the complete training set.
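This evaluation protocol can be sketched as follows; the toy overlap metric, the single-instance data, and all names are illustrative stand-ins for the actual models, metrics, and datasets:

```python
import random
import statistics

def overlap_score(hypothesis, references):
    """Toy stand-in metric: best token-overlap ratio against any reference."""
    hyp = set(hypothesis.split())
    return max(len(hyp & set(r.split())) / len(set(r.split()))
               for r in references)

def score_std_at_rpi(instances, metric, rpi, trials=20, seed=0):
    """Repeatedly sample `rpi` references per instance, compute each
    trial's corpus-average score, and return the standard deviation
    across trials. `instances` maps a generated hypothesis to its
    full list of parallel references."""
    rng = random.Random(seed)
    trial_scores = []
    for _ in range(trials):
        scores = [metric(hyp, rng.sample(refs, rpi))
                  for hyp, refs in instances.items()]
        trial_scores.append(sum(scores) / len(scores))
    return statistics.stdev(trial_scores)

instances = {
    "a man running a race": [
        "a man in a yellow shirt is running in a race",
        "marathon runners running down a street",
        "a runner in the middle of a race",
        "runners getting ready for a marathon",
    ],
}
for rpi in (1, 2, 3, 4):
    print(rpi, round(score_std_at_rpi(instances, overlap_score, rpi), 4))
```

When rpi equals the full reference count, every trial sees the same reference set and the deviation collapses to zero, mirroring the shrinking curves in Figure 2.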
In Figure 2, we demonstrate the experiments on PASCAL-50s for image captioning and VATEX-en for video captioning, where the standard deviation of evaluation scores on those metrics is plotted against RPI. For all metrics, the standard deviation shrinks as more references are employed for testing, indicating that the evaluation bias caused by sample variance may be mitigated by introducing more parallel references. However, most existing datasets have far fewer than 50 references. For example, according to Wang et al. (2019), 12 out of 15 datasets for video captioning have fewer than 3 parallel text references per video, yet the variance on those metrics under 3 RPI is very high. This casts doubt on the reliability of reported model performance. For fairer model comparison, we hereby encourage researchers to (1) provide the evaluation set with more parallel references when collecting new datasets, and (2) report the variance of the model's metric scores when comparing to other models. Noticeably, the variance of the model's performance on CIDEr is significantly larger than on other metrics, which supplements the previous finding in Section 3 that CIDEr is very sensitive to reference sample variance.

Effect of Training Sample Variance
To investigate the effect of training sample variance, we train the models with different training RPI, from 1 to n − 1. Similarly, we randomly sample the training references from the n references. For each RPI, we repeat the random sampling and training process 10 times on each dataset. The evaluation is conducted on the complete test set. Figure 3 depicts the performance on BLEU, CIDEr, and BERTScore for each dataset when the corresponding model is trained with different RPI. While the performance on all datasets improves as the training RPI increases, the experimental results show salient variance on all metric scores when the amount of training data is insufficient, which indicates that the selection of training samples influences the final performance. Furthermore, VIST displays notable score deviation on all three metrics, which suggests that visual storytelling is sensitive to the selection of training data.

Figure 4: Performance when trained with varying training RPI on a fixed total number of visual-text sample pairs. Results on the captioning datasets COCO and VATEX-en are in favor of more visual diversity, while the visual storytelling model benefits more from more parallel text references.

How many parallel references do we need to train a reliable model for visually grounded text generation? Here we study the balance between the number of visual instances and the number of parallel text references in the datasets, and how these two factors affect the training performance for each task.
For each task, we fix the total number of training samples (i.e., unique visual-reference pairs) and set the training RPI to 1, 2, ..., n, so that #samples = #visual instances × RPI. More specifically, we train the image captioning model on MS COCO with 82,740 samples, and use 25,200 and 7,980 samples for the video captioning and visual storytelling tasks, respectively. For each RPI, we repeat the random sampling and training process 10 times on each dataset. Figure 4 illustrates the evaluation results for each task. As the training RPI increases, the performance of the image captioning and video captioning models declines on all four metrics, while visual storytelling performance improves. This suggests that introducing more visual instances during training is beneficial for the captioning tasks, where the parallel references are all objective descriptions of the same visual instance. In contrast, the stories in VIST are more expressive and may refer to imaginary contents (Wang et al., 2018), leading to a much larger search space during generation. In this case, introducing more parallel references into training may help train a more stable and better-performing storytelling model.
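The fixed-budget construction of these training sets might look like the following sketch; the dataset, budget, and names are illustrative, not the paper's actual splits:

```python
import random

def fixed_budget_split(dataset, budget, rpi, seed=0):
    """Build a training set of `budget` visual-text pairs with `rpi`
    references per instance: sample budget // rpi visual instances,
    then `rpi` references from each, so that
    #samples = #visual instances * RPI.
    `dataset` maps an instance id to its list of references."""
    rng = random.Random(seed)
    n_instances = budget // rpi
    chosen = rng.sample(sorted(dataset), n_instances)
    return [(vid, ref)
            for vid in chosen
            for ref in rng.sample(dataset[vid], rpi)]

# Toy dataset: 100 visual instances with 5 references each.
data = {f"img{i}": [f"caption {j} of img{i}" for j in range(5)]
        for i in range(100)}
pairs = fixed_budget_split(data, budget=60, rpi=3)
print(len(pairs))  # 60 pairs: 20 instances x 3 references each
```

Raising rpi under a fixed budget trades visual diversity (fewer instances) for textual diversity (more references per instance), which is exactly the axis varied in Figure 4.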

Conclusion
We study sample variance in visually grounded language generation, in terms of reference sample variance within datasets, the effects of training and testing sample variance on metric scores, and the trade-off between the number of visual instances and the number of parallel references per visual instance. Along with some intriguing findings, we urge researchers to report sample variance in addition to metric scores when comparing models' performance. We also recommend that, when collecting a new dataset, the test set should include more parallel references for fair evaluation. For the training set, when the generated text is expected to be distinctive and complicated, more parallel references should be collected; otherwise, a larger variety of visual instances is more favorable.