A Survey of Current Datasets for Vision and Language Research

Integrating vision and language has long been a dream in work on artificial intelligence (AI). In the past two years, we have witnessed an explosion of work that brings together vision and language from images to videos and beyond. The available corpora have played a crucial role in advancing this area of research. In this paper, we propose a set of quality metrics for evaluating and analyzing vision & language datasets and categorize them accordingly. Our analyses show that the most recent datasets use more complex language and more abstract concepts; however, each has different strengths and weaknesses.


Introduction
Bringing together language and vision in one intelligent system has long been an ambition in AI research, beginning with SHRDLU as one of the first vision-language integration systems (Winograd, 1972) and continuing with more recent attempts at conversational robots grounded in the visual world (Kollar et al., 2013; Cantrell et al., 2010; Matuszek et al., 2012; Kruijff et al., 2007; Roy et al., 2003). In the past few years, an influx of new, large vision & language corpora, alongside dramatic advances in vision research, has sparked renewed interest in connecting vision and language. Vision & language corpora now provide alignments between visual content that can be recognized with Computer Vision (CV) algorithms and language that can be understood and generated using Natural Language Processing techniques.
Fueled in part by the newly emerging data, research that blends techniques in vision and in language has increased at an incredible rate. In just the past year, recent work has proposed methods for image and video captioning (Fang et al., 2014; Donahue et al., 2014; Venugopalan et al., 2015), summarization (Kim et al., 2015), reference (Kazemzadeh et al., 2014), and question answering (Antol et al., 2015; Gao et al., 2015), to name just a few. The newly crafted large-scale vision & language datasets have played a crucial role in defining this research, serving as a foundation for training/testing and helping to set benchmarks for measuring system performance.
Crowdsourcing and large image collections such as those provided by Flickr have made it possible for researchers to propose methods for vision and language tasks alongside an accompanying dataset. However, as more and more datasets have emerged in this space, it has become unclear how different methods generalize beyond the datasets they are evaluated on, and what data may be useful for moving the field beyond a single task, towards solving larger AI problems.
In this paper, we take a step back to document this moment in time, making a record of the major available corpora that are driving the field. We provide a quantitative analysis of each of these corpora in order to understand the characteristics of each, and how they compare to one another. The quality of a dataset must be measured and compared to related datasets, as low quality data may distort an entire subfield. We propose a set of criteria for analyzing, evaluating and comparing the quality of vision & language datasets against each other. Knowing the details of a dataset compared to similar datasets allows researchers to define more precisely what task(s) they are trying to solve, and select the dataset(s) best suited to their goals, while being aware of the implications and biases the datasets could impose on a task.
We categorize the available datasets into three major classes and evaluate them against these criteria. The datasets we present here were chosen because they are all available to the community and cover the data that has been created to support the recent focus on image captioning work. More importantly, we provide an evolving website containing pointers and references to many more vision-to-language datasets, which we believe will be valuable in unifying the quickly expanding research tasks in language and vision.

Quality Criteria for Language & Vision Datasets
The quality of a dataset is highly dependent on the sampling and scraping techniques used early in the data collection process. However, the content of datasets can play a major role in narrowing the focus of the field. Datasets are affected by reporting bias (Gordon and Durme, 2013), where the frequency with which people write about actions, events, or states does not directly reflect real-world frequencies of those phenomena, and by photographer's bias (Torralba and Efros, 2011), where photographs are somewhat predictable within a given domain. This suggests that new datasets may be useful towards the larger AI goal if provided alongside a set of quantitative metrics that show how they compare against similar corpora, as well as more general "background" corpora. Such metrics can be used as indicators of dataset bias and language richness. At a higher level, we argue that clearly defined metrics are necessary to provide quantitative measurements of how a new dataset compares to previous work. This helps clarify and benchmark how research is progressing towards a broader AI goal as more and more data comes into play.
In this section, we propose a set of such metrics that characterize vision & language datasets. We focus on methods to measure language quality that can be used across several corpora. We also briefly examine metrics for vision quality. We evaluate several recent datasets based on all proposed metrics in Section 4, with results reported in Tables 1 and 2 and Figure 1.

Language Quality
We define the following criteria for evaluating the captions or instructions of the datasets:
• Vocabulary Size (#vocab), the number of unique vocabulary words.
• Syntactic Complexity (Frazier, Yngve) measures the amount of embedding/branching in a sentence's syntax. We report mean Yngve (Yngve, 1960) and Frazier (Frazier, 1985) measurements; each counts the nodes in the phrase markers of syntactic trees in a different way.
• Part of Speech Distribution measures the distribution of nouns, verbs, adjectives, and other parts of speech.
• Abstract:Concrete Ratio (#Conc, #Abs, %Abs) indicates the range of visual and non-visual concepts the dataset covers. Abstract terms are ideas or concepts, such as 'love' or 'think', and concrete terms are all the objects or events that are mainly available to the senses. For this purpose, we use a list of the most common abstract terms in English (Vanderwende et al., 2015), and define concrete terms as all other words except for a small set of function words.
• Average Sentence Length (Sent Len.) shows how rich and descriptive the sentences are.
• Perplexity provides a measure of data skew by measuring how expected the sentences of one corpus are according to a model trained on another corpus. We analyze perplexity (Ppl) for each dataset against a 5-gram language model learned on a generic 30B-word English dataset. We further analyze pair-wise perplexity of datasets against each other in Section 4.
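The simpler count-based criteria above (#vocab, Sent Len., #Abs, #Conc, %Abs) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used to produce Table 1: the tokenizer is a crude regular expression, `ABSTRACT_TERMS` and `FUNCTION_WORDS` are tiny hypothetical stand-ins for the abstract-term list and function-word list mentioned above, and counting abstract terms over tokens rather than word types is our assumption.

```python
import re
from statistics import mean

# Hypothetical stand-ins for the abstract-term and function-word lists;
# real lists would be far larger.
ABSTRACT_TERMS = {"love", "think", "feel", "idea"}
FUNCTION_WORDS = {"a", "an", "the", "of", "in", "on", "is", "are", "and", "to", "i"}


def tokenize(sentence):
    """Lowercase the sentence and keep runs of letters/apostrophes (crude tokenizer)."""
    return re.findall(r"[a-z']+", sentence.lower())


def language_stats(sentences):
    """Compute vocabulary size, mean sentence length, and abstract/concrete counts."""
    tokenized = [tokenize(s) for s in sentences]
    tokens = [t for sent in tokenized for t in sent]
    # Concrete terms are defined as everything that is neither abstract
    # nor a function word.
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    n_abs = sum(t in ABSTRACT_TERMS for t in content)
    n_conc = len(content) - n_abs
    return {
        "vocab": len(set(tokens)),
        "sent_len": mean(len(s) for s in tokenized),
        "n_abs": n_abs,
        "n_conc": n_conc,
        "pct_abs": 100.0 * n_abs / len(content) if content else 0.0,
    }


if __name__ == "__main__":
    captions = ["A dog runs in the park", "I think the dog is happy"]
    print(language_stats(captions))
```

On the two toy captions, the sketch reports 10 vocabulary words, a mean sentence length of 6 tokens, and one abstract token ('think') out of six content tokens.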

Vision Quality
Our focus in this survey is mainly on language; however, the characteristics of images or videos and their corresponding annotations are equally important in vision & language research. The quality of vision in a dataset can be characterized in part by the variety of visual subjects and scenes provided, as well as the richness of the annotations (e.g., segmentation using bounding boxes (BB) or visual dependencies between boxes). Moreover, a vision corpus can use abstract or real images (Abs/Real).

The Available Datasets
We group a representative set of available datasets based on their content. For a complete list of datasets and their descriptions, please refer to the supplementary website.

Captioned Images
Several recent vision & language datasets provide one or multiple captions per image. The captions of these datasets are either the original photo title and descriptions provided by online users (Ordonez et al., 2011; Thomee et al., 2015), or the captions generated by crowd workers for existing images. The former datasets tend to be larger in size and contain more contextual descriptions.

User-generated Captions
• SBU Captioned Photo Dataset (Ordonez et al., 2011) contains 1 million images with original user-generated captions, collected in the wild by systematic querying of Flickr. This dataset was collected by querying Flickr for specific terms such as objects and actions and then filtering images with descriptions longer than a certain mean length.
• Déjà Images Dataset (Chen et al., 2015) consists of 180K unique user-generated captions associated with 4M Flickr images, where one caption is aligned with multiple images. This dataset was collected by querying Flickr for 693 high-frequency nouns; the captions were then further filtered to have at least one verb and to be judged as "good" captions by workers on Amazon's Mechanical Turk (Turkers).

Crowd-sourced Captions
• UIUC Pascal Dataset (Farhadi et al., 2010) is probably one of the first datasets aligning images with captions. The Pascal dataset contains 1,000 images with 5 sentences per image.
• Flickr 30K Images (Young et al., 2014) extends previous Flickr datasets (Rashtchian et al., 2010), and includes 158,915 crowd-sourced captions that describe 31,783 images of people involved in everyday activities and events.
• Microsoft COCO Dataset (MS COCO) (Lin et al., 2014) includes complex everyday scenes with common objects in naturally occurring contexts. Objects in the scene are labeled using per-instance segmentations. In total, this dataset contains photos of 91 basic object types with 2.5 million labeled instances in 328k images, each paired with 5 captions. This dataset gave rise to the CVPR 2015 image captioning challenge and is continuing to be a benchmark for comparing various aspects of vision and language research.
• Abstract Scenes Dataset (Clipart) (Zitnick et al., 2013) was created with the goal of representing real-world scenes with clipart to study scene semantics isolated from object recognition and segmentation issues in image processing. This removes the burden of low-level vision tasks. This dataset contains 10,020 images of children playing outdoors associated with a total of 60,396 descriptions.

Captions of Densely Labeled Images
Existing caption datasets provide images paired with captions, but such brief image descriptions capture only a subset of the content in each image. Measuring the magnitude of the reporting bias inherent in such descriptions helps us to understand the discrepancy between what we can learn for the specific task of image captioning versus what we can learn more generally from the photographs people take. One dataset useful to this end provides image annotation for content selection:
• Microsoft Research Dense Visual Annotation Corpus (Yatskar et al., 2014) provides a set of 500 images from the Flickr 8K dataset (Rashtchian et al., 2010) that are densely labeled with 100,000 textual labels, with bounding boxes and facets annotated for each object. This approximates "gold standard" visual recognition.
To get a rough estimate of the reporting bias in image captioning, we determined the percentage of top-level objects that are mentioned in the captions for this dataset out of all the objects that are annotated. Of the 8.04 top-level objects available per image on average, each caption reports only 2.7 on average. A more detailed analysis of reporting bias is beyond the scope of this paper, but we found that many of the biases (e.g., people selection) found with abstract scenes (Zitnick et al., 2013) are also present with photos.
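To make the estimate above concrete, the following sketch computes the fraction of annotated objects that a caption mentions. The data structure (one pair of sets per image), the function name `reporting_coverage`, and the toy examples are hypothetical illustrations rather than the format of the annotation corpus; matching caption words to annotated objects is assumed to have been done already.

```python
def reporting_coverage(images):
    """Fraction of annotated objects that are mentioned in captions.

    `images` is a list of (annotated_objects, caption_mentions) pairs,
    each element being a set of object names.
    """
    total_annotated = sum(len(annotated) for annotated, _ in images)
    total_mentioned = sum(len(annotated & mentioned) for annotated, mentioned in images)
    return total_mentioned / total_annotated


if __name__ == "__main__":
    # Toy data: two images with 4 and 6 annotated objects, 2 mentioned in each.
    toy = [
        ({"dog", "ball", "grass", "tree"}, {"dog", "ball"}),
        ({"man", "bike", "road", "car", "sky", "sign"}, {"man", "bike"}),
    ]
    print(f"{reporting_coverage(toy):.1%}")

    # The same ratio using the averages reported in the text.
    print(f"{2.7 / 8.04:.1%}")
```

With the averages reported above, a caption covers roughly a third (about 33.6%) of the annotated top-level objects.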

Video Description and Instruction
Video datasets aligned with descriptions (Chen et al., 2010; Rohrbach et al., 2012; Regneri et al., 2013; Naim et al., 2015; Malmaud et al., 2015) generally represent limited domains and small lexicons, because video processing and understanding are very compute-intensive. Available datasets include:
• Short Videos Described with Sentences (Yu and Siskind, 2013) . The descriptions are one-sentence summaries about the actions or events in the video as described by Amazon Turkers. In this dataset, both paraphrase and bilingual alternatives are captured; hence, the dataset can be useful for translation, paraphrasing, and video description purposes.

Beyond Visual Description
Recent work has demonstrated that n-gram language modeling paired with scene-level understanding of an image trained on large enough datasets can result in reasonable automatically generated captions (Fang et al., 2014; Donahue et al., 2014). Some works have proposed to step beyond description generation, towards deeper AI tasks such as question answering (Ren et al., 2015; Malinowski and Fritz, 2014). We present three of these attempts below:
• Visual Madlibs Dataset (VML) (Yu et al., 2015) is a subset of 10,783 images from the MS COCO dataset which aims to go beyond describing which objects are in the image. For a given image, three Amazon Turkers were prompted to complete one of 12 fill-in-the-blank template questions, such as 'when I look at this picture, I feel -', selected automatically based on the image content. This dataset contains a total of 360,001 MadLib questions and answers.
• Visual Question Answering (VQA) Dataset (Antol et al., 2015) is created for the task of open-ended VQA, where a system can be presented with an image and a free-form natural-language question (e.g., 'how many people are in the photo?'), and should be able to answer the question. This dataset contains both real images and abstract scenes, paired with questions and answers. The real images are 123,285 images from the MS COCO dataset; the abstract scenes are 10,000 clip-art scenes, made up from 20 'paperdoll' human models with adjustable limbs and over 100 objects and 31 animals. Amazon Turkers were prompted to create 'interesting' questions, resulting in 215,150 questions and 430,920 answers.
• Toronto COCO-QA Dataset (CQA) (Ren et al., 2015) is also a visual question answering dataset, where the questions are automatically generated from image captions of the MS COCO dataset. This dataset has a total of 123,287 images with 117,684 questions with one-word answers about objects, numbers, colors, or locations.

Analysis
We analyze the datasets introduced in Section 3 according to the metrics defined in Section 2, using the Stanford CoreNLP suite to acquire parses and part-of-speech tags (Manning et al., 2014). We also include the Brown corpus (Francis and Kucera, 1979; Marcus et al., 1999) as a reference point. To make perplexities comparable, we used the same vocabulary frequency cutoff of 3; all models are 5-grams. We find evidence that the VQA dataset captures more abstract concepts than other datasets, with almost 20% of the words found in our abstract concept resource. The Deja corpus has the fewest abstract concepts, followed by COCO and VDC. This reflects differences in collecting the various corpora: for example, the Deja corpus was collected to find specifically visual phrases that can be used to describe multiple images. This corpus also has the most syntactically simple phrases, as measured by both Frazier and Yngve; this is likely caused by the phrases needing to be general enough to capture multiple images.
The most syntactically complex sentences are found in the Flickr30K, COCO and CQA datasets. However, the CQA dataset suffers from a high perplexity against a background corpus relative to the other datasets, at odds with relatively short sentence lengths. This suggests that the automatic caption-to-question conversion may be creating unexpectedly complex sentences that are less reflective of general language usage. In contrast, the COCO and Flickr30K datasets' relatively high syntactic complexity is in line with their relatively high sentence length.
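As a rough illustration of the Yngve measurement referred to here, the sketch below assigns each word the sum, along its path from the root, of the number of right siblings of each node on that path, and then averages over the words of a sentence. The tree representation (nested tuples of the form `(label, child, ...)` with words as plain strings) and the function names are hypothetical; whether a sentence-level score is aggregated by mean, maximum, or sum is an assumption of this sketch and may differ from the scores reported in Table 1. The Frazier measurement is not covered.

```python
def yngve_depths(tree, depth=0):
    """Return the Yngve depth of each word, left to right.

    A tree is either a word (str) or a tuple (label, child1, child2, ...).
    Each child adds the number of siblings to its right to the running depth.
    """
    if isinstance(tree, str):
        return [depth]
    _label, *children = tree
    depths = []
    for i, child in enumerate(children):
        right_siblings = len(children) - 1 - i
        depths.extend(yngve_depths(child, depth + right_siblings))
    return depths


def mean_yngve(tree):
    """Mean per-word Yngve depth of one parsed sentence."""
    depths = yngve_depths(tree)
    return sum(depths) / len(depths)


if __name__ == "__main__":
    # (S (NP (DT the) (NN dog)) (VP (VBZ barks)))
    tree = ("S", ("NP", ("DT", "the"), ("NN", "dog")), ("VP", ("VBZ", "barks")))
    print(yngve_depths(tree))  # [2, 1, 0]
    print(mean_yngve(tree))    # 1.0
```

Left-branching material raises the score, since every node with pending right siblings adds to the depth of the words beneath it.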
Table 2 illustrates further similarities between datasets, and a more fine-grained use of perplexity to measure the usefulness of a given training set for predicting words of a given test set. Some datasets such as COCO, Flickr30K, and Clipart are generally more useful as out-of-domain data compared to the QA datasets. Test sets for VQA and CQA are quite idiosyncratic and yield poor perplexity unless trained on in-domain data. As shown in Figure 1, the COCO dataset is balanced across POS tags most similarly to the balanced Brown corpus (Marcus et al., 1999). The Clipart dataset provides the highest proportion of verbs, which often correspond to actions/poses in vision research, while the Flickr30K corpus provides the most nouns, which often correspond to object/stuff categories in vision research.
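The simplified part-of-speech distribution discussed above can be sketched as follows. We assume Penn Treebank-style tags, in which noun tags begin with "NN", verb tags with "VB", and adjective tags with "JJ"; the function names are hypothetical, and the exact treatment of borderline tags (e.g., modals, which fall into "O" here) is an assumption.

```python
from collections import Counter


def simplify_tag(tag):
    """Map a Penn Treebank-style POS tag to one of N, V, J, or O."""
    if tag.startswith("NN"):
        return "N"
    if tag.startswith("VB"):
        return "V"
    if tag.startswith("JJ"):
        return "J"
    return "O"


def pos_distribution(tags):
    """Relative frequency of each simplified tag in a sequence of POS tags."""
    counts = Counter(simplify_tag(t) for t in tags)
    total = sum(counts.values())
    if total == 0:
        return {k: 0.0 for k in "NVJO"}
    return {k: counts[k] / total for k in "NVJO"}


if __name__ == "__main__":
    # Tags for "the happy dog runs in the park"
    tags = ["DT", "JJ", "NN", "VBZ", "IN", "DT", "NN"]
    print(pos_distribution(tags))
```

Comparing such distributions across corpora gives the kind of shallow syntactic profile shown in Figure 1.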
We emphasize here that the distinction between a qualitatively good or bad dataset is task dependent. Therefore, these metrics and the results obtained give researchers an objective set of criteria for deciding whether a dataset is suitable for a particular task.

Conclusion
We detail the recent growth of vision & language corpora and compare and contrast several recently released large datasets. We argue that newly introduced corpora can be compared to similar datasets by measuring perplexity, syntactic complexity, and abstract:concrete word ratios, among other metrics. By leveraging such metrics and comparing across corpora, research can be sensitive to how datasets are biased in different directions, and define new corpora accordingly.

Figure 1 :
Simplified part-of-speech distributions for the eight datasets. We include the POS tags from the balanced Brown corpus (Marcus et al., 1999) to contextualize any very shallow syntactic biases. We mapped all nouns to "N," all verbs to "V," all adjectives to "J" and all other POS tags to "O."

Table 1 :
Summary of statistics and quality metrics of a sample set of major datasets. For Brown, we report Frazier and Yngve scores on automatically acquired parses, but we also compute them for the 24K sentences with gold parses: in this setting, the mean Frazier score is 15.26 while the mean Yngve score is 58.48.

Table 2 :
Perplexities across corpora, where rows represent test sets (20k sentences) and columns training sets (remaining sentences).
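The train-on-one-corpus, test-on-another setup behind this table can be sketched with a much simpler model than the one used in the paper. The sketch below is a deliberate simplification: it uses a bigram model with add-one smoothing instead of a 5-gram model, it interprets the vocabulary frequency cutoff as "keep words seen at least `cutoff` times" (an assumption), and all function names are hypothetical. It is meant only to show how a shared vocabulary and an unknown-word token make perplexities comparable across training sets.

```python
import math
from collections import Counter


def build_vocab(sentences, cutoff=3):
    """Keep words that occur at least `cutoff` times in the training sentences."""
    counts = Counter(w for sent in sentences for w in sent)
    return {w for w, c in counts.items() if c >= cutoff}


def _pad(sentence, vocab):
    """Replace out-of-vocabulary words with <unk> and add boundary markers."""
    return ["<s>"] + [w if w in vocab else "<unk>" for w in sentence] + ["</s>"]


def train_bigram(sentences, vocab):
    """Count bigrams and their left contexts over the training sentences."""
    contexts, bigrams = Counter(), Counter()
    for sent in sentences:
        toks = _pad(sent, vocab)
        contexts.update(toks[:-1])
        bigrams.update(zip(toks, toks[1:]))
    return contexts, bigrams


def perplexity(test_sentences, vocab, contexts, bigrams):
    """Per-token perplexity of the test sentences under an add-one bigram model."""
    n_events = len(vocab) + 2  # vocabulary words plus <unk> and </s>
    log_prob, n_tokens = 0.0, 0
    for sent in test_sentences:
        toks = _pad(sent, vocab)
        for prev, cur in zip(toks, toks[1:]):
            p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + n_events)
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)


if __name__ == "__main__":
    train = [["a", "dog", "runs"]] * 3
    vocab = build_vocab(train, cutoff=3)
    contexts, bigrams = train_bigram(train, vocab)
    print(perplexity([["a", "dog", "runs"]], vocab, contexts, bigrams))
    print(perplexity([["runs", "dog", "a"]], vocab, contexts, bigrams))
```

In the toy example, the test sentence that matches the training data receives a lower perplexity than the reordered one, mirroring the in-domain versus out-of-domain contrast in the table.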